Unit IV
RDBMS vs Hadoop
HADOOP OVERVIEW
Open-source software framework to store and process massive
amounts of data in a distributed fashion on large clusters of
commodity hardware.
Basically, Hadoop accomplishes two tasks:
1. Massive data storage.
2. Faster data processing.
Hadoop Components
Hadoop Conceptual Layer
It is conceptually divided into
1. Data Storage Layer: Stores huge volumes of data
2. Data Processing Layer: Processes data in parallel to
extract richer and meaningful insights from the data
High Level Architecture of Hadoop
Hadoop follows a distributed master-slave architecture.
The master node is known as the NameNode and the slave nodes
are known as DataNodes.
Key components of the Master Node
1. Master HDFS: Its main responsibility is partitioning the data
storage across the slave nodes. It also keeps track of the
locations of data on the DataNodes.
2. Master MapReduce: It decides and schedules computation tasks
on the slave nodes.
Hadoop Distributors
The process of creating a namespace in DFS is transparent to
the clients. DFS has two key components in its services:
location transparency and redundancy.
The features of DFS include:
Structure transparency: The client need not know about the
number or locations of file servers and the storage devices.
Multiple file servers should be provided for performance,
adaptability, and dependability.
Access transparency: Both local and remote files should be
accessible in the same manner. The file system should
automatically locate the accessed file and send it to the
client's side.
Naming transparency: There should not be any hint of the
file's location in its name. Once a name is given to the file,
it should not change when the file is transferred from one
node to another.
Replication transparency: If a file is copied on multiple
nodes, the copies of the file and their locations should be
hidden from the clients.
User mobility: It will automatically bring the user's home
directory to the node where the user logs in.
Performance: Performance is based on the average amount
of time needed to satisfy client requests.
Simplicity and ease of use: The user interface of a file
system should be simple and the number of commands
should be small.
Data integrity: Multiple users frequently share a file
system. The integrity of data saved in a shared file must be
guaranteed by the file system. That is, concurrent access
requests from many users who are competing for access to
the same file must be correctly synchronized using a
concurrency control method. Atomic transactions are a
high-level concurrency management mechanism for data
integrity that is frequently offered to users by a file system.
Security: A distributed file system should be secure so that
its users may trust that their data will be kept private. To
safeguard the information contained in the file system from
unwanted & unauthorized access, security mechanisms
must be implemented.
It enables large-scale storage and data management by
distributing the load across many machines, improving
performance, scalability, and reliability. Distributed file systems
are commonly used in cloud computing, big data platforms, and
environments where high availability and fault tolerance are
critical.
3. Modeled after the Google File System.
4. Optimized for high throughput (HDFS leverages a large block
size and moves computation to where the data is stored).
5. A file can be replicated a configured number of times, which
makes HDFS tolerant of both software and hardware failures.
6. Re-replicates data blocks automatically when nodes fail.
7. You realize the power of HDFS when you perform reads or
writes on large files (gigabytes and larger).
8. Sits on top of a native file system such as ext3 or ext4,
layered as follows:
HDFS
Native OS file system
Disk storage
HDFS Architecture
Apache Hadoop HDFS follows a master/slave architecture,
where a cluster comprises a single NameNode (master node)
and all the other nodes are DataNodes (slave nodes).
HDFS is a block-structured file system where each file is
divided into blocks of a pre-determined size. These blocks are
stored across a cluster of one or several machines.
HDFS can be deployed on a broad spectrum of machines that
support Java. Though one can run several DataNodes on a
single machine, in the practical world these DataNodes are
spread across various machines.
Storing data into HDFS
HDFS stores data in a reliable fashion using replication
and distribution. Here is the series of steps that happen when a
client writes a file in HDFS:
1. The client requests the NameNode to create the file. It
passes the size of the file as a parameter.
2. The NameNode responds with the locations of the nodes where
the client can store the data. By default there will be 3
locations per block. If the file size is 200 MB, there will
be 2 blocks: the first of 128 MB and the second of 72 MB.
Similarly, depending on the size, you will have n number of
blocks.
3. The client directly starts writing data to the first DataNode
out of the three given by the NameNode. Note that if there
are 2 blocks to be written, the client can start writing them
in parallel.
4. When the first DataNode has stored the block, it replies to
the client with success, and then it passes on the same block
to the 2nd DataNode. The 2nd DataNode writes this block and
passes it on to the 3rd DataNode.
5. So, writing of blocks from the client to the DataNodes
happens in parallel, but replication happens in series.
6. Blocks of the same file can go to different nodes; at the
very least, the replicated copies of a block will always be
on different nodes. The first copy is always on the DataNode
which is nearest to the client; the 2nd and 3rd copies are
stored based on the free capacity of the DataNodes and/or
rack awareness.
File Blocks:
Blocks are nothing but the smallest continuous locations on
your hard drive where data is stored. In general, in any file
system, you store the data as a collection of blocks. Similarly,
HDFS stores each file as blocks which are scattered throughout
the Apache Hadoop cluster. The default size of each block is
128 MB in Apache Hadoop 2.x (64 MB in Apache Hadoop 1.x), which
you can configure as per your requirement. All blocks of the
file are the same size except the last block, which can be
either the same size or smaller. The files are split into
128 MB blocks and then stored into the Hadoop file system. The
Hadoop application is responsible for distributing the data
blocks across multiple nodes.
For example, if a file of 514 MB is stored with the default
configuration, 5 blocks will be created. The first four blocks
will be of 128 MB each, but the last block will be of 2 MB only.
Data Replication
HDFS is designed to reliably store very large files across
machines in a large cluster. It stores each file as a sequence of
blocks; all blocks in a file except the last block are the same
size. The blocks of a file are replicated for fault tolerance. The
block size and replication factor are configurable per file. An
application can specify the number of replicas of a file. The
replication factor can be specified at file creation time and can
be changed later. Files in HDFS are write-once and have strictly
one writer at any time.
The NameNode makes all decisions regarding replication
of blocks. It periodically receives a Heartbeat and a Blockreport
from each of the DataNodes in the cluster. Receipt of a
Heartbeat implies that the DataNode is functioning properly. A
Blockreport contains a list of all blocks on a DataNode.
In the figure above there are two files, foo and bar, each with
two blocks, stored across three data nodes with a replication
factor of 2. The blocks are scattered such that if any one of
the data nodes goes down, all the file blocks are still
available on the remaining two data nodes. An image file known
as the FsImage contains data about the file blocks, including
their number and location. This file is kept on the name node,
and the name node always maintains an updated FsImage file. If
any of the data nodes goes down, the FsImage has to be updated;
the name node learns about the state of the data nodes through
the heartbeat which every data node sends to the name node every
3 seconds. If the name node does not receive a data node's
heartbeat, it assumes that the data node is down and the FsImage
file is updated accordingly.
What is Replication Management?
HDFS performs replication to provide fault tolerance and to
improve data reliability.
There could be situations where the data is lost in many
ways:
o a node is down,
o a node has lost network connectivity,
o a node is physically damaged, and
o a node is intentionally made unavailable for horizontal
scaling.
For any of the above-mentioned reasons, data would not be
available if replication were not done. HDFS usually
maintains 3 copies of each data block on different nodes
and different racks. By doing this, data is made available
even if one of the systems is down.
Downtime is reduced by replicating data. This improves
reliability and makes HDFS fault tolerant.
Block replication provides fault tolerance. If one copy is
not accessible or is corrupted, we can read the data from
another copy.
The number of copies or replicas of each block of a file in
the HDFS architecture is the replication factor. The default
replication factor is 3, which is again configurable. So,
each block is replicated three times and stored on different
DataNodes.
As you can see in the figure below, each block is replicated
three times and stored on different DataNodes (considering
the default replication factor). If we store a file of
128 MB in HDFS using the default configuration, we will end
up occupying a space of 384 MB (3 × 128 MB).
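Because the replication factor can be set per file, an application can adjust it through the Hadoop FileSystem API. The following is a minimal sketch; the path and the chosen factor of 4 are illustrative, not taken from the text above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // reads core-site.xml etc. from the classpath
        FileSystem fs = FileSystem.get(conf);           // the default filesystem (HDFS, if so configured)

        Path file = new Path("/user/demo/sample.txt");  // hypothetical file

        // Raise the replication factor of this one file from the default (3) to 4;
        // the namenode schedules the extra replica asynchronously.
        fs.setReplication(file, (short) 4);

        fs.close();
    }
}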
Rack Awareness in HDFS Architecture:
Rack: a collection of around 30-40 machines. All these
machines are connected using the same network switch, and
if that network switch goes down then all machines in that
rack will be out of service. Thus we say the rack is down.
Rack Awareness was introduced by Apache Hadoop to overcome
this issue. Rack awareness is the knowledge of how the data
nodes are distributed across the racks of a Hadoop cluster.
In a large Hadoop cluster, in order to reduce the network
traffic while reading/writing an HDFS file, the NameNode
chooses a DataNode that is on the same rack or a nearby
rack to serve the read/write request. The NameNode obtains
rack information by maintaining the rack IDs of each
DataNode. This concept of choosing DataNodes based on rack
information is called Rack Awareness in Hadoop.
In HDFS, the NameNode makes sure that all the replicas are
not stored on the same or a single rack; it follows the
Rack Awareness Algorithm to reduce latency as well as
improve fault tolerance.
Given that the default replication factor is 3, when a
client wants to place a file in HDFS, Hadoop places the
replicas as follows:
1) The first replica is written to the data node creating
the file, to improve the write performance because of
write affinity.
2) The second replica is written to another data node within
the same rack, to minimize cross-rack network traffic.
3) The third replica is written to a data node in a
different rack, ensuring that even if a switch or rack
fails, the data is not lost (rack awareness).
This configuration is maintained to make sure that the file
is never lost in case of a node failure or even an entire
rack failure.
Advantages of Rack Awareness:
Minimize the writing cost and maximize read speed: Rack
awareness places read/write requests to replicas on the
same or a nearby rack, thus minimizing the writing cost
and maximizing the reading speed.
Maximize network bandwidth and minimize latency: Rack
awareness maximizes network bandwidth by transferring
blocks within a rack. This is particularly beneficial in
cases where tasks cannot be assigned to nodes where their
data is stored locally.
Data protection against rack failure: By default, the
namenode assigns the 2nd and 3rd replicas of a block to
nodes in a rack different from the first replica. This
provides data protection even against rack failure.
Replica Selection
To minimize global bandwidth consumption and read
latency, HDFS tries to satisfy a read request from a replica that
is closest to the reader. If there exists a replica on the same rack
as the reader node, then that replica is preferred to satisfy the
read request.
Features of HDFS
The following are the main advantages of HDFS:
HDFS can be configured to create multiple replicas of a
particular file. If any one replica fails, the user can still
access the data from the other replicas. HDFS provides the
option to configure automatic failover in case of a failure,
so in case of any hardware failure or an error, the user can
get his data from another node where the data has been
replicated. HDFS also provides the facility to perform
software failover; this is similar to automatic failover, but
it is handled at the data provider level.
Horizontal scalability means that data stored on multiple
nodes still appears as a single file system, and capacity can
be grown by adding nodes. Vertical scalability means that the
capacity of the individual nodes themselves can be increased.
Data can be replicated to ensure data integrity; replication
is controlled through the replication factor rather than by
copying the data manually. HDFS can store up to 5 PB of data
in a single cluster and handles the load by automatically
choosing the best data node to store data on. Data can be
read/updated quickly as it is stored on multiple nodes, and
data stored on multiple nodes through replication increases
its reliability.
Data is stored on HDFS, not on the local filesystem of your
computer. In the event of a failure, the data is stored on a
separate server, and can be accessed by the application
running on your local computer. Data is replicated on
multiple servers to ensure that even in the event of a server
failure, your data is still accessible. Data can be accessed
via a client tool such as the Java client, the Python client, or
the CLI. Because data is accessible through a wide variety of
client tools, it is possible to access data from a wide variety
of programming languages.
file system.
3. This makes Hadoop a great fit for a wide range of data
applications. The most common one is analytics. You can
use Hadoop to process large amounts of data quickly, and
then analyze it to find trends or make recommendations.
The most common type of application that uses Hadoop
analytics is data crunching.
4. You can increase the capacity of the cluster by adding more
nodes or by enlarging the existing nodes. If you have many
clients whose data needs to be stored on HDFS, you can easily
scale your cluster horizontally by adding more nodes to the
cluster. To scale your cluster vertically, you can increase
the capacity of the individual nodes. Once the size of the
cluster is increased, it can serve more clients.
5. This can be done by setting up a centralized database, or by
distributing data across a cluster of commodity personal
computers, or a combination of both. The most common
setup for this type of virtualization is to create a virtual
machine on each of your servers.
6. Specialization reduces the overhead of data movement
across the cluster and provides high availability of data.
7. Automatic data replication can be accomplished with a
variety of technologies, including RAID, Hadoop, and
database replication. Logging data and monitoring it for
anomalies can also help to detect and respond to hardware
and software failures.
HDFS Daemons:
NameNode:
NameNode is the master node in the Apache Hadoop HDFS
architecture that maintains and manages the blocks present on
the DataNodes (slave nodes). It also manages the file system
namespace and regulates access to files by clients. The
NameNode is a very highly available server that manages the
file system namespace and controls access to files by clients.
Internally, a file is split into one or more blocks and these
blocks are stored in a set of DataNodes. The NameNode executes
file system namespace operations like opening, closing, and
renaming files and directories. It also determines the mapping
of blocks to DataNodes. The DataNodes are responsible for
serving read and write requests from the file system's clients.
The DataNodes also perform block creation, deletion, and
replication upon instruction from the NameNode. The HDFS
architecture is built in such a way that the user data never
resides on the NameNode: the NameNode contains metadata, and
the data resides on DataNodes only.
The NameNode and DataNode are pieces of software designed
to run on commodity machines. These machines typically run a
GNU/Linux operating system (OS). HDFS is built using the Java
language; any machine that supports Java can run the NameNode or
the DataNode software. Usage of the highly portable Java language
means that HDFS can be deployed on a wide range of machines. A
typical deployment has a dedicated machine that runs only the
NameNode software. Each of the other machines in the cluster runs
one instance of the DataNode software. The architecture does not
preclude running multiple DataNodes on the same machine but in a
real deployment that is rarely the case.
The existence of a single NameNode in a cluster greatly
simplifies the architecture of the system. The NameNode is the
arbitrator and repository for all HDFS metadata. The system is
designed in such a way that user data never flows through the
NameNode.
Functions of NameNode:
It is the master daemon that maintains and manages the
DataNodes (slave nodes).
The name node manages file-related operations such as read,
write, create and delete, and manages the file system
namespace.
The file system namespace is a collection of files in the
cluster. HDFS supports a traditional hierarchical file
organization. A user or an application can create directories
and store files inside these directories. The file system
namespace hierarchy is similar to most other existing file
systems; one can create and remove files, move a file from
one directory to another, or rename a file. HDFS does not
yet implement user quotas. HDFS does not support hard links
or soft links. However, the HDFS architecture does not
preclude implementing these features. Any change to the file
system namespace or its properties is recorded by the
NameNode. An application can specify the number of replicas
of a file that should be maintained by HDFS. The number of
copies of a file is called the replication factor of that
file. This information is stored by the NameNode.
It records the metadata of all the files stored in the cluster,
e.g. the location of stored blocks, the size of the files,
permissions, hierarchy, etc. There are two files associated
with the metadata:
FsImage: It contains the complete state of the file
system namespace since the start of the NameNode. The
file system namespace, including the mapping of blocks
to files and the file properties, is stored in a file
called the FsImage.
EditLogs: It contains all the recent modifications made
to the file system with respect to the most recent
FsImage. It records every transaction (change) that
happens to the file system metadata. For example, if a
file is deleted in HDFS, the NameNode will immediately
record this in the EditLog.
It regularly receives a Heartbeat and a block report from
all the DataNodes in the cluster to ensure that the
DataNodes are live.
It keeps a record of all the blocks in HDFS and of the
nodes on which these blocks are located.
The NameNode is also responsible for taking care of the
replication factor of all the blocks.
In case of a DataNode failure, the NameNode chooses new
DataNodes for new replicas, balances disk usage and manages
the communication traffic to the DataNodes.
DataNode:
DataNodes are the slave nodes in HDFS. Unlike the NameNode,
a DataNode is commodity hardware, that is, an inexpensive
system which is not of high quality or high availability.
The DataNode is a block server that stores the data in the
local file system, such as ext3 or ext4.
Functions of DataNode:
The actual data is stored on DataNodes.
DataNodes perform read-write operations on the file
systems, as per client request.
They also perform operations such as block creation,
deletion, and replication according to the instructions of
the namenode.
They send heartbeats to the NameNode periodically to report
the overall health of HDFS; by default, this frequency is
set to 3 seconds.
Secondary NameNode:
It is a separate physical machine which acts as a helper of
the name node. It performs periodic checkpoints. It
communicates with the name node and takes snapshots of the
metadata, which helps minimize downtime and loss of data.
The Secondary NameNode works concurrently with the primary
NameNode as a helper daemon.
Functions of Secondary NameNode:
The Secondary NameNode constantly reads all the file systems
and metadata from the RAM of the NameNode and writes it into
the hard disk or the file system.
It is responsible for combining the EditLogs with the
FsImage from the NameNode.
It downloads the EditLogs from the NameNode at regular
intervals and applies them to the FsImage. The new FsImage
is copied back to the NameNode, which uses it whenever the
NameNode is started the next time.
Hence, the Secondary NameNode performs regular checkpoints
in HDFS. Therefore, it is also called the CheckpointNode.
Hadoop Filesystems
Hadoop has an abstract notion of filesystems, of which
HDFS is just one implementation. The Java abstract class
org.apache.hadoop.fs.FileSystem represents the client interface
to a filesystem in Hadoop, and there are several concrete
implementations.
The main ones that ship with Hadoop are described in Table
For example, to list the files under the /tmp directory of the
filesystem, type:
hadoop fs -ls /tmp
A file in a Hadoop filesystem is represented by a Hadoop
Path object (and not a java.io.File object, since its semantics are
too closely tied to the local filesystem).
You can think of a Path as a Hadoop filesystem URI, such as
hdfs://localhost/user/tom/quangle.txt.
FileSystem is a general filesystem API, so the first step is
to retrieve an instance for the filesystem we want to
use—HDFS, in this case.
A Configuration object encapsulates a client or server’s
configuration, which is set using configuration files read from
the classpath, such as etc/hadoop/core-site.xml.
There are several static factory methods for getting a
FileSystem instance:
public static FileSystem get(Configuration conf)
throws IOException
returns the default filesystem (as specified in core-
site.xml, or the default local filesystem if not specified
there)
public static FileSystem get(URI uri, Configuration conf)
throws IOException
uses the given URI's scheme and authority to
determine the filesystem to use, falling back to the
default filesystem if no scheme is specified in the given
URI.
public static FileSystem get(URI uri, Configuration conf,
String user) throws IOException
retrieves the filesystem as the given user, which
is important in the context of security
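As a minimal sketch of how these factory methods are used in practice (the URI and directory below are illustrative, not taken from the text):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml from the classpath

        // Explicit URI form: the scheme and authority select the filesystem implementation.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost/"), conf);

        // List the entries under a hypothetical directory.
        for (FileStatus status : fs.listStatus(new Path("/user/tom"))) {
            System.out.println(status.getPath() + "\t" + status.getLen());
        }
    }
}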
Data Flow
Anatomy of a File Read
To get an idea of how data flows between the client
interacting with HDFS, the namenode, and the datanodes,
consider Figure below, which shows the main sequence of
events when reading a file.
Step 1: The client opens the file it wishes to read by
calling open() on the
FileSystem object, which for HDFS is an instance
of DistributedFileSystem
Step 2: DistributedFileSystem calls the namenode, using
remote procedure
calls (RPCs), to determine the locations of the first
few blocks in the file. For each block, the namenode
returns the addresses of the datanodes that have a
copy of that block. Furthermore, the datanodes are
sorted according to their proximity to the client. If
the client is itself a datanode (in the case of a
MapReduce task, for instance), the client will read
from the local datanode if that datanode hosts a copy
of the block.
The DistributedFileSystem returns an
FSDataInputStream (an input
stream that supports file seeks) to the client for it to
read data from. FSDataInputStream in turn wraps a
DFSInputStream, which manages the datanode and
namenode I/O.
Step 3: The client then calls read() on the stream.
DFSInputStream, which
has stored the datanode addresses for the first few
blocks in the file,
then connects to the first (closest) datanode for the
first block in the
file.
Step 4: Data is streamed from the datanode back to the
client, which calls
read() repeatedly on the stream.
Step 5: When the end of the block is reached,
DFSInputStream will close
the connection to the datanode, then find the best
datanode for thenext block. This happens
transparently to the client, which from its point of
view is just reading a continuous stream. Blocks are
read in order, with the DFSInputStream opening
new connections to datanodes as the client reads
through the stream. It will also call the namenode to
retrieve the datanode locations for the next batch of
blocks as needed.
Step 6: When the client has finished reading, it calls
close() on the
FSDataInputStream.
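The read path above corresponds to only a few lines of client code. The sketch below is illustrative (the path is hypothetical): it opens a file, streams it to standard output, and closes the stream.

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        InputStream in = null;
        try {
            // Step 1: open() on the FileSystem (DistributedFileSystem for HDFS)
            in = fs.open(new Path("/user/tom/quangle.txt"));
            // Steps 3-5: read() streams data from the closest datanode for each block
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            // Step 6: close the stream when finished
            IOUtils.closeStream(in);
        }
    }
}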
The namenode merely has to service block location requests
(which it stores in memory, making them very efficient) and
does not, for example, serve data, which would quickly become
a bottleneck as the number of clients grew.
The idea is that the bandwidth available for each of the
following scenarios becomes progressively less:
• Processes on the same node
• Different nodes on the same rack
• Nodes on different racks in the same data center
• Nodes in different data centers
For example, imagine a node n1 on rack r1 in data center
d1. This can be represented as /d1/r1/n1. Using this notation,
here are the distances for the four scenarios:
• distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same
node)
• distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the
same rack)
• distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different
racks in the same data center)
• distance(/d1/r1/n1, /d2/r3/n4) = 6 (nodes in different data
centers)
This is illustrated schematically in Figure 3-3.
(Mathematically inclined readers will notice that this is an
example of a distance metric.)
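To make the distance calculation concrete, here is a small illustrative sketch (not Hadoop's actual implementation) that computes this measure for locations written as /datacenter/rack/node:

public class TopologyDistance {

    // Distance = levels from a plus levels from b up to their closest common ancestor.
    static int distance(String a, String b) {
        String[] pa = a.substring(1).split("/");   // e.g. "/d1/r1/n1" -> [d1, r1, n1]
        String[] pb = b.substring(1).split("/");
        int depth = pa.length;                     // assumes both paths have the same depth
        int common = 0;
        while (common < depth && pa[common].equals(pb[common])) {
            common++;
        }
        return (depth - common) * 2;
    }

    public static void main(String[] args) {
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // 0: same node
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // 2: same rack
        System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // 4: same data center
        System.out.println(distance("/d1/r1/n1", "/d2/r3/n4")); // 6: different data centers
    }
}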
Finally, it is important to realize that Hadoop cannot magically
discover your network topology for you; it needs some help. By
default, though, it assumes that the network is flat—a single
level hierarchy—or in other words, that all nodes are on a single
rack in a single data center. For small clusters, this may actually
be the case, and no further configuration is required
Anatomy of a File Write
Step 1: The client creates the file by calling create() on
DistributedFileSystem.
Step 2: DistributedFileSystem makes an RPC call to the
namenode to create
a new file in the filesystem’s namespace, with no
blocks associated with it. The namenode performs
various checks to make sure the file doesn’t already
exist and that the client has the right permissions to
create the file. If these checks pass, the namenode
makes a record of the new file; otherwise, file creation
fails and the client is thrown an IOException. The
DistributedFileSystem returns an
FSDataOutputStream for the client to start writing
data to. Just as in the read case, FSDataOutputStream
wraps a DFSOutputStream, which handles
communication with the datanodes and namenode.
Step 3: As the client writes data, the DFSOutputStream
splits it into packets,
which it writes to an internal queue called the data
queue. The data queue is consumed by the
DataStreamer, which is responsible for asking the
namenode to allocate new blocks by picking a list of
suitable datanodes to store the replicas. The list of
datanodes forms a pipeline, and here we’ll assume
the replication level is three, so there are three nodes
in the pipeline.
Step 4: The DataStreamer streams the packets to the first
datanode in the
pipeline, which stores each packet and forwards it
to the second datanode in the pipeline. Similarly,
the second datanode stores the packet and forwards
it to the third (and last) datanode in the pipeline.
Step 5: The DFSOutputStream also maintains an internal
queue of packets
that are waiting to be acknowledged by datanodes,
called the ack queue. A packet is removed from the
ack queue only when it has been acknowledged by
all the datanodes in the pipeline. If any datanode
fails while data is being written to it, then the
following actions are taken, which are transparent
to the client writing the data. First, the pipeline is
closed, and any packets in the ack queue are added
to the front of the data queue so that datanodes that
are downstream from the failed node will not miss
any packets. The current block on the good
datanodes is given a new identity, which is
communicated to the namenode, so that the partial
block on the failed datanode will be deleted if the
failed datanode recovers later on. The failed
datanode is removed from the pipeline, and a new
pipeline is constructed from the two good
datanodes. The remainder of the block’s data is
written to the good datanodes in the pipeline. The
namenode notices that the block is under-replicated,
and it arranges for a further replica to be created on
another node. Subsequent blocks are then treated as
normal. It’s possible, but unlikely, for multiple
datanodes to fail while a block is being written. As
long as dfs.namenode.replication.min replicas
(which defaults to 1) are written, the write will
succeed, and the block will be asynchronously
replicated across the cluster until its target
replication factor is reached (dfs.replication, which
defaults to 3).
Step 6: When the client has finished writing data, it calls
close() on the stream.
Step 7: This action flushes all the remaining packets to the
datanode pipeline
and waits for acknowledgments before contacting the
namenode to signal that the file is complete. The
namenode already knows which blocks the file is
made up of (because Data Streamer asks for block
allocations), so it only has to wait for blocks to be
minimally replicated before returning successfully.
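A minimal client-side sketch of this write path (the path and contents are illustrative): create() returns an FSDataOutputStream, the client writes to it, and close() flushes the remaining packets and waits for the acknowledgments described above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Step 1: create() on the FileSystem (DistributedFileSystem for HDFS)
        FSDataOutputStream out = fs.create(new Path("/user/demo/output.txt"));

        // Steps 3-5: data is split into packets and pushed down the datanode pipeline
        out.writeUTF("hello hdfs");

        // Steps 6-7: flush remaining packets and signal completion to the namenode
        out.close();
    }
}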
Replica Placement
How does the namenode choose which datanodes to store
replicas on?
There’s a trade off between reliability and write bandwidth
and read bandwidth here. For example, placing all replicas on a
single node incurs the lowest write bandwidth penalty (since the
replication pipeline runs on a single node), but this offers no real
redundancy (if the node fails, the data for that block is lost).
Also, the read bandwidth is high for off-rack reads. At the other
extreme, placing replicas in different data centers may maximize
redundancy, but at the cost of bandwidth. Even in the same data
center (which is what all Hadoop clusters to date have run in),
there are a variety of possible placement strategies.
Hadoop’s default strategy is to
place the first replica on the same node as the client (for
clients running outside the cluster, a node is chosen at random,
although the system tries not to pick nodes that are too full or
too busy).
The second replica is placed on a different rack from the
first (off-rack), chosen at random.
The third replica is placed on the same rack as the second,
but on a different node chosen at random.
Further replicas are placed on random nodes in the cluster,
although the system tries to avoid placing too many replicas on
the same rack. Once the replica locations have been chosen, a
pipeline is built, taking network topology into account. For a
replication factor of 3, the pipeline might look like in Figure
below
Overall, this strategy gives a good balance among reliability
(blocks are stored on two racks), write bandwidth (writes only
traverse a single network switch), read performance (there's a
choice of two racks to read from), and block distribution across
the cluster (clients only write a single block on the local
rack).
5. To display contents in a file:
Syntax: hdfs dfs -cat <path>
Eg. hdfs dfs -cat /chp/abc1.txt
Eg. hdfs dfs -count /chp
o/p: 1 1 60
10. Remove a directory from HDFS:
Syntax: hdfs dfs -rmr <path>
Eg. hdfs dfs -rmr /chp
Hadoop Configuration
Hadoop must have its configuration set appropriately to run
in distributed mode on a cluster.
There are a handful of files for controlling the
configuration of a Hadoop installation; the most important ones
are listed in Table
These files are all found in the etc/hadoop directory of the
Hadoop distribution.
The configuration directory can be relocated to another part
of the filesystem (outside the Hadoop installation, which makes
upgrades marginally easier) as long as daemons are started with
the --config option (or, equivalently, with the
HADOOP_CONF_DIR environment variable set) specifying the
location of this directory on the local filesystem.
Configuration Management
Hadoop does not have a single, global location for
configuration information. Instead, each Hadoop node in the
cluster has its own set of configuration files, and it is up to
administrators to ensure that they are kept in sync across the
system, for example by using parallel shell tools such as dsh
or pdsh. This is
an area where Hadoop cluster management tools like Cloudera
Manager and Apache Ambari really shine, since they take care
of propagating changes across the cluster.
Hadoop is designed so that it is possible to have a single set
of configuration files that are used for all master and worker
machines. The great advantage of this is simplicity, both
conceptually (since there is only one configuration to deal with)
and operationally (as the Hadoop scripts are sufficient to
manage a single configuration setup). For some clusters, the
one-size-fits-all configuration model breaks down. For example,
if you expand the cluster with new machines that have a
different hardware specification from the existing ones, you
need a different configuration for the new machines to take
advantage of their extra resources. In these cases, you need to
have the concept of a class of machine and maintain a separate
configuration for each class. Hadoop doesn’t provide tools to do
this, but there are several excellent tools for doing precisely this
type of configuration management, such as Chef, Puppet,
CFEngine, and Bcfg2. For a cluster of any size, it can be a
challenge to keep all of the machines in sync.
Consider what happens if the machine is unavailable when
you push out an update. Who ensures it gets the update when it
becomes available? This is a big problem and can lead to
divergent installations, so even if you use the Hadoop control
scripts for managing Hadoop, it may be a good idea to use
configuration management tools for maintaining the cluster.
These tools are also excellent for doing regular maintenance,
such as patching security holes and updating system packages
Environment Settings
We consider how to set the variables in hadoop-env.sh.
There are also analogous configuration files for
MapReduce and YARN (but not for HDFS), called mapred-env.sh
and yarn-env.sh, where variables pertaining to those
components can be set.
Note that the MapReduce and YARN files override the
values set in
hadoop-env.sh.
Java
The location of the Java implementation to use is
determined by the JAVA_HOME setting in hadoop-env.sh or
the JAVA_HOME shell environment variable, if not set in
hadoop-env.sh.
It’s a good idea to set the value in hadoop-env.sh, so that it
is clearly defined in one place and to ensure that the whole
cluster is using the same version of Java.
You can set YARN_RESOURCEMANAGER_HEAPSIZE in yarn-env.sh to
override the heap size for the resource manager.
Surprisingly, there are no corresponding environment
variables for HDFS daemons, despite it being very common to
give the namenode more heap space.
In addition to the memory requirements of the daemons,
the node manager allocates containers to applications, so we
need to factor these into the total memory footprint of a worker
machine;
For example, a 200-node cluster with 24 TB of disk space per
node, a block size of 128 MB, and a replication factor of 3 has
room for about 12 million blocks: 200 × 24,000,000 MB ⁄
(128 MB × 3) ≈ 12.5 million. So in this case, setting the
namenode memory to 12,000 MB would be a good starting
point. You can increase the namenode’s memory without
changing the memory allocated to other Hadoop daemons by
setting HADOOP_NAMENODE_OPTS in hadoop-env.sh to
include a JVM option for setting the memory size.
HADOOP_NAMENODE_OPTS allows you to pass extra
options to the namenode’s JVM. So, for example, if you were
using a Sun JVM, -Xmx2000m would specify that 2,000 MB of
memory should be allocated to the namenode.
If you change the namenode’s memory allocation, don’t
forget to do the same for the secondary namenode (using the
HADOOP_SECONDARYNAMENODE_OPTS variable),
since its memory requirements are comparable to the primary
namenode’s.
System logfiles
System logfiles produced by Hadoop are stored in
$HADOOP_HOME/logs by default.
This can be changed using the HADOOP_LOG_DIR
setting in hadoop-env.sh. It’s a good idea to change this so that
logfiles are kept out of the directory that Hadoop is installed in.
Changing this keeps logfiles in one place, even after the
installation directory changes due to an upgrade. A common
choice is /var/log/hadoop, set by including the following line in
hadoop-env.sh:
export HADOOP_LOG_DIR=/var/log/hadoop
The log directory will be created if it doesn’t already exist.
(If it does not exist, confirm that the relevant Unix Hadoop user
has permission to create it.)
Each Hadoop daemon running on a machine produces two
logfiles.
The first is the log output written via log4j. This file, whose
name ends in .log, should be the first port of call when
diagnosing problems because most application log
messages are written here. The standard Hadoop log4j
configuration uses a daily rolling file appender to rotate
logfiles. Old logfiles are never deleted, so you should
arrange for them to be periodically deleted or archived, so
as to not run out of disk space on the local node.
The second logfile is the combined standard output and
standard error log. This logfile, whose name ends in .out,
usually contains little or no output, since Hadoop uses log4j
for logging. It is rotated only when the daemon is restarted,
and only the last five logs are retained. Old logfiles are
suffixed with a number between 1 and 5, with 5 being the
oldest file.
If you wish to give the Hadoop instance a different identity for the
purposes of naming the logfiles, change
HADOOP_IDENT_STRING to be the identifier you want.
SSH settings
The control scripts allow you to run commands on (remote)
worker nodes from the master node using SSH. It can be useful
to customize the SSH settings, for various reasons. For example,
you may want to reduce the connection timeout (using the
ConnectTimeout option) so the control scripts don’t hang around
waiting to see whether a dead node is going to respond.
Obviously, this can be taken too far. If the timeout is too low,
then busy nodes will be skipped, which is bad.
Another useful SSH setting is StrictHostKeyChecking,
which can be set to no to automatically add new host keys to the
known hosts files. The default, ask, prompts the user to confirm
that the key fingerprint has been verified, which is not a suitable
setting in a large cluster environment. To pass extra options to
SSH, define the HADOOP_SSH_OPTS environment variable
in hadoop-env.sh.
Example 10-3. A typical yarn-site.xml configuration file
HDFS
To run HDFS, you need to designate one machine as a
namenode. In this case, the property fs.defaultFS is an HDFS
filesystem URI whose host is the namenode’s host name or IP
address and whose port is the port that the namenode will listen
on for RPCs. If no port is specified, the default of 8020 is used.
The fs.defaultFS property also doubles as specifying the
default filesystem. The default filesystem is used to resolve
relative paths, which are handy to use because they save typing
(and avoid hardcoding knowledge of a particular namenode’s
address). For example, with the default filesystem defined in
Example 10-1, the relative URI /a/b is resolved to
hdfs://namenode/a/b. There are a few other configuration
properties you should set for HDFS: those that set the storage
directories for the namenode and for datanodes.
The property dfs.namenode.name.dir specifies a list of
directories where the namenode stores persistent filesystem
metadata (the edit log and the filesystem image).
A copy of each metadata file is stored in each directory for
redundancy. It’s common to configure dfs.namenode.name.dir
so that the namenode metadata is written to one or two local
disks, as well as a remote disk, such as an NFS-mounted
directory. Such a setup guards against failure of a local disk and
failure of the entire namenode, since in both cases the files can
be recovered and used to start a new namenode. (The secondary
namenode takes only periodic checkpoints of the namenode, so
it does not provide an up-to-date backup of the namenode.)
You should also set the dfs.datanode.data.dir property,
which specifies a list of directories for a datanode to store its
blocks in. Unlike the namenode, which uses multiple directories
for redundancy, a datanode round-robins writes between its
storage directories, so for performance you should specify a
storage directory for each local disk. Read performance also
benefits from having multiple disks for storage, because blocks
will be spread across them and concurrent reads for distinct
blocks will be correspondingly spread across disks.
Finally, you should configure where the secondary
namenode stores its checkpoints of the filesystem. The
dfs.namenode.checkpoint.dir property specifies a list of
directories where the checkpoints are kept. Like the storage
directories for the namenode, which keep redundant copies of
the namenode metadata, the checkpointed filesystem image is
stored in each checkpoint directory for redundancy. Table 10-2
summarizes the important configuration properties for HDFS.
Table 10-2. Important HDFS daemon properties
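For reference, the same HDFS properties can also be set programmatically on a Configuration object. The sketch below is purely illustrative (the hostname and directories are made up); in a real deployment these values normally live in core-site.xml and hdfs-site.xml.

import org.apache.hadoop.conf.Configuration;

public class HdfsConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Default filesystem: namenode host (the RPC port defaults to 8020).
        conf.set("fs.defaultFS", "hdfs://namenode/");

        // Namenode metadata directories: a local disk plus an NFS mount for redundancy.
        conf.set("dfs.namenode.name.dir", "/disk1/hdfs/name,/remote/hdfs/name");

        // Datanode block storage: one directory per local disk, written round-robin.
        conf.set("dfs.datanode.data.dir", "/disk1/hdfs/data,/disk2/hdfs/data");

        // Secondary namenode checkpoint directories.
        conf.set("dfs.namenode.checkpoint.dir", "/disk1/hdfs/namesecondary");

        System.out.println(conf.get("fs.defaultFS"));
    }
}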
YARN
To run YARN, you need to designate one machine as a
resource manager. The simplest way to do this is to set the
property yarn.resourcemanager.hostname to the hostname or
IP address of the machine running the resource manager. Many
of the resource manager’s server addresses are derived from this
property. For example, yarn.resourcemanager.address takes
the form of a host-port pair, and the host defaults to
yarn.resourcemanager.hostname.
In a MapReduce client configuration, this property is used
to connect to the resource manager over RPC. During a
MapReduce job, intermediate data and working files are written
to temporary local files. Because this data includes the
potentially very large output of map tasks, you need to ensure
that the yarn.nodemanager.local-dirs property, which controls
the location of local temporary storage for YARN containers, is
configured to use disk partitions that are large enough. The
property takes a comma-separated list of directory names, and
you should use all available local disks to spread disk I/O (the
directories are used in round-robin fashion).
Typically, you will use the same disks and partitions (but
different directories) for YARN local storage as you use for
datanode block storage, as governed by the
dfs.datanode.data.dir property.
Unlike MapReduce 1, YARN doesn’t have tasktrackers to
serve map outputs to reduce tasks, so for this function it relies
on shuffle handlers, which are long-running auxiliary services
running in node managers. Because YARN is a general-purpose
service, the MapReduce shuffle handlers need to be enabled
explicitly in yarn-site.xml by setting the
yarn.nodemanager.aux-services property to
mapreduce_shuffle. Table 10-3 summarizes the important
configuration properties for YARN. Table 10-3. Important
YARN daemon properties
Memory settings in YARN and MapReduce
YARN treats memory in a more fine-grained manner than
the slot-based model used in MapReduce 1. Rather than
specifying a fixed maximum number of map and reduce slots
that may run on a node at once, YARN allows applications to
request an arbitrary amount of memory (within limits) for a task.
In the YARN model, node managers allocate memory from a
pool, so the number of tasks that are running on a particular
node depends on the sum of their memory requirements, and not
simply on a fixed number of slots. The calculation for how much
memory to dedicate to a node manager for running containers
depends on the amount of physical memory on the machine.
Each Hadoop daemon uses 1,000 MB, so for a datanode and a
node manager, the total is 2,000 MB. Set aside enough for other
processes that are running on the machine, and the remainder
can be dedicated to the node manager’s containers by setting the
configuration property yarn.nodemanager.resource.memory-
mb to the total allocation in MB. (The default is 8,192 MB,
which is normally too low for most setups.).
The next step is to determine how to set memory options
for individual jobs. There are two main controls: one for the size
of the container allocated by YARN, and another for the heap
size of the Java process run in the container. Container sizes are
determined by mapreduce.map.memory.mb and
mapreduce.reduce.memory.mb; both default to 1,024 MB.
These settings are used by the application master when
negotiating for resources in the cluster, and also by the node
manager, which runs and monitors the task containers. The heap
size of the Java process is set by mapred.child.java.opts, and
defaults to 200 MB. You can also set the Java options separately
for map and reduce tasks (see Table 10-4).
Table 10-4. MapReduce job memory properties (set by the
client)
Note that the JVM process will have a larger memory
footprint than the heap size, and the overhead will depend on
such things as the native libraries that are in use, the size of the
permanent generation space, and so on. The important thing is
that the physical memory used by the JVM process, including
any processes that it spawns, such as Streaming processes, does
not exceed its allocation (1,024 MB). If a container uses more
memory than it has been allocated, then it may be terminated by
the node manager and marked as failed. YARN schedulers
impose a minimum or maximum on memory allocations. The
default minimum is 1,024 MB (set by
yarn.scheduler.minimum-allocation-mb), and the default
maximum is 8,192 MB (set by yarn.scheduler.maximum-
allocation-mb). There are also virtual memory constraints that a
container must meet. If a container’s virtual memory usage
exceeds a given multiple of the allocated physical memory, the
node manager may terminate the process. The multiple is
expressed by the yarn.nodemanager.vmem-pmem-ratio
property, which defaults to 2.1. In the example used earlier, the
virtual memory threshold above which the task may be
terminated is 2,150 MB, which is 2.1 × 1,024 MB. When
configuring memory parameters it’s very useful to be able to
monitor a task’s actual memory usage during a job run, and this
is possible via MapReduce task counters. The counters
PHYSICAL_MEMORY_BYTES,
VIRTUAL_MEMORY_BYTES, and
COMMITTED_HEAP_BYTES provide snapshot values of
memory usage and are therefore suitable for observation during
the course of a task attempt. Hadoop also provides settings to
control how much memory is used for MapReduce operations.
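A small illustrative sketch (the values are examples, not recommendations) of setting these per-job memory controls from a MapReduce driver:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MemoryConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Container sizes requested from YARN (defaults: 1,024 MB each).
        conf.setInt("mapreduce.map.memory.mb", 2048);
        conf.setInt("mapreduce.reduce.memory.mb", 4096);

        // Separate Java heap options for map and reduce tasks; keep the heap comfortably
        // below the container size so off-heap overhead does not exceed the allocation.
        conf.set("mapreduce.map.java.opts", "-Xmx1536m");
        conf.set("mapreduce.reduce.java.opts", "-Xmx3072m");

        Job job = Job.getInstance(conf, "memory-config-sketch");
        // ... set the mapper, reducer, input and output paths here ...
    }
}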
In addition to memory, the number of CPU cores allocated to map
and reduce containers can be controlled by setting
mapreduce.map.cpu.vcores and
mapreduce.reduce.cpu.vcores. Both default to 1, an
appropriate setting for normal single-threaded MapReduce tasks,
which can only saturate a single core.
Hadoop daemons generally run both an RPC server for
communication between daemons and an HTTP server to provide web
pages for human consumption (Table 10-6).
Setting a server's port to 0 instructs it to start on a free
port, but this is generally discouraged because it is
incompatible with setting cluster-wide firewall policies. In
general, the properties for setting a server's RPC and
HTTP addresses serve double duty: they determine the network
interface that the server will bind to, and they are used by clients
or other machines in the cluster to connect to the server. For
example, node managers use the
yarn.resourcemanager.resource-tracker.address property to
find the address of their resource manager. It is often desirable
for servers to bind to multiple network interfaces, but setting the
network address to 0.0.0.0, which works for the server, breaks
the second case, since the address is not resolvable by clients or
other machines in the cluster. One solution is to have separate
configurations for clients and servers, but a better way is to set
the bind host for the server. By setting
yarn.resourcemanager.hostname to the (externally
resolvable) hostname or IP address and
yarn.resourcemanager.bind-host to 0.0.0.0, you ensure that
the resource manager will bind to all addresses on the machine,
while at the same time providing a resolvable address for node
managers and clients. In addition to an RPC server, datanodes
run a TCP/IP server for block transfers. The server address and
port are set by the dfs.datanode.address property, which has a
default value of 0.0.0.0:50010.
There is also a setting for controlling which network interfaces
the datanodes use as their IP addresses (for HTTP and RPC
servers). The relevant property is dfs.datanode.dns.interface,
which is set to default to use the default network interface. You
can set this explicitly to report the address of a particular
interface (eth0, for example)
MapReduce Framework
Hadoop MapReduce is a software framework that
makes it simple to write programs that process enormous
volumes of data (multi-terabyte data sets) in parallel
on large clusters (thousands of nodes) of commodity
hardware in a reliable, fault-tolerant way.
A MapReduce job divides the input data set into
separate chunks, which are then processed in parallel by the
map tasks. The framework sorts the outputs of the maps, which
are then fed to the reduce tasks.
In most cases, the job’s input and output are stored in
a file system.
The framework takes care of scheduling tasks, monitoring
them, and re-executing failed tasks.
The output of the Mapper phase will also be in the
key-value format as (k’, v’).
3. Reducer
The output of the Shuffle and Sort phase (k, v[]) will be the
input of the Reducer phase.
In this phase reducer function’s logic is executed and all the
values are aggregated against their corresponding keys.
Reducer consolidates outputs of various mappers and
computes the final job output.
The final output is then written into a single file in an
output directory of HDFS.
4. Combiner
It is an optional phase in the MapReduce model.
The combiner phase is used to optimize the
performance of MapReduce jobs.
In this phase, various outputs of the mappers are
locally reduced at the node level.
For example, if different mapper outputs (k, v) coming
from a single node contains duplicates, then they get
combined i.e. locally reduced as a single (k, v[]) output.
This phase makes the Shuffle and Sort phase work even
quicker thereby enabling additional performance in
MapReduce jobs.
For example, MapReduce logic to find the word count on
an array of words can be shown as below:
fruits_array = [apple, orange, apple, guava, grapes,
orange, apple]
The mapper phase tokenizes the input array into individual
words and emits each as a (k, v) pair. For example,
consider 'apple': the mapper output will be (apple, 1),
(apple, 1), (apple, 1).
Shuffle and Sort accept the mapper (k, v) output and group
all values according to their keys as (k, v[]). i.e. (apple,
[1, 1, 1]).
The Reducer phase accepts Shuffle and sort output and
gives the aggregate of the values (apple, [1+1+1]),
corresponding to their keys. i.e. (apple, 3).
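The same word-count logic expressed in the Hadoop MapReduce Java API looks roughly as follows (a minimal sketch; the class names and whitespace-based tokenization are illustrative choices):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Tokenize the input line and emit (word, 1) for every token, e.g. (apple, 1).
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Values arrive grouped by key, e.g. (apple, [1, 1, 1]); sum them to get (apple, 3).
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}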
Speculative Execution of MapReduce Work
The speed of a MapReduce job is dominated by its slowest task.
So, to improve the speed, a duplicate copy of a slow-running
task is launched on another node to work on the same data at
the same time. Whichever copy completes first is taken as the
final output, and the other one is killed.
It is an optimization technique.
MRUnit
MRUnit is a JUnit-based Java library to facilitate unit
testing of Hadoop MapReduce jobs by providing drivers
and mock objects to simulate the Hadoop runtime
environment of a MapReduce job.
This makes it easy to develop as well as to maintain
Hadoop MapReduce code bases. This code base already exists
as a subproject of the Apache Hadoop MapReduce project and
lives in the "contrib" directory of the source tree.
MRUnit supports testing Mappers and Reducers separately
as well as testing MapReduce computations as a whole.
With MRUnit, you can
craft test input
push it through your mapper and/or reducer
verify its output all in a JUnit test
Currently, partitioners do not have a test driver under
MRUnit.
MRUnit allows you to do TDD (Test Driven
Development) and write light weight unit tests which
accommodate Hadoop’s specific architecture and constructs.
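A minimal MRUnit test sketch is shown below. It assumes MRUnit 1.x and JUnit on the classpath and reuses the hypothetical WordCount.TokenizerMapper class from the earlier word-count sketch.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {
    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        // The driver simulates the Hadoop runtime around a single mapper.
        mapDriver = MapDriver.newMapDriver(new WordCount.TokenizerMapper());
    }

    @Test
    public void emitsOnePerWord() throws Exception {
        mapDriver
            .withInput(new LongWritable(0), new Text("apple orange apple"))
            .withOutput(new Text("apple"), new IntWritable(1))
            .withOutput(new Text("orange"), new IntWritable(1))
            .withOutput(new Text("apple"), new IntWritable(1))
            .runTest();   // fails the JUnit test if the actual output differs
    }
}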
The provided figure depicts the entire procedure. There are
five distinct entities at the highest level:
The client, which submits the MapReduce job.
The YARN resource manager, which manages the allocation of
compute resources on the cluster.
The YARN node managers, which launch and monitor the compute
containers on machines in the cluster.
The MapReduce application master, which manages the tasks
involved in running the MapReduce job. The application master
and the MapReduce tasks run in containers that are scheduled
by the resource manager and managed by the node managers.
The distributed filesystem, which is used for sharing job
files among the other entities.
Step 5a: The resource manager's YARN scheduler allocates a
container, and the resource manager then launches the
application master process there, managed by the node
manager.
Step 5b, 6: The node manager knows that in this example the
application is a MapReduce program and initialises the job
with an MRAppMaster application.
Step 7: The MRAppMaster then retrieves the input splits from
the shared filesystem (which is HDFS in this example) and
creates a map task object for each split, as well as the
reduce task objects.
Steps 8, 9, 10, 11: Once a task has been assigned resources
for a container on a node by the resource manager's
scheduler, the application master starts the container by
contacting the node manager. Finally, the node manager runs
the map or reduce task.
When the last task of the job is complete, the application
master is notified and changes the job status to
'successful'. When the job polls this status, it tells the
client and returns from the waitForCompletion() method.
Lastly, the application master and the task containers clean
up their working state and remove any intermediate outputs.
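For completeness, here is a minimal driver sketch (the paths are illustrative) showing how a client configures a job and blocks on waitForCompletion(), the method the steps above refer to. It reuses the mapper and reducer from the earlier word-count sketch.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);  // optional local aggregation
        job.setReducerClass(WordCount.IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        // Submits the job and polls its status until it completes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}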
Many such failures are handled automatically, thanks to Apache
Hadoop's distributed multi-node High Availability cluster.
Run-time errors:
o Errors due to failure of tasks — child tasks
o Issues pertaining to resources
Data errors:
o Errors due to bad input records
o Malformed data errors
A common approach is to have conditions to detect such records,
log them, and ignore them. One such example is the use of a counter to
keep track of such records. Apache provides a way to keep
track of different entities, through its counter mechanism.
There are system-provided counters, such as bytes read and
number of map tasks; we have seen some of them in
Job History APIs. In addition to that, users can also
define their own counters for tracking. So, your
mapper can be enriched to keep track of these counts;
look at the following example:
if (!"red".equals(color))   // condition: the record's color is not red
{
context.getCounter(COLOR.NOT_RED).increment(1);
}
Or, you can handle your exception, as follows:
catch (NullPointerException npe)
{
context.getCounter(EXCEPTION.NPE).increment(1);
}
You can then get the final count through job
history APIs or from the Job instance directly, as
follows:
….
job.waitForCompletion(true);
Counters counters = job.getCounters();
Counter cl = counters.findCounter(COLOR.NOT_RED);
System.out.println("Errors " + cl.getDisplayName() + ": " +
cl.getValue());
The mapper and reducer also need to validate the key and
value fields themselves. For example, text data may need a
maximum line length, to ensure that no junk gets in.
Typically, such bad data is simply ignored by Hadoop
programs, as most Hadoop applications perform analytics
over large-scale data, unlike a transaction system, which
requires each data element and its dependencies.
Other errors:
o System issues
o Cluster issues
o Network issues
Role of Hbase in Big Data Processing
HBase
HBase is a distributed column-oriented database built
on top of the Hadoop file system.
It is a part of the Hadoop ecosystem that provides random
real-time read/write access to data in the Hadoop File
System.
HBase provides low latency random read and write
access to petabytes of data by distributing requests
from applications across a cluster of hosts. Each host
has access to data in HDFS and S3, and serves read and
write requests in milliseconds.
Apache HBase is an open-source, distributed, versioned,
non-relational database modeled after Google's Bigtable
designed to provide quick random access to huge
amounts of structured data.
It is horizontally scalable.
It leverages the fault tolerance provided by the HDFS.
One can store the data in HDFS either directly or
through HBase.
Hadoop (HDFS) does not support CRUD operations, whereas
HBase supports CRUD (create, read, update, delete) operations.
History
The HBase project was started toward the end of 2006
by Chad Walters and Jim Kellerman at Powerset. It
was modelled after Google’s Bigtable, which had just
been published.
24
In February 2007, Mike Cafarella made a code drop
of a mostly working system that Jim Kellerman then
carried forward.
The first usable HBase release was bundled as part of Hadoop 0.15.0 in October 2007.
In May 2010, HBase graduated from a Hadoop
subproject to become an Apache Top Level Project.
Today, HBase is a mature technology used in
production across a wide range of industries.
24
HDFS vs HBase:
HDFS does not support fast individual record lookups; HBase provides fast lookups for larger tables.
HDFS provides high-latency batch processing; HBase provides low-latency access to single rows from billions of records (random access).
HDFS provides only sequential access of data; HBase internally uses hash tables and provides random access, and it stores the data in indexed HDFS files for faster lookups.
Features
Linear and modular scalability.
Strictly consistent reads and writes.
Automatic and configurable sharding of tables
Automatic fail over support between RegionServers.
It integrates with Hadoop, both as a source and a
destination.
24
Convenient base classes for backing Hadoop MapReduce
jobs with Apache HBase tables.
Easy to use Java API for client access.
It provides data replication across clusters.
Block cache (HBase supports block cache to improve read
performance. When performing a scan, if block cache is
enabled and there is room remaining, data blocks read from
StoreFiles on HDFS are cached in region server's Java heap
space, so that next time, accessing data in the same block
can be served by the cached block) is used for real-time
queries
Bloom Filters(A bloom filter is a probabilistic data
structure that is based on hashing) used for real-time
queries. Query predicate push down via server side
Filters.
Thrift gateway (HBase Thrift Gateway includes an API
and a service that accepts Thrift requests to connect to HPE
Ezmeral Data Fabric Database and HBase tables) and a
REST-ful(RESTful API is an interface that two computer
systems use to exchange information securely over the
internet) Web service that supports XML, Protobuf, and
binary data encoding options
Extensible jruby-based (JIRB) shell.
Companies such as Facebook, Twitter, Yahoo, and Adobe
use HBase internally.
Limitations of Hadoop
Hadoop can perform only batch processing, and data
will be accessed only in a sequential manner. That means
one has to search the entire dataset even for the simplest of
jobs.
A huge dataset when processed results in another huge data
set, which should also be processed sequentially. At this
point, a new solution is needed to access any point of data
in a single unit of time (random access).
24
Storage Mechanism in HBase
HBase is a column-oriented database and the tables in it
are sorted by row.
The table schema defines only column families, which are
the key value pairs.
A table can have multiple column families and each column family can have any number of columns.
Subsequent column values are stored contiguously on
the disk.
Each cell value of the table has a timestamp.
In short, in an HBase:
Table is a collection of rows.
Row is a collection of column families.
Column family is a collection of columns.
Column is a collection of key value pairs.
Given below is an example schema of table in HBase.
In this example, each row (for example, the row with row ID 1) spans four column families, and each column family contains the columns col1, col2, and col3.
The following image shows column families in a column-
oriented database:
HBase and RDBMS
HBase: It is built for wide tables and is horizontally scalable.
RDBMS: It is thin and built for small tables, and it is hard to scale.
HBase - Architecture
In HBase, tables are split into regions and are served by the
region servers. Regions are vertically divided by column
families into “Stores”. Stores are saved as files in HDFS. Shown
below is the architecture of HBase.
Note: The term ‘store’ is used for regions to explain the storage
structure.
HBase has three major components: the client library, a
master server, and region servers. Region servers can be added
or removed as per requirement.
Regions
Regions are nothing but tables that are split up and spread
across the region servers.
Tables are automatically partitioned horizontally by HBase into regions. Each region comprises a subset of a table's rows. A region is denoted by the table it belongs to.
Region server
The region servers have regions that -
24
Communicate with the client and handle data-related
operations.
Handle read and write requests for all the regions under it.
Decide the size of the region by following the region size
thresholds.
The store contains memory store and HFiles.
Memstore is just like a cache memory. Anything that is
entered into the HBase is stored here initially.
Later, the data is transferred and saved in Hfiles as blocks
and the memstore is flushed.
When we take a deeper look into the region server, it contains regions and stores as shown below:
24
Master Server
Assigns regions to the region servers and takes the help of
Apache ZooKeeper for this task.
Handles load balancing of the regions across region
servers. Maintains the state of the cluster by negotiating the
load balancing.
It unloads the busy servers and shifts the regions to less
occupied servers.
24
Is responsible for schema changes and other metadata
operations such as creation of tables and column families.
Zookeeper
Zookeeper is an open-source project that provides services
like maintaining configuration information, naming,
providing distributed synchronization, etc.
Zookeeper has ephemeral nodes representing different
region servers. Master servers use these nodes to discover
available servers.
In addition to availability, the nodes are also used to track
server failures or network partitions.
Clients locate region servers through ZooKeeper and then communicate with the region servers directly.
In pseudo and standalone modes, HBase itself will take
care of zookeeper.
24
HBase Responsibilities Summary
HBase Proper
HMaster:
Performs administrative operations on the cluster
Applies DDL actions for creating or altering tables
Assigns and distributes regions among RegionServers (stored in the META system table)
Conducts region load balancing across RegionServers
Handles RegionServer failures by assigning the region to another RegionServer
The master node is lightly loaded
Rows are spread across regions, in key ranges, as data in the table grows
Leveraging Hadoop
HDFS:
Stores HLogs, the write-ahead log (WAL) files
Stores HFiles persisting all columns in a column family
ZooKeeper:
Tracks the location of the META system table
Receives heartbeat signals from HMaster and RegionServers
Provides HMaster with RegionServer failure notifications
Initiates the HMaster fail-over protocol
24
24
24
24
24
Creating a Table using HBase Shell
The syntax to create a table in HBase shell is shown below.
create '<table name>', '<column family>'
EX:
create 'CustomerContactInformation', 'CustomerName', 'ContactInfo'
In the HBase data model, column qualifiers are specific names assigned to your data values in order to make sure you are able to accurately identify them.
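As a quick sketch of how data could then be written and read back in the HBase shell (the row key 'row1' and the qualifiers FIRST_NAME and EMAIL are illustrative choices, not part of the original example):

put 'CustomerContactInformation', 'row1', 'CustomerName:FIRST_NAME', 'John'
put 'CustomerContactInformation', 'row1', 'ContactInfo:EMAIL', 'john@example.com'
get 'CustomerContactInformation', 'row1'
scan 'CustomerContactInformation'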
PIG
Why Pig?
In the MapReduce framework, programs are required to be translated into a sequence of Map and Reduce stages. To eliminate the complexities associated with MapReduce, an abstraction called Pig was built on top of Hadoop.
Developers who are not good at Java struggle a lot while
working with Hadoop, especially when executing tasks
related to the MapReduce framework. Apache Pig is the
best solution for all such programmers.
Pig Latin simplifies the work of programmers by eliminating the need to write complex code in Java for performing MapReduce tasks.
The multi-query approach of Apache Pig reduces the length
of code drastically and minimizes development time.
Pig Latin is quite similar to SQL; if you are familiar with SQL, it becomes very easy to learn Pig Latin.
24
In 2008, the first release of Apache Pig came out.
In 2010, Apache Pig graduated as an Apache top-level
project.
Pig Philosophy
24
As shown in the above diagram, Apache Pig consists of various components.
24
The Parser component produces output as a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators. In the DAG, the logical operators of the script are represented as nodes and the data flows are represented as edges.
24
1. PIG LATIN SCRIPT : contains syntax, data model and
operators. A Language used to express data flows
2. EXECUTION ENGINE: An engine on top of HADOOP-2
execution environment, takes the scripts written in Pig Latin
as an input and converts them into MapReduce jobs.
Pig Latin
Pig Latin is a Hadoop extension that simplifies Hadoop programming by providing a high-level data-processing language. In order to analyze large volumes of data,
programmers write scripts using Pig Latin language. These
scripts are then transformed internally into Map and Reduce
tasks. This is a highly flexible language and supports users
24
in developing custom functions for writing, reading and
processing data.
It enables the resources to focus more on data analysis by
minimizing the time taken for writing Map-Reduce
programs.
24
o DUMP or STORE to display/store result.
Example 1:
A = LOAD 'students' AS (rollno, name, gpa);
A = FILTER A BY gpa > 4.0;
A = FOREACH A GENERATE UPPER(name);
STORE A INTO 'myreport';
A is a relation
As soon as you enter a Load statement in the Grunt shell,
its semantic checking will be carried out. To see the
contents of the schema, you need to use
the Dump operator. Only after performing
the dump operation, the MapReduce job for loading the
data into the file system will be carried out.
24
Comments in Pig Script
While writing a script in a file, we can include comments in
it as shown below.
Multi-line comments
We will begin the multi-line comments with '/*', end them
with '*/'.
/* These are the multi-line comments
In the pig script */
Single –line comments
24
We will begin the single-line comments with '--'.
--we can write single line comments like this.
24
− (Subtraction): Subtracts the right-hand operand from the left-hand operand. Example: a − b will give −10.
* (Multiplication): Multiplies values on either side of the operator. Example: a * b will give 200.
/ (Division): Divides the left-hand operand by the right-hand operand. Example: b / a will give 2.
% (Modulus): Divides the left-hand operand by the right-hand operand and returns the remainder. Example: b % a will give 0.
?: (Bincond): Evaluates the Boolean operators. It has three operands: variable x = (expression) ? value1 if true : value2 if false. Example: b = (a == 1) ? 20 : 30; if a = 1 the value of b is 20, and if a != 1 the value of b is 30.
CASE WHEN THEN ELSE END (Case): The case operator is equivalent to the nested bincond operator. Example: CASE f2 % 2 WHEN 0 THEN 'even' WHEN 1 THEN 'odd' END
== (Equal): Checks if the values of two operands are equal or not; if yes, then the condition becomes true. Example: (a == b) is not true.
matches (Pattern matching): Checks whether the string on the left-hand side matches the constant (regular expression) on the right-hand side.
Operator: Description
Loading and Storing
LOAD: To load data from the file system (local/HDFS) into a relation.
STORE: To save a relation to the file system (local/HDFS).
Filtering
FILTER: To remove unwanted rows from a relation.
DISTINCT: To remove duplicate rows from a relation.
FOREACH, GENERATE: To generate data transformations based on columns of data.
STREAM: To transform a relation using an external program.
Grouping and Joining
JOIN: To join two or more relations.
COGROUP: To group the data in two or more relations.
GROUP: To group the data in a single relation.
CROSS: To create the cross product of two or more relations.
Sorting
ORDER: To arrange a relation in a sorted order based on one or more fields (ascending or descending).
LIMIT: To get a limited number of tuples from a relation.
Combining and Splitting
UNION: To combine two or more relations into a single relation.
SPLIT: To split a single relation into two or more relations.
Diagnostic Operators
DUMP: To print the contents of a relation on the console.
DESCRIBE: To describe the schema of a relation.
EXPLAIN: To view the logical, physical, or MapReduce execution plans to compute a relation.
ILLUSTRATE: To view the step-by-step execution of a series of statements.
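To see several of these operators working together, here is a small illustrative Grunt session (the file emp.txt, its fields, and the 50000 salary threshold are assumptions made only for this sketch):

grunt> emp = LOAD 'emp.txt' USING PigStorage(',') AS (id:int, name:chararray, dept:chararray, salary:float);
grunt> senior = FILTER emp BY salary > 50000.0;
grunt> by_dept = GROUP senior BY dept;
grunt> dept_count = FOREACH by_dept GENERATE group AS dept, COUNT(senior) AS cnt;
grunt> sorted = ORDER dept_count BY cnt DESC;
grunt> top3 = LIMIT sorted 3;
grunt> STORE top3 INTO 'top_departments' USING PigStorage(',');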
24
4 double: Represents a 64-bit floating point value. Example: 10.5
5 chararray: Represents a character array (string) in Unicode UTF-8 format. Example: 'tutorials point'
6 bytearray: Represents a byte array (blob).
7 boolean: Represents a Boolean value. Example: true/false.
8 datetime: Represents a date-time. Example: 1970-01-01T00:00:00.000+00:00
9 biginteger: Represents a Java BigInteger. Example: 60708090709
10 bigdecimal: Represents a Java BigDecimal. Example: 185.98376256272893883
Complex Types
11 tuple: A tuple is an ordered set of fields. Example: (raja, 30)
12 bag: A bag is a collection of tuples. Example: {(raju,30),(Mohhammad,45)}
13 map: A map is a set of key-value pairs. Example: ['name'#'Raju', 'age'#30]
24
Apache Pig Run Modes
Local Mode
o Here Pig language makes use of a local file system and runs
in a single JVM. The local mode is ideal for analyzing small
data sets.
o Here, files are installed and run using localhost.
24
o The local mode works on the local file system. The input and output data are stored in the local file system.
MapReduce Mode
o The MapReduce mode is also known as Hadoop Mode.
o It is the default mode.
o All the queries written using Pig Latin are converted into
MapReduce jobs and these jobs are run on a Hadoop
cluster.
o It can be executed against semi-distributed or fully
distributed Hadoop installation.
o Here, the input and output data are present on HDFS.
24
Ways to execute Pig Program
These are the following ways of executing a Pig program on
local and MapReduce mode:
o Interactive Mode / Grunt mode in anatomy of pig: In
this mode, the Pig is executed in the Grunt shell. To invoke
Grunt shell, run the pig command. Once the Grunt mode
executes, we can provide Pig Latin statements and
command interactively at the command line.
24
o Batch Mode / script mode in anatomy of pig: In this
mode, we can run a script file having a .pig extension.
These files contain Pig Latin commands.
o Embedded Mode: In this mode, we can define our
own functions. These functions can be called as UDF
(User Defined Functions). Here, we use programming
languages like Java and Python.
Command (local mode): $ ./pig -x local
Command (MapReduce mode): $ ./pig -x mapreduce
24
Either of these commands gives you the Grunt shell
prompt as shown below.
grunt>
You can exit the Grunt shell using ‘ctrl + d’.
After invoking the Grunt shell, you can execute a Pig script
by directly entering the Pig Latin statements in it.
grunt> customers = LOAD 'customers.txt' USING
PigStorage(',');
24
Sample_script.pig
student= LOAD 'hdfs://localhost:9000/pig_data/student.txt'
USING PigStorage(',')
as (id:int, name:chararray, city:chararray);
Dump student;
Now, you can execute the script in the above file as shown
below.
Local mode: $ pig -x local Sample_script.pig
MapReduce mode: $ pig -x mapreduce Sample_script.pig
24
It allows complex non-atomic datatypes such
as map and tuple. Given below is the diagrammatical
representation of Pig Latin’s data model.
Atom
It is a simple atomic value like int, long, double, or string.
Any single value in Pig Latin, irrespective of its data type, is known as an Atom.
It is stored as string and can be used as string and number.
int, long, float, double, chararray, and bytearray are the
atomic values of Pig.
24
A piece of data or a simple atomic value is known as
a field.
Example − ‘raja’ or ‘30’
Tuple
It is a sequence of fields that can be of any data type.
A record that is formed by an ordered set of fields is known
as a tuple, the fields can be of any type.
A tuple is similar to a row in a table of RDBMS.
Example − (Raja, 30)
Bag
It is a collection of tuples of potentially varying structures
and can contain duplicates.
A bag is an unordered set of tuples. In other words, a
collection of tuples (non-unique) is known as a bag.
Each tuple can have any number of fields (flexible
schema).
A bag is represented by ‘{ }’. It is similar to a table in
RDBMS, but unlike a table in RDBMS, it is not necessary that
every tuple contain the same number of fields or that the fields
in the same position (column) have the same type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is
known as inner bag.
Example − {Raja, 30, {9848022338,
[email protected],}}
Map
It is an associative array. A map (or data map) is a set of
key-value pairs.
The key needs to be of type chararray and should be
unique.
The value might be of any type. It is represented by ‘[ ]’
Example − [name#Raja, age#30]
Relation
A relation is a bag of tuples.
The relations in Pig Latin are unordered (there is no
guarantee that tuples are processed in any particular order).
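The whole data model can be seen together in a single schema declaration; the file name, field names, and delimiter below are assumptions used only to illustrate atoms, tuples, bags, and maps in one place:

grunt> emp = LOAD 'emp_data.txt' USING PigStorage('\t') AS (
           name:chararray,                            -- atom (simple field)
           details:tuple(age:int, city:chararray),    -- tuple
           phones:bag{t:tuple(phone:chararray)},      -- bag of tuples (an inner bag)
           props:map[]                                -- map of key-value pairs
       );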
24
Pig on Hadoop
24
However, Pig can also read input from and place output to
other sources.
In general, Apache Pig works on top of Hadoop. It is an
analytical tool that analyzes large datasets that exist in
the Hadoop File System.
To analyze data using Apache Pig, we have to initially load
the data into Apache Pig.
Student ID | First Name | Last Name | Phone | City
The Load and Store functions in Apache Pig are used to determine how the data goes into and comes out of Pig. These functions are used with the load and store operators.
Given below is the list of load and store functions available
in Pig.
24
On the left-hand side of the '=' operator, we need to mention the relation where we want to store the data, and on the right-hand side, we have to define how we store the data.
Syntax:
Relation_name = LOAD 'Input file path' USING
function as schema;
Where,
relation_name : We have to mention the relation in which
we want to store the data.
Input file path: We have to mention the HDFS directory
where the file is stored. (In MapReduce mode)
Function: We have to choose a function from the set of
load functions provided by Apache Pig (BinStorage,
JsonLoader, PigStorage, TextLoader).
Schema: We have to define the schema of the data. We can
define the required schema as follows −(column1 : data
type, column2 : data type, column3 : data type);
Note − If we load the data without specifying the schema, the columns will be addressed as $0, $1, $2, and so on.
Example
24
As an example, let us load the data in student_data.txt in
Pig under the schema named Student using
the LOAD command.
24
2015-10-01 12:33:38,242 [main] INFO
org.apache.pig.impl.util.Utils - Default bootup file
/home/Hadoop/.pigbootup not found
2015-10-01 12:33:39,630 [main]
INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEn
gine - Connecting to hadoop file system at: hdfs://localhost:9000
grunt>
24
grunt> student = LOAD
'hdfs://localhost:9000/pig_data/student_data.txt'
as(id,firstname,lastname,phone,city);
With the statement above, the input file is assumed to be tab-separated (the default delimiter), and it is not necessary to specify the complete schema (data types) of the relation.
Relation name: We have stored the data in the relation student.
Schema: We have stored the data using the schema (id, firstname, lastname, phone, city).
Note − The load statement will simply load the data into the
specified relation in Pig. To verify the execution of
the Load statement, you have to use the Diagnostic Operators.
The PigStorage() function loads and stores data as
structured text files. It takes a delimiter using which each entity
of a tuple is separated as a parameter. By default, it takes ‘\t’ as
a parameter.
Syntax
grunt> PigStorage(field_delimiter)
24
Example
Let us suppose we have a file named student_data.txt in
the HDFS directory named /data/ with the following content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai.
24
You can store the loaded data in the file system using
the store operator.
Syntax
STORE Relation_name INTO ' required_directory_path '
[USING function];
Example
Assume we have a file student_data.txt in HDFS with the
following content.
And we have read it into a relation student using the LOAD
operator as shown below.
grunt> student = LOAD
'hdfs://localhost:9000/pig_data/student_data.txt'
USING PigStorage(',')
as(id:int,firstname:chararray,lastname:chararray,phone:chararray
,
city:chararray);
24
grunt> STORE student INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');
Output
After executing the store statement, you will get the
following output.
A directory is created with the specified name and the data
will be stored in it.
The load statement will simply load the data into the specified
relation in Apache Pig. To verify the execution of
the Load statement, you have to use the Diagnostic Operators.
Dump Operator
24
The Dump operator is used to run the Pig Latin statements
and display the results on the screen.
It is generally used for debugging Purpose.
Syntax
grunt> Dump Relation_Name
Describe Operator
The describe operator is used to view the schema of a
relation.
Syntax
grunt> Describe Relation_name
Explain Operator
The explain operator is used to display the logical,
physical, and MapReduce execution plans of a relation.
Syntax
grunt> explain Relation_name;
Illustrate Operator
The illustrate operator gives you the step-by-step
execution of a sequence of statements.
24
Syntax
grunt> illustrate Relation_name;
Group Operator
24
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Apache Pig with the
relation name student_details as shown below.
grunt> student_details= LOAD
'hdfs://localhost:9000/pig_data/student_details.txt'
USING PigStorage(',')
as(id:int,firstname:chararray,lastname:chararray,age:int,phone:c
hararray,
city:chararray);
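Based on the DESCRIBE and DUMP output that follows, the grouping statement was presumably:

grunt> group_data = GROUP student_details BY age;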
24
We can verify the relation group_data using
the DUMP operator as shown below.
grunt> Dump group_data;
Output
Then you will get output displaying the contents of the
relation named group_data as shown below.
Here you can observe that the resulting schema has two
columns −
One is age, by which we have grouped the relation.
The other is a bag, which contains the group of tuples,
student records with the respective age.
(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)})
24
You can see the schema of the table after grouping the data
using the describe command as shown below.
grunt> Describe group_data;
group_data: {group: int, student_details: {(id: int, firstname: chararray, lastname: chararray, age: int, phone: chararray, city: chararray)}}
Try:grunt>illustrate group_data;
Try:grunt>explain group_data;
24
Joins in Pig
24
Outer-join − left join, right join, and full join
customers.txt
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00
orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
24
And we have loaded these two files into Pig with the
relations customers and orders as shown below.
grunt> customers = LOAD
'hdfs://localhost:9000/pig_data/customers.txt' USING
PigStorage(',')
as(id:int,name:chararray,age:int,address:chararray,salary:int);
Self - join
Self-join is used to join a table with itself as if the table
were two relations, temporarily renaming at least one relation.
Generally, in Apache Pig, to perform self-join, we will load
the same data multiple times, under different aliases (names).
24
Therefore let us load the contents of the file customers.txt as
two tables as shown below.
grunt> customers1 = LOAD
'hdfs://localhost:9000/pig_data/customers.txt'
USING PigStorage(',')
as(id:int,name:chararray,age:int,address:chararray,salary:int);
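For completeness, loading the second alias looks the same (this statement is implied by the join below and shown here as an assumption, following the same pattern):

grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt'
USING PigStorage(',')
as(id:int,name:chararray,age:int,address:chararray,salary:int);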
Syntax
Given below is the syntax of performing self-join operation
using the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY
key, Relation2_name BY key ;
Example
Let us perform self-join operation on the
relation customers, by joining the two
relations customers1 and customers2 as shown below.
24
grunt> customers3 = JOIN customers1 BY id, customers2 BY
id;
Output
It will produce the following output, displaying the contents of the relation customers3.
(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000
)
(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500)
(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000)
(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500)
(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500)
(6,Komal,22,MP,4500,6,Komal,22,MP,4500)
(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)
Inner Join
24
Inner Join is used quite frequently; it is also referred to
as equijoin. An inner join returns rows when there is a match in
both tables.
It creates a new relation by combining column values of
two relations (say A and B) based upon the join-predicate. The
query compares each row of A with each row of B to find all
pairs of rows which satisfy the join-predicate. When the join-
predicate is satisfied, the column values for each matched pair of
rows of A and B are combined into a result row.
Syntax
Here is the syntax of performing inner join operation using
the JOIN operator.
grunt> result = JOIN relation1 BY columnname, relation2
BY
columnname;
Example
Let us perform inner join operation on the two relations
customers and orders as shown below.
grunt> coustomer_orders= JOIN customers BY id, orders BY
customer_id;
24
Verify the relation coustomer_orders using
the DUMP operator as shown below.
grunt> Dump coustomer_orders;
Output
You will get the following output showing the contents of the relation named coustomer_orders.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
Note:
Outer Join: Unlike inner join, outer join returns all the rows
from at least one of the relations.
An outer join operation is carried out in three ways −
Left outer join
Right outer join
Full outer join
24
Left Outer Join
The left outer Join operation returns all rows from the left
table, even if there are no matches in the right relation.
Syntax
Given below is the syntax of performing left outer
join operation using the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY id LEFT
OUTER, Relation2_name BY customer_id;
Example
Let us perform left outer join operation on the two relations
customers and orders as shown below.
grunt>outer_left= JOIN customers BY id LEFT OUTER, orders
BY customer_id;
Verification
Verify the relation outer_left using the DUMP operator as
shown below.
grunt>Dump outer_left;
Output
It will produce the following output, displaying the
contents of the relation outer_left.
24
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)
24
Syntax
Given below is the syntax of performing full outer
join using the JOIN operator.
grunt>outer_full = JOIN customers BY id FULL OUTER,
orders BY customer_id;
Cross Operator
The CROSS operator computes the cross-product of two or
more relations. This chapter explains with example how to use
the cross operator in Pig Latin.
Syntax
Given below is the syntax of the CROSS operator.
24
grunt> Relation3_name = CROSS Relation1_name,
Relation2_name;
Foreach Operator
24
grunt> foreach_data= FOREACH student_details GENERATE
age>25;
Order By Operator
The ORDER BY operator is used to display the contents of
a relation in a sorted order based on one or more fields.
Syntax
Given below is the syntax of the ORDER BY operator.
grunt> Relation_name2 = ORDER Relation_name1 BY field_name (ASC|DESC);
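For example, using the student_details relation loaded earlier (the choice of the age field and descending order is illustrative):

grunt> order_by_data = ORDER student_details BY age DESC;
grunt> Dump order_by_data;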
Limit Operator
The LIMIT operator is used to get a limited number of
tuples from a relation.
Syntax
Given below is the syntax of the LIMIT operator.
24
grunt> Result = LIMIT Relation_name required number of
tuples;
Example
grunt> student_details= LOAD
'hdfs://localhost:9000/pig_data/student_details.txt'
USING PigStorage(',')
as(id:int,firstname:chararray,lastname:chararray,age:int,phone:c
hararray,
city:chararray);
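A limited result can then be produced and displayed as follows (the count of 4 tuples is an arbitrary illustrative value):

grunt> limit_data = LIMIT student_details 4;
grunt> Dump limit_data;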
24
these UDF’s, we can define our own functions and use them.
The UDF support is provided in six programming languages,
namely, Java, Jython, Python, JavaScript, Ruby and Groovy.
Local mode: $ pig -x local Sample_script.pig
MapReduce mode: $ pig -x mapreduce Sample_script.pig
24
You can execute it from the Grunt shell as well using the
exec/run command as shown below.
grunt> exec /sample_script.pig
Features of Pig
It provides an engine for executing data flows (how your
data should flow).
Pig processes data in parallel on the Hadoop cluster
It provides a language called “Pig Latin” to express data
flows.
Rich set of operators − It provides many operators to perform operations like join, sort, filter, etc.
Ease of programming − Pig Latin is similar to SQL, and it is easy to write a Pig script if you are good at SQL. Pig scripts are also easy to understand and maintain (roughly 1/20th the lines of code and 1/16th the development time of equivalent MapReduce programs).
Optimization opportunities − The tasks in Apache Pig
optimize their execution automatically, so the programmers
need to focus only on semantics of the language.
Extensibility − Using the existing operators, users can
develop their own functions to read, process, and write
data.
24
UDF’s − Pig provides the facility to create User-defined
Functions in other programming languages such as Java
and invoke or embed them in Pig Scripts.
Handles all kinds of data − Apache Pig analyzes all kinds
of data, both structured as well as unstructured. It stores the
results in HDFS.
24
To perform data processing for search platforms.
To process time sensitive data loads.
24
Apache Pig Vs Hive
Both Apache Pig and Hive are used to create MapReduce
jobs. And in some cases, Hive operates on HDFS in a
similar way Apache Pig does. In the following table, we
have listed a few significant points that set Apache Pig
apart from Hive.
24
Apache Pig uses a language called Pig Latin. It was originally created at Yahoo. Hive uses a language called HiveQL. It was originally created at Facebook.
24
Differences between Apache MapReduce and PIG
It is quite difficult to perform data operations in MapReduce, whereas Pig provides built-in operators to perform data operations like union, sorting and ordering.
Pig vs SQL:
In Pig Latin, there is limited opportunity for query optimization, whereas there is much more opportunity for query optimization in SQL.
Hive
24
The Hive logo is a funky hybrid of an elephant and a bee.
Hive is a data warehousing tool (data warehouse software) that sits on top of Hadoop.
Facebook initially created Hive component to manage their
ever growing volumes of log data. Later Apache software
foundation developed it as open-source and it came to be known
as Apache Hive.
Hive is suitable for
o Data Warehousing applications
o Processes batch jobs on huge data that is immutable
(data whose structure cannot be changed after it is
created is called immutable data)
o Examples: Web Logs, Application Logs
Hive is used to process structured data in Hadoop.
The three main tasks performed by Apache Hive are
1. Summarization
24
2. Querying
3. Analysis
Hive provides extensive data type functions and formats for
data summarization and analysis.
Hive makes use of the following
o HDFS for storage
o MapReduce for execution
o Stores metadata / schemas in an RDBMS.
Hive programs are written in the Hive Query Language (HQL), which is a declarative language similar to SQL. Hive translates Hive queries into MapReduce jobs and then runs the jobs in the Hadoop cluster.
Hive supports OLAP(Online Analytical Processing)
Hive queries can be used to replace complicated java
MapReduce programs with structured and semi-structured data
processing and analysis. A person who is knowledgeable about
SQL statements can write the hive queries relatively easily.
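For instance, a simple aggregation that Hive turns into a MapReduce job behind the scenes might look like this (the web_logs table and its status column are hypothetical):

hive> SELECT status, COUNT(*) AS hits
    > FROM web_logs
    > GROUP BY status;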
The Hive platform makes it simple to perform tasks like:
Large-scale data analysis
Perform Ad-hoc queries
Perform Data encapsulation
24
Note:
1. Hive is not RDBMS
2. It is not designed to support OLTP
3. It is not designed for real time queries
4. It is not designed to support row-level updates
History of Hive
Features of Hive:
24
1.It provides SQL-like queries (i.e., HQL) that are implicitly
transformed to MapReduce or Spark jobs. We don’t need to
know any programming languages to work with Hive. We
can work with Hive using only basic SQL.
2.The table structure in Hive is the same as the table structure
in a relational database.
3.HQL is easy to code
4.Hive supports rich data types such as structs, lists and maps
5.Hive supports SQL filters, group-by and order-by clauses
6.Hive is fast and scalable.
7.It is capable of analyzing large datasets stored in HDFS and
other similar data storage systems such as HBase to access
data.
8.It allows different storage types such as plain text, RCFile,
and HBase.
9.It uses indexing to accelerate queries.
10. It can operate on compressed data stored in the
Hadoop ecosystem.
11. We can use Apache Hive for free. It is open-source.
12. Multiple users can perform queries on the data at the
same time.
24
13. The Apache Hive software perfectly matches the low-
level interface requirements of Apache Hadoop.
14. In order to improve performance, Apache Hive partitions and buckets data at the table level.
15. Hive provides support for a variety of file formats,
including textFile, orc, Avro, sequence file, parquet,
Copying, LZO Compression, and so on.
16. Hive has a variety of built-in functions.
17. The Oracle BI Client Developer’s Kit also provides
support for User-Defined Functions for data cleansing and
filtering. We can define UDFs according to our
requirements.
18. External tables are supported by Apache Hive. We can
process data without actually storing data in HDFS because
of this feature.
19. Hive support includes ETLs. Hive is an effective ETL
tool.
20. Hive can accommodate client applications written in
PHP, Python, Java, C++, and Ruby.
21. Hive has an optimizer that applies rules to logical
plans to improve performance.
24
22. We can run Ad-hoc queries in Hive, which are loosely
typed commands or queries whose values depend on some
variable for the data analysis.
23. Hive can be used to implement data visualisation in
Tez. Hive can be used to integrate with Apache Tez to
provide real-time processing capabilities.
24
Hive Data Units
1. Databases: The namespace for Tables
2. Tables: Set of records that have similar schema
3. Partitions: Logical separations of data based on
classification of given information as per specific attributes.
Once hive has partitioned the data based on a specified key,
it starts to assemble the records into specific folders as and
when the records are inserted
4. Buckets(or Clusters): Similar to partitions but uses hash
function to segregate data and determines the cluster or
bucket into which the record should be placed.
Partitioning in Hive
The partitioning in Hive means dividing the table into some
parts based on the values of a particular column like date,
course, city or country.
The advantage of partitioning is that since the data is stored
in slices, the query response time becomes faster.
As we know that Hadoop is used to handle the huge
amount of data, it is always required to use the best approach to
deal with it.
24
Example: Let's assume we have a data of 10 million
students studying in an institute. Now, we have to fetch the
students of a particular course. If we use a traditional approach,
we have to go through the entire data. This leads to performance
degradation. In such a case, we can adopt the better approach
i.e., partitioning in Hive and divide the data among the different
datasets based on particular columns.
The partitioning in Hive can be executed in two ways -
Static Partitioning
In static or manual partitioning, it is required to pass the
values of partitioned columns manually while loading the data
into the table. Hence, the data file doesn't contain the partitioned
columns.
Example of Static Partitioning
First, select the database in which we want to create a table.
hive> create table student (id int, name string, age int, institute string)
>partitioned by (course string)
>row format delimited
>fields terminated by ',';
Let's retrieve the information associated with the table.
hive> describe student;
Load the data into the table and pass the values of partition
columns with it by using the following command: -
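Mirroring the second load shown below, the first load would presumably be as follows (the file name student_details1 and the course value "java" are assumptions):

hive> load data local inpath '/home/codegyani/hive/student_details1'
into table student
partition(course= "java");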
Here, we are partitioning the students of an institute based on
courses.
Load the data of another file into the same table and pass the
values of partition columns with it by using the following
command: -
hive> load data local inpath '/home/codegyani/hive/student_details2'
into table student
partition(course= "hadoop");
24
Let's retrieve the entire data of the table by using the following
command: -
hive> select * from student;
24
In this case, we are not examining the entire data. Hence, this
approach improves query response time.
Let's also retrieve the data of another partitioned dataset by
using the following command: -
hive> select * from student where course= "hadoop";
Dynamic Partitioning
In dynamic partitioning, the values of partitioned columns
exist within the table. So, it is not required to pass the values of
partitioned columns manually
First, select the database in which we want to create a table.
24
hive> use show;
24
into table stud_demo;
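Assuming a staging table stud_demo and a partitioned table student_part as referred to below (the column list and file path here are assumptions), a typical dynamic-partitioning flow would be:

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> create table stud_demo (id int, name string, age int, institute string, course string)
row format delimited fields terminated by ',';
hive> load data local inpath '/home/codegyani/hive/student_details' into table stud_demo;
hive> create table student_part (id int, name string, age int, institute string)
partitioned by (course string)
row format delimited fields terminated by ',';
hive> insert into student_part partition(course)
select id, name, age, institute, course from stud_demo;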
24
In the following screenshot, we can see that the table
student_part is divided into two categories.
24
Let's retrieve the entire data of the table by using the
following command:
hive> select * from student_part;
24
In this case, we are not examining the entire data. Hence, this
approach improves query response time.
Let's also retrieve the data of another partitioned dataset by
using the following command: -
hive> select * from student_part where course= "hadoop";
Bucketing in Hive
The bucketing in Hive is a data organizing technique.
It is similar to partitioning in Hive with an added
functionality that it divides large datasets into more manageable
parts known as buckets. So, we can use bucketing in Hive when
the implementation of partitioning becomes difficult. However,
we can also divide partitions further in buckets.
24
o The concept of bucketing is based on the hashing
technique.
o Here, the modulus of the current column value's hash and the number of required buckets is calculated (say, F(x) % 3); see the sketch after this list.
o Now, based on the resulted value, the data is stored into the
corresponding bucket.
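A minimal sketch of creating and filling a bucketed table (the table names emp_bucket and emp_raw, the columns, and the choice of 3 buckets are assumptions for illustration; on older Hive versions, hive.enforce.bucketing must also be set to true):

hive> create table emp_bucket (id int, name string, salary float)
clustered by (id) into 3 buckets
row format delimited fields terminated by ',';
hive> insert overwrite table emp_bucket
select id, name, salary from emp_raw;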
Hive Architecture:
24
o Hive Server: This is an optional server. This can be used to
submit Hive Jobs from a remote client.
o JDBC: Job can be submitted from a JDBC Client. One can
write a Java code to connect to Hive and submit jobs on it.
o ODBC: It allows the applications that support the ODBC
protocol to connect to Hive.
The figure above provides a glimpse of the architecture of
Apache Hive and its main sections. Apache Hive is a large and
complex software system. It has the following components:
Hive Client:
24
Hive drivers support applications written in any language
like Python, Java, C++, and Ruby, among others, using JDBC,
ODBC, and Thrift drivers, to perform queries on the Hive.
Therefore, one may design a hive client in any language of their
choice.
Hive supports three types of clients:
1. Thrift Clients: The Hive server can handle requests from a
thrift client by using Apache Thrift.
2. JDBC client: A JDBC driver connects to Hive using the
Thrift framework. Hive Server communicates with the Java
applications using the JDBC driver.
3. ODBC client: The Hive ODBC driver is similar to the JDBC driver; however, it uses the ODBC protocol to communicate with the Hive Server.
Hive Services:
Hive provides numerous services, including the Hive server2,
Beeline, etc. The services offered by Hive are:
1. Beeline: HiveServer2 supports Beeline, a command shell to which the user can submit commands and queries. It is a JDBC client that utilises the SQLLine CLI (a pure-Java console utility for connecting to relational databases and executing SQL queries). Beeline is based on JDBC.
2. Hive Server 2: HiveServer2 is the successor to
HiveServer1. It provides clients with the ability to execute
queries against the Hive. Multiple clients may submit
queries to Hive and obtain the desired results. Open API
clients such as JDBC and ODBC are supported by
HiveServer2.
Note: Hive server1, which is also known as a Thrift server, is
used to communicate with Hive across platforms. Different
client applications can submit requests to Hive and receive the
results using this server.
HiveServer1 could not handle concurrent requests from more than one client, so it was replaced by HiveServer2.
Hive Driver:
The Hive driver receives the HiveQL statements submitted
by the user through the command shell and creates session
handles for the query.
24
Hive Compiler:
Metastore and hive compiler both store metadata in order to
support the semantic analysis and type checking performed on
the different query blocks and query expressions by the hive
compiler. The execution plan generated by the hive compiler is
based on the parse results.
The execution plan generated by the compiler is a DAG (Directed Acyclic Graph), where each step may be a map/reduce job, an operation on file metadata, or an HDFS data-manipulation step.
Optimizer:
The optimizer splits the execution plan before performing
the transformation operations so that efficiency and scalability
are improved.
Execution Engine:
After the compilation and optimization steps, the execution
engine uses Hadoop to execute the prepared execution plan,
which is dependent on the compiler’s execution plan.
24
Metastore:
Metastore stores metadata information about tables and
partitions, including column and column type information, in
order to improve search engine indexing.
The metastore also stores information about the serializer and
deserializer as well as HDFS files where data is stored and
provides data storage. It is usually a relational database. Hive
metadata can be queried and modified through Metastore.
We can configure the metastore in either of two modes:
1. Remote: In remote mode, the metastore runs as its own separate service, and other processes (including non-Java applications) communicate with it through Thrift.
2. Embedded: In embedded mode, the metastore runs in the same JVM as Hive, and the client accesses it directly via JDBC.
HCatalog:
HCatalog is a Hadoop table and storage management layer
that provides users with different data processing tools such as
Pig, MapReduce, etc. with simple access to read and write data
on the grid.
24
HCatalog is built on top of the Hive metastore and exposes the metastore's tabular data to other data processing tools.
WebHCat:
The REST API for HCatalog provides an HTTP interface to perform Hive metadata operations. WebHCat is a service provided to the user for running Hadoop MapReduce (or YARN), Pig, and Hive jobs.
Distributed Storage:
Hive is based on Hadoop, which means that it uses the
Hadoop Distributed File System for distributed storage.
24
Hive is a data warehouse system which is used to analyze
structured data.
It is built on the top of Hadoop.
It was developed by Facebook.
Hive provides the functionality of reading, writing, and
managing large datasets residing in distributed storage.
It runs SQL like queries called HQL (Hive query language)
which gets internally converted to MapReduce jobs.
Using Hive, we can skip the requirement of the traditional
approach of writing complex MapReduce programs.
Hive supports Data Definition Language (DDL), Data
Manipulation Language (DML), and User Defined Functions
(UDF).
Hive Architecture
The following architecture explains the flow of submission
of query into Hive.
24
Hive Client
Hive allows writing applications in various languages,
including Java, Python, and C++.
It supports different types of clients such as:-
o Thrift Server - It is a cross-language service provider platform that serves requests from all programming languages that support Thrift.
o JDBC Driver - It is used to establish a connection between
hive and Java applications. The JDBC Driver is present in
the class org.apache.hadoop.hive.jdbc.HiveDriver.
24
o ODBC Driver - It allows the applications that support the
ODBC protocol to connect to Hive.
Hive Services
24
o Hive Compiler - The purpose of the compiler is to parse the
query and perform semantic analysis on the different query
blocks and expressions. It converts HiveQL statements into
MapReduce jobs.
o Hive Execution Engine - Optimizer generates the logical
plan in the form of DAG of map-reduce tasks and HDFS
tasks. In the end, the execution engine executes the
incoming tasks in the order of their dependencies.
Working of Hive
24
The following table defines how Hive interacts
with Hadoop framework:
Step 1 - Execute Query: The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
Step 2 - Get Plan: The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan or the requirement of the query.
Step 3 - Get Metadata: The compiler sends a metadata request to the Metastore (any database).
Step 4 - Send Metadata: The Metastore sends the metadata as a response to the compiler.
Step 5 - Send Plan: The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
Step 6 - Execute Plan: The driver sends the execute plan to the execution engine.
Step 7 - Execute Job: Internally, the process of executing the job is a MapReduce job. The execution engine sends the job to the JobTracker, which is in the Name node, and it assigns this job to the TaskTracker, which is in the Data node. Here, the query executes the MapReduce job.
Step 7.1 - Metadata Ops: Meanwhile, during execution, the execution engine can execute metadata operations with the Metastore.
Step 8 - Fetch Result: The execution engine receives the results from the Data nodes.
Step 9 - Send Results: The execution engine sends those resultant values to the driver.
Step 10 - Send Results: The driver sends the results to the Hive interfaces.
24
SMALLINT: 2-byte signed integer, from -32,768 to 32,767
BIGINT: 8-byte signed integer, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
Decimal Types
FLOAT: 4-byte single-precision floating point number
DOUBLE: 8-byte double-precision floating point number
24
String Types
STRING: The string is a sequence of characters. It values can
be enclosed within single quotes (') or double quotes (").
Varchar: The varchar is a variable length type whose range lies
between 1 and 65535, which specifies that the maximum
number of characters allowed in the character string.
CHAR: The char is a fixed-length type whose maximum length
is fixed at 255.
Date/Time Types
TIMESTAMP
o It supports traditional UNIX timestamp with optional
nanosecond precision.
o As Integer numeric type, it is interpreted as UNIX
timestamp in seconds.
o As Floating point numeric type, it is interpreted as UNIX
timestamp in seconds with decimal precision.
o As string, it follows java.sql.Timestamp format "YYYY-
MM-DD HH:MM:SS.fffffffff" (9 decimal place precision)
24
DATES
The Date value is used to specify a particular year, month and day, in the form YYYY-MM-DD. However, it does not provide the time of day. The range of the Date type lies between 0000-01-01 and 9999-12-31.
Struct: It is similar to a C struct or an object where fields are accessed using the "dot" notation. Example: struct('James','Roy')
Map: It is a collection of key-value pairs where fields are accessed using array notation. Example: map('first','James','last','Roy')
Array: It is a collection of values of a similar type that are indexable using zero-based integers. Example: array('James','Roy')
Miscellaneous types:
Boolean and binary
File Formats
Apache Hive supports several familiar file formats used in Apache Hadoop. Hive can load and query different data files created by other Hadoop components such as Pig or MapReduce.
24
Different file formats and compression codecs work better
for different data sets in Apache Hive.
Following are the Apache Hive different file formats:
Text File
Sequence File
RC File
AVRO File
ORC File
Parquet File
24
Examples
Below is the Hive CREATE TABLE command with
storage format specification:
Create table textfile_table
(column_specs)
stored as textfile;
24
Hive RC File Format
RCFile (Record Columnar File) is a row-columnar file format that offers high row-level compression rates. If you have a requirement to fetch multiple rows at a time, the RCFile format is a good choice.
The RCFile are very much similar to the sequence file
format. This file format also stores the data as key-value pairs.
Create RCFile by specifying ‘STORED AS
RCFILE’ option at the end of a CREATE TABLE Command:
Example
Below is the Hive CREATE TABLE command with
storage format specification:
Create table RCfile_table
(column_specs)
stored as rcfile;
24
in any programming language. Avro is one of the popular file formats in Big Data Hadoop-based applications.
Create AVRO file by specifying ‘STORED AS
AVRO’ option at the end of a CREATE TABLE Command.
Example
Below is the Hive CREATE TABLE command with
storage format specification:
Create table avro_table
(column_specs)
stored as avro;
24
Below is the Hive CREATE TABLE command with
storage format specification:
Create table orc_table
(column_specs)
stored as orc;
24
Hive - Create Database
In Hive, the database is considered as a catalog or
namespace of tables. So, we can maintain multiple tables within
a database where a unique name is assigned to each table. Hive
also provides a default database with a name default.
Initially, we check the default database provided by Hive.
So, to check the list of existing databases, follow the below
command: -
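The standard command for listing the databases is:

hive> show databases;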
Hive also allows assigning properties with the database in the
form of key-value pair.
Internal Table
The internal tables are also called managed tables as the
lifecycle of their data is controlled by the Hive.
24
By default, these tables are stored in a subdirectory under
the directory defined by hive.metastore.warehouse.dir(i.e.
/user/hive/warehouse).
The internal tables are not flexible enough to share with
other tools like Pig.
If we try to drop the internal table, Hive deletes both table
schema and data.
Let's create an internal table by using the following
command:-
hive> create table demo.employee (Id int, Name string, Salary float)
row format delimited fields terminated by ',';
Let's see the metadata of the created table by using the following
command: -
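The usual command for this is:

hive> describe demo.employee;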
External Table
The external table allows us to create and access a table and
a data externally. The external keyword is used to specify the
24
external table, whereas the location keyword is used to
determine the location of loaded data.
As the table is external, the data is not present in the Hive
directory. Therefore, if we try to drop the table, the metadata of
the table will be deleted, but the data still exists.
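A minimal sketch of creating such a table (the table name emplist, its columns, and the location '/HiveDirectory' are illustrative assumptions):

hive> create external table emplist (Id int, Name string, Salary float)
row format delimited fields terminated by ','
location '/HiveDirectory';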
24
hive>select * from demo.employee;
Limitations of Hive
o Hive is not capable of handling real-time data.
o It is not designed for online transaction processing.
o Hive queries contain high latency.
Hive vs Pig: Hive uses a declarative, SQL-like language (HiveQL), whereas Pig uses a procedural data-flow language (Pig Latin).