BDA Notes Unit-4
UNIT IV
Hadoop Architecture: Hadoop: RDBMS Vs Hadoop, Hadoop Overview, Hadoop distributors, HDFS, HDFS Daemons, Anatomy of File Write and Read, Name Node, Secondary Name Node, and Data Node, HDFS Architecture, Hadoop Configuration, Map Reduce Framework, Role of HBase in Big Data processing, HIVE, PIG.
Hadoop Architecture: Hadoop: RDBMS Vs Hadoop
Hadoop Architecture:
As we all know, Hadoop is a framework written in Java that utilizes a large cluster of commodity hardware to store and manage big data. Hadoop works on the MapReduce programming algorithm that was introduced by Google. Today lots of big-brand companies use Hadoop in their organization to deal with big data, e.g., Facebook, Yahoo, Netflix, eBay, etc. The Hadoop architecture mainly consists of 4 components:
MapReduce
HDFS(Hadoop Distributed File System)
YARN(Yet Another Resource Negotiator)
Common Utilities or Hadoop Common
Here, we can see that the input is provided to the Map() function, then its output is used as an input to the Reduce() function, and after that we receive our final output. Let's understand what Map() and Reduce() do.
An input is provided to Map(); since we are using big data, the input is a set of data blocks. The Map() function breaks these data blocks into tuples, which are nothing but key-value pairs. These key-value pairs are then sent as input to Reduce(). The Reduce() function combines these tuples (key-value pairs) based on their key, forms a new set of tuples, and performs operations like sorting, summation, etc., which are then sent to the final output node. Finally, the output is obtained.
The data processing is always done in the Reducer, depending upon the business requirement of that industry. This is how first Map() and then Reduce() are utilized one by one.
Let’s understand the Map Task and Reduce Task in detail.
Map Task:
Combiner: The combiner is used for grouping the data in the Map workflow. It is similar to a local reducer. The intermediate key-value pairs that are generated in the Map phase are combined with the help of this combiner. Using a combiner is optional.
Partitioner: The partitioner is responsible for fetching the key-value pairs generated in the Mapper phase. The partitioner generates the shards (partitions) corresponding to each reducer. It computes the hash code of each key and takes its modulus with the number of reducers, i.e., key.hashCode() % (number of reducers).
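As an illustration, here is a minimal custom partitioner written against the standard org.apache.hadoop.mapreduce API (the class name WordPartitioner and the Text/IntWritable types are assumptions for this sketch) that applies the hashCode-modulus rule described above:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each intermediate key to a reducer using key.hashCode() % numReduceTasks.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the partition number is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

Masking the sign bit keeps the result non-negative, which is the same approach Hadoop's default HashPartitioner takes.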
Reduce Task
Shuffle and Sort: The task of the Reducer starts with this step. The process in which the Mapper generates the intermediate key-value pairs and transfers them to the Reducer task is known as shuffling. Using the shuffling process, the system can sort the data by key. Shuffling begins as soon as some of the Map tasks are done, which makes it a faster process because it does not wait for the completion of all the work performed by the Mappers.
Reduce: The main task of Reduce is to gather the tuples generated by Map and then perform sorting and aggregation on those key-value pairs depending on their key element.
OutputFormat: Once all the operations are performed, the key-value pairs are written into a file with the help of a record writer, each record on a new line, with the key and value separated by a delimiter (a tab by default for text output).
2. HDFS
NameNode(Master)
DataNode(Slave)
NameNode: The NameNode works as the Master in a Hadoop cluster and guides the DataNodes (slaves). The NameNode is mainly used for storing the metadata, i.e., the data about the data. Metadata can be the transaction logs that keep track of the user's activity in the Hadoop cluster. Metadata can also be the name of the file, its size, and information about the location (block number, block IDs) of the DataNodes, which the NameNode stores to find the closest DataNode for faster communication.
File Block In HDFS: Data in HDFS is always stored in terms of blocks. A single file is divided into multiple blocks of size 128 MB, which is the default, and you can also change it manually.
Let's understand this concept of breaking a file down into blocks with an example. Suppose you have uploaded a file of 400 MB to your HDFS; this file gets divided into blocks of 128 MB + 128 MB + 128 MB + 16 MB = 400 MB, meaning 4 blocks are created, each of 128 MB except the last one. Hadoop doesn't know or care about what data is stored in these blocks, so it treats the final file block as a partial record, since it has no idea about record boundaries. In the Linux file system, the size of a file block is about 4 KB, which is very much less than the default size of file blocks in the Hadoop file system. As we all know, Hadoop is mainly configured for storing large data, in the petabyte range; this is what makes the Hadoop file system different from other file systems, as it can be scaled. Nowadays, file blocks of 128 MB to 256 MB are used in Hadoop.
Replication In HDFS: Replication ensures the availability of the data. Replication means making a copy of something, and the number of copies you make of that particular thing is its replication factor. As we have seen in File Blocks, HDFS stores the data in the form of various blocks, and at the same time Hadoop is also configured to make copies of those file blocks.
By default, the replication factor for Hadoop is set to 3, which can be configured; you can change it manually as per your requirement. In the above example we made 4 file blocks, which means 3 replicas (copies) of each file block are made, i.e., a total of 4 × 3 = 12 blocks for backup purposes.
This is because, for running Hadoop, we are using commodity hardware (inexpensive system hardware) which can crash at any time; we are not using supercomputers for our Hadoop setup. That is why we need a feature in HDFS which can make copies of the file blocks for backup purposes; this is known as fault tolerance.
One thing we also need to notice is that after making so many replicas of our file blocks we use much more storage, but for big organizations the data is far more important than the storage, so nobody minds this extra storage. You can configure the replication factor in your hdfs-site.xml file.
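For reference, a minimal hdfs-site.xml fragment that sets the replication factor and block size discussed above might look like the following (property names assume Hadoop 2.x):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <!-- number of copies kept of each block -->
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
    <!-- 128 MB expressed in bytes -->
  </property>
</configuration>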
Rack Awareness: A rack is nothing but a physical collection of nodes in our Hadoop cluster (maybe 30 to 40). A large Hadoop cluster consists of many racks. With the help of this rack information, the NameNode chooses the closest DataNode to achieve maximum performance while performing read/write operations, which reduces network traffic.
YARN (Yet Another Resource Negotiator)
Features of YARN:
Multi-Tenancy
Scalability
Cluster-Utilization
Compatibility
Hadoop Common (Common Utilities): Hadoop Common, or the common utilities, is nothing but the Java libraries and Java files, or we can say the Java scripts, that are needed by all the other components present in a Hadoop cluster. These utilities are used by HDFS, YARN, and MapReduce for running the cluster. Hadoop Common assumes that hardware failure in a Hadoop cluster is common, so it needs to be handled automatically in software by the Hadoop framework.
Hadoop: It is an open-source software framework used for storing data and running applications on a group of commodity hardware. It has large storage capacity and high processing power, and it can manage multiple concurrent processes at the same time. It is used in predictive analytics, data mining, and machine learning. It can handle both structured and unstructured forms of data. It is more flexible in storing, processing, and managing data than a traditional RDBMS. Unlike traditional systems, Hadoop enables multiple analytical processes on the same data at the same time. It supports scalability very flexibly.
Below are some of the differences between RDBMS and Hadoop:
RDBMS is best suited for an OLTP environment; Hadoop is best suited for big data.
The data schema of RDBMS is static; the data schema of Hadoop is dynamic.
Hadoop Overview
Hadoop is an Apache open source framework written in java that allows distributed
processing of large datasets across clusters of computers using simple programming models.
The Hadoop framework application works in an environment that provides
distributed storage and computation across clusters of computers. Hadoop is designed to
scale up from single server to thousands of machines, each offering local computation and
storage.
Hadoop Architecture
At its core, Hadoop has two major layers, namely a processing/computation layer (MapReduce) and a storage layer (HDFS – Hadoop Distributed File System).
MapReduce
MapReduce is a parallel programming model for writing distributed applications devised at
Google for efficient processing of large amounts of data (multi-terabyte data-sets), on large
clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
The MapReduce program runs on Hadoop which is an Apache open-source framework.
Hadoop distributors
Amazon EMR processes big data across a Hadoop cluster of virtual servers
on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage
Service (S3). The Elastic in EMR's name refers to its dynamic resizing
ability, which enables administrators to increase or reduce resources,
depending on their current needs.
Amazon EMR is used for data analysis in log analysis, web indexing, data warehousing, machine learning (ML), financial analysis, and scientific simulation.
The main goal of Hortonworks is to drive all its innovations through the
Hadoop open data platform and build an ecosystem of partners that speeds up
the process of Hadoop adoption amongst enterprises.
Cloudera is the market leader in the Hadoop space and was the first to release a commercial Hadoop distribution. It offers consulting services to bridge the gap between "what Apache Hadoop provides" and "what organizations need".
The MapR Converged Data Platform supports big data storage and
processing through the Apache collection of Hadoop products, as well as
its added-value components. These components from MapR Technologies
provide several enterprise-grade proprietary tools to better manage and
ensure the resiliency and reliability of data in the Hadoop cluster.
IBM InfoSphere BigInsights: Introduces two new Application Accelerators that may help you achieve a faster ROI: the Social Data Analytics Accelerator and the Machine Data Analytics Accelerator.
Provides analytics and enterprise functionality on top of Apache Hadoop
technology to meet big data enterprise requirements.
Lets you run with the IBM-provided Apache Hadoop distribution or deploy to
a Cloudera cluster.
Includes data discovery to enable exploratory analysis and modeling of
unconventional data types.
New capabilities in IBM InfoSphere BigInsights Enterprise Edition V2.0:
Enhanced big data tools: Visualization, monitoring, development
o With a consistent, unified, and extensible user interface, the big data
tools bring big data collaboration to the enterprise and can help users
unlock the value within data by enabling various roles of an
organization to collaboratively leverage, discover, and explore large
amounts of data. In this release, BigInsights adds the following new
features for four major roles:
o Business analysts and business users
A centralized dashboard to visualize analytic results, including
new charts, BigSheets workbooks, data operations, and to
visualize metrics from the monitoring service.
The ability to view BigSheets data flows between and across
data sets to quickly navigate and relate analysis and charts.
BigSheets usability enhancements, including inner outer joins,
enhanced filters for BigSheets columns, column data-type
mapping for collections, application of analytics to BigSheets
columns, data preparation enabling users to define and update
schemas for data sources to improve error checking and
analysis, and additional charting capabilities.
Workflow composition that enables users to compose new
applications from existing applications and BigSheets, and to
invoke analytics applications from the web console, including
integration within BigSheets.
New Apps providing enhanced data import capability: a new
REST data source App that enables users to load data from any
data source supporting REST APIs into BigInsights , including
popular social media services; a new Sampling App that enables
users to sample data for analysis; and a new Subsetting App that
enables users to subset data for data analysis.
o Data scientists
A unified tooling environment that supports the data analytics
lifecycle by enabling users to sample data and define, test, and
deploy analytics applications from the BigInsights Eclipse tools,
and to administer, execute, and monitor the deployed
applications from the BigInsights Web Console.
Integration with R with an App that allows users to execute R
scripts directly from the BigInsights Web Console.
Extended text analytics capability that performs global analysis
to address key use cases such as customer profiling and lead
generation.
o Administrators
New monitoring capabilities that provide a centralized
dashboard view to visualize key performance indicators
including CPU, disk, and memory and network usage for the
cluster, data services such as HDFS, HBase, Zookeeper, and
Flume, and application services including MapReduce, Hive,
and Oozie.
Enhanced status information and control over the major cluster
capabilities, building on the existing server management
capabilities.
Usability improvements for application permissions and
deployment.
New capability to view and control all applications from a single
page.
o Developers
A workflow editor that greatly simplifies the creation of complex Oozie workflows with a consumable interface.
A Pig editor with content assist and syntax highlighting that
enables users to create and execute new applications using
Pig in local or cluster mode from the Eclipse IDE.
Usability improvements to the Jaql Editor, such as extended
support for Jaql syntax, extended content assist, and improved
execution feedback.
Enablement of BigSheets macro and BigSheets reader
development.
Enhanced Text Analytics development, including the support
for modular rule sets, populating external artifacts, and
providing workflow UI extensions for domain and language
extensions.
Enhanced scope of the development artifacts during the
deployment phase, including artifacts for Text Analytics, scripts
for Jaql, Hive SQL, Pig, and Derby, and BigSheets macros and
readers.
Enhanced enterprise integration
o InfoSphere Data Explorer: InfoSphere BigInsights includes a limited-
use license to the included IBM InfoSphere Data Explorer program,
which helps organizations discover, navigate, and visualize vast
amounts of structured and unstructured information across many
enterprise systems and data repositories. It also provides a cost-
effective and efficient entry point for exploring the value of big data
technologies through a powerful framework for developing applications
that leverage existing enterprise data.
o InfoSphere Streams: InfoSphere BigInsights includes a limited-use
license to the included InfoSphere Streams program, which enables
real-time, continuous analysis of data on the fly. Streams is an
enterprise-class stream processing system that can be used to extract
actionable insights from data as it arrives in the enterprise while
transforming data and ingesting it into BigInsights at high speeds.
Other Hadoop vendors include Microsoft and IBM. These Hadoop vendors face tough competition in the open data platform. With the war heating up amongst big data vendors, nobody is sure who will top the list of commercial Hadoop vendors. With the Hadoop buying cycle on the upswing, Hadoop vendors must capture market share at a rapid pace to keep their venture investors happy.
8) Datameer Professional
DataStax made the choice to use Apache Cassandra, which provides an “always-
on” capability for DataStax Enterprise (DSE) Analytics. DataStax OpsCenter
also offers a web-based visual management system for DSE that allows cluster
management, point-and-click provisioning and administration, secured
administration, smart data protection, and visual monitoring and tuning.
Dell summarizes the hardware and software requirements for your Hadoop cluster simply as 2–100 Linux servers, each with 6 GB RAM, 2+ cores, and 1 TB HDD. The point is that entry into a Hadoop solution is simple and
inexpensive. And as Dell puts it, “Gain robust big data analytics on an open and
easily deployed platform.”
HDFS
The Hadoop Distributed File System (HDFS) is the world's most reliable storage system. HDFS is the filesystem of Hadoop, designed for storing very large files running on a cluster of commodity hardware. It is designed on the principle of storing a smaller number of large files rather than a huge number of small files.
Hadoop HDFS provides a fault-tolerant storage layer for Hadoop and its
other components. HDFS Replication of data helps us to attain this feature.
It stores data reliably, even in the case of hardware failure. It provides high
throughput access to application data by providing the data access in
parallel.
HDFS Nodes
As we know, Hadoop works in master-slave fashion, HDFS also has two
types of nodes that work in the same manner. These are
the NameNode(s) and the DataNodes.
HDFS stores multiple copies of each block across the cluster on different nodes. This is replication of data. By default, the HDFS replication factor is 3. Hadoop HDFS provides high availability, fault tolerance, and reliability.
HDFS splits a large file into n number of small blocks and stores them on different DataNodes in the cluster in a distributed manner. It replicates each block and stores the replicas across different DataNodes in the cluster. The Name Node places replicas of a block on multiple racks for improved fault tolerance.
Name Node tries to place at least one replica of a block in a different rack so
that if a complete rack goes down, then also the system will be highly
available.
In the remaining part of this section, we will see how read and write operations are performed in HDFS. To read or write a file in HDFS, the client needs to interact with the Name Node. HDFS applications need a write-once-read-many access model for files: a file, once created and written, cannot be edited.
Name Node stores metadata, and Data Node stores actual data. The client interacts with the Name Node for performing any task, as the Name Node is the centerpiece of the cluster.
There are several Data Nodes in the cluster which store HDFS data on the local disk. A Data Node sends a heartbeat message to the Name Node periodically to indicate that it is alive. It also replicates data to other Data Nodes as per the replication factor.
b. Blocks
HDFS splits huge files into small chunks known as blocks. A block is the smallest unit of data in a filesystem. We (client and admin) do not have any control over the block, such as the block location; the NameNode decides all such things. The HDFS default block size is 128 MB. We can increase or decrease the block size as per our need. This is unlike the OS filesystem, where the block size is 4 KB.
If the data size is less than the HDFS block size, then the block size will be equal to the data size.
For example, if the file size is 129 MB, then 2 blocks will be created for it: one block of the default size 128 MB, and the other of only 1 MB, not 128 MB, since that would waste space (here the block size equals the remaining data size). Hadoop is intelligent enough not to waste the remaining 127 MB, so it allocates a 1 MB block for the 1 MB of data.
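As a rough illustration (plain Java, not part of any Hadoop API), the block layout described above can be computed like this:

// Illustrative only: shows how a 129 MB file maps onto 128 MB HDFS blocks.
public class BlockLayout {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // default HDFS block size (128 MB)
        long fileSize  = 129L * 1024 * 1024;   // the 129 MB file from the example

        long fullBlocks  = fileSize / blockSize;                    // 1 full block of 128 MB
        long lastBlockMB = (fileSize % blockSize) / (1024 * 1024);  // last block holds only 1 MB

        System.out.println("Full 128 MB blocks: " + fullBlocks);
        System.out.println("Last block size   : " + lastBlockMB + " MB");
    }
}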
The major advantage of storing data in such large blocks is that it saves disk seek time; another advantage is in processing, as a mapper processes one block at a time, so one mapper processes a large amount of data at a time.
c. Replication
Hadoop HDFS creates duplicate copies of each block. This is known as
replication. All blocks are replicated and stored on different DataNodes
across the cluster. It tries to put at least 1 replica in a different rack.
HDFS Daemons:
Daemons are processes. The Hadoop daemons are a set of processes that run on Hadoop. Since Hadoop is a framework written in Java, all these processes are Java processes.
Apache Hadoop 2 consists of the following Daemons:
NameNode
DataNode
Secondary Name Node
Resource Manager
Node Manager
Namenode, Secondary NameNode, and Resource Manager work on a Master
System while the Node Manager and DataNode work on the Slave machine.
1. NameNode
The namenode daemon is a master daemon and is responsible for storing all the
location information of the files present in HDFS. The actual data is never stored on
a namenode. In other words, it holds the metadata of the files in HDFS.
The name node maintains the entire metadata in RAM, which helps clients receive
quick responses to read requests. Therefore, it is important to run name node from a
machine that has lots of RAM at its disposal. The higher the number of files in HDFS,
the higher the consumption of RAM. The name node daemon also maintains a
persistent checkpoint of the metadata in a file stored on the disk called the fsimage file.
In case the name node daemon is restarted, the following sequence of events occur
at name node boot up:
1. Read the fsimage file from the disk and load it into memory (RAM).
2. Read the actions that are present in the edits log and apply each action to the
in-memory representation of the fsimage file.
3. Write the modified in-memory representation to the fsimage file on the disk.
The preceding steps make sure that the in-memory representation is up to date.
The namenode daemon is a single point of failure in Hadoop 1.x, which means
that if the node hosting the namenode daemon fails, the filesystem becomes
unusable. To handle this, the administrator has to configure the namenode to
write the fsimage file to the local disk as well as a remote disk on the network.
This backup on the remote disk can be used to restore the namenode on a freshly
installed server. Newer versions of Apache Hadoop (2.x) now support High
Availability (HA), which deploys two namenodes in an active/passive
configuration, wherein if the active namenode fails, the control falls onto the
passive namenode, making it active. This configuration reduces the downtime
in case of a namenode failure.
Since the fsimage file is not updated for every operation, it is possible
the edits logfile would grow to a very large file. The restart of namenode service
would become very slow because all the actions in the large edits logfile will
have to be applied on the fsimage file. The slow boot up time could be avoided
using the secondary namenode daemon.
The namespace image and the edit log store information about the data and the metadata. The NameNode also determines the mapping of blocks to DataNodes. Furthermore, the NameNode is a single point of failure. The DataNode is a multiple-instance server: there can be several DataNode servers, and the number depends on the type of network and the storage system.
The DataNode servers store and maintain the data blocks. The NameNode server provisions the data blocks on the basis of the type of job submitted by the client. A DataNode also stores and retrieves blocks when asked by clients or the NameNode, serves read/write requests, and performs block creation, deletion, and replication on instruction from the NameNode. There can be only one Secondary NameNode server in a cluster. Note that you cannot treat the Secondary NameNode server as a disaster recovery server; however, it can partially restore the NameNode server in case of a failure.
2. DataNode
The DataNode works on the slave system. The NameNode always instructs the DataNode to store the data. A DataNode is a program that runs on the slave system and serves the read/write requests from the client. As the data is stored on the DataNodes, they should possess a large amount of storage to hold more data.
How to start Data Node?
hadoop-daemon.sh start datanode
How to stop Data Node?
hadoop-daemon.sh stop datanode
The datanode daemon acts as a slave node and is responsible for storing the actual
files in HDFS. The files are split as data blocks across the cluster. The blocks are
typically 64 MB to 128 MB size blocks. The block size is a configurable parameter.
The file blocks in a Hadoop cluster also replicate themselves to other datanodes for
redundancy so that no data is lost in case a datanode daemon fails. The datanode
daemon sends information to the namenode daemon about the files and blocks
stored in that node and responds to the namenode daemon for all filesystem
operations. The following diagram shows how files are stored in the cluster:
File blocks of files A, B, and C are replicated across multiple nodes of the cluster
for redundancy. This ensures availability of data even if one of the nodes fail.
You can also see that blocks of file A are present on nodes 2, 4, and 6; blocks of
file B are present on nodes 3, 5, and 7; and blocks of file C are present on 4, 6,
and 7. The replication factor configured for this cluster is 3, which signifies that
each file block is replicated three times across the cluster. It is the responsibility
of the namenode daemon to maintain a list of the files and their corresponding
locations on the cluster. Whenever a client needs to access a file, the namenode daemon provides the location of the file to the client, and the client then accesses the file directly from the datanode daemon.
3. Secondary NameNode
The Secondary NameNode is used for taking periodic (by default hourly) checkpoints of the NameNode metadata. In case the Hadoop cluster fails or crashes, the checkpointed metadata stored in the fsimage file can be transferred to a new system, a new master is created with this metadata, and the cluster is made to run again correctly. This is the benefit of the Secondary NameNode.
4. Resource Manager
Resource Manager is also known as the Global Master Daemon that works on the
Master System.
The Resource Manager manages the resources for the applications that are running in a Hadoop cluster. It has two main components:
1. Applications Manager
2. Scheduler
The Applications Manager is responsible for accepting job submissions from clients and negotiates a container (memory resource) on a slave node in the Hadoop cluster to host the Application Master.
The Scheduler is utilized for allocating resources to the applications running in the Hadoop cluster; it is a pure scheduler and performs no monitoring or tracking of application status.
5. Node Manager
The Node Manager works on the slave system and manages the resources (memory and local disk) within its node. Each slave node in a Hadoop cluster has a single NodeManager daemon running on it. It also sends this monitoring information to the Resource Manager.
How to start Node Manager?
yarn-daemon.sh start nodemanager
How to stop Node Manager?
yarn-daemon.sh stop nodemanager
Default web UI ports: ResourceManager – 8088, NodeManager – 8042.
Hadoop provides the following features:
Distributed Storage
Reliability
High availability
Cost-effective
High throughput
Building Blocks of Hadoop:
Name Node
Data Node
Secondary Name Node (SNN)
Job Tracker
Task Tracker
Anatomy of File Read in HDFS
Let's get an idea of how data flows between the client interacting with HDFS, the name node, and the data nodes.
Step 1: The client opens the file it wishes to read by calling open() on the File System
Object(which for HDFS is an instance of Distributed File System).
Step 2: Distributed File System (DFS) calls the name node, using remote procedure
calls (RPCs), to determine the locations of the first few blocks in the file. For each
block, the name node returns the addresses of the data nodes that have a copy of
that block.
The DFS returns an FSDataInputStream to the client for it to read data from.
FSDataInputStream in turn wraps a DFSInputStream, which manages the data node
and name node I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the data node addresses for the first few blocks in the file, then connects to the first (closest) data node for the first block in the file.
Step 4: Data is streamed from the data node back to the client, which calls read()
repeatedly on the stream.
Step 5: When the end of a block is reached, DFSInputStream closes the connection to that data node and then finds the best data node for the next block. This happens transparently to the client, which from its point of view is simply reading a continuous stream. Blocks are read in order, with the DFSInputStream opening new connections to data nodes as the client reads through the stream. It will also call the name node to retrieve the data node locations for the next batch of blocks as needed.
Step 6: When the client has finished reading the file, it calls close() on the FSDataInputStream.
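A minimal sketch of this read path using the Hadoop FileSystem Java API; the path /user/demo/example.txt is only an assumed example:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                 // Distributed File System instance
        // Steps 1-2: open() triggers the RPC to the name node for block locations.
        FSDataInputStream in = fs.open(new Path("/user/demo/example.txt"));
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line;
        while ((line = reader.readLine()) != null) {          // Steps 3-5: data streamed from data nodes
            System.out.println(line);
        }
        reader.close();                                       // Step 6: close() on the stream
    }
}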
Anatomy of File Write in HDFS
Next, we’ll check out how files are written to HDFS. Consider figure 1.2 to get a
better understanding of the concept.
Note: HDFS follows the Write once Read many times model. In HDFS we cannot edit
the files which are already stored in HDFS, but we can append data by reopening
the files.
Step 1: The client creates the file by calling create() on the Distributed File System (DFS) object.
Step 2: The DFS makes an RPC call to the name node to create the new file in the filesystem's namespace. The name node performs checks to make sure the file doesn't already exist and that the client has the right permissions to create it. If these checks pass, the name node makes a record of the new file; otherwise, the file can't be created and the client is thrown an error, i.e., an IOException. The DFS returns an FSDataOutputStream for the client to start writing data to.
Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the name node to allocate new blocks by picking a list of suitable data nodes to store the replicas. The list of data nodes forms a pipeline, and here we'll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first data node in the pipeline, which stores each packet and forwards it to the second data node in the pipeline.
Step 4: Similarly, the second data node stores the packet and forwards it to the third
(and last) data node in the pipeline.
Step 5: The DFSOutputStream maintains an internal queue of packets that are waiting to be acknowledged by data nodes, called the "ack queue".
Step 6: When the client has finished writing data, it calls close() on the stream. This flushes all the remaining packets to the data node pipeline and waits for acknowledgments before contacting the name node to signal that the file is complete.
HDFS follows the Write Once Read Many model, so we can't edit files that are already stored in HDFS, but we can append data by reopening them. This design allows HDFS to scale to a large number of concurrent clients because the data traffic is spread across all the data nodes in the cluster. Thus, it increases the availability, scalability, and throughput of the system.
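A minimal sketch of the corresponding write path using the same FileSystem API; the path and the content written are assumed examples:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Steps 1-2: create() asks the name node to record the new file and returns the output stream.
        FSDataOutputStream out = fs.create(new Path("/user/demo/output.txt"));
        // Steps 3-5: the data is split into packets and pushed through the data node pipeline.
        out.writeUTF("Hello HDFS");
        // Step 6: close() flushes remaining packets and tells the name node the file is complete.
        out.close();
    }
}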
Name Node
NameNode works as Master in Hadoop cluster. Below listed are the main
function performed by NameNode:
1. Stores metadata of the actual data, e.g., filename, path, number of data blocks, block IDs, block locations, number of replicas, and slave-related configuration.
2. Manages the file system namespace.
3. Regulates client access requests for the actual file data.
4. Assigns work to the slaves (DataNodes).
5. Executes file system namespace operations like opening/closing files and renaming files and directories.
6. As the Name Node keeps metadata in memory for fast retrieval, a huge amount of memory is required for its operation. It should be hosted on reliable hardware.
HDFS NameNode
1. NameNode is the main central component of HDFS architecture
framework.
2. NameNode is also known as Master node.
3. HDFS Namenode stores meta-data i.e. number of data blocks, file name,
path, Block IDs, Block location, no. of replicas, and also Slave related
configuration. This meta-data is available in memory in the master for faster
retrieval of data.
4. NameNode keeps metadata related to the file system namespace in memory for quicker response time. Hence, more memory is needed, and the NameNode should be deployed on reliable hardware.
5. NameNode maintains and manages the slave nodes, and assigns tasks to
them.
6. NameNode has knowledge of all the DataNodes containing data blocks for
a given file.
7. NameNode coordinates with hundreds or thousands of data nodes and
serves the requests coming from client applications.
Two files, 'FsImage' and the 'EditLog', are used to store metadata information.
FsImage: It is a snapshot of the file system when the Name Node is started; it is an "image file". The FsImage contains the entire filesystem namespace and is stored as a file in the NameNode's local file system. It also contains a serialized form of all the directory and file inodes in the filesystem. Each inode is an internal representation of a file or directory's metadata.
EditLogs: It contains all the recent modifications made to the file system since the most recent FsImage. When the NameNode receives a create/update/delete request from the client, the request is first recorded in the edits file.
1. It is the master daemon that maintains and manages the DataNodes (slave
nodes).
2. It records the metadata of all the files stored in the cluster, e.g. The location
of blocks stored, the size of the files, permissions, hierarchy, etc.
3. It records each change that takes place to the file system metadata. For
example, if a file is deleted in HDFS, the NameNode will immediately record
this in the EditLog.
4. It regularly receives a Heartbeat and a block report from all the DataNodes
in the cluster to ensure that the DataNodes are live.
5. It keeps a record of all the blocks in HDFS and in which nodes these blocks
are located.
6. The NameNode is also responsible to take care of the replication factor of
all the blocks.
7. In case of a DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage, and manages the communication traffic to the DataNodes.
Each and every transaction that occurs on the file system is recorded within the
edit log file. At some point of time this file becomes very large.
The Namenode holds the metadata for HDFS like block information, size, etc. This information is stored in main memory as well as on disk for persistent storage. The information is stored in 2 different files: the fsimage and the edit logs.
Any change done to HDFS gets noted in the edit log, so the edit log file size grows, whereas the size of the fsimage remains the same. This does not have any impact until we restart the server. When we restart the server, the edit log entries are applied to the fsimage file and loaded into main memory, which takes some time. If we restart the cluster after a long time, there will be a long downtime since the edit log file would have grown. The Secondary NameNode comes into the picture to rescue us from this problem.
The Secondary NameNode simply gets the edit logs from the NameNode periodically and applies them to its copy of the fsimage. This new fsimage is copied back to the NameNode, and the NameNode uses this new fsimage for the next restart, which reduces the startup time.
It is a helper node to the NameNode; to be precise, the Secondary NameNode's whole purpose is to perform checkpoints in HDFS, which helps the NameNode function effectively. Hence, it is also called the Checkpoint node.
Now there are two important files which reside in the NameNode's current directory:
1. FsImage file: This file is the snapshot of the HDFS metadata at a certain point of time.
2. Edits Log file: This file stores the records of changes that have been made in the HDFS namespace.
The main function of the Secondary NameNode is to store the latest copy of the FsImage and the Edits Log files. When the NameNode is restarted, the latest Edits Log files are applied to the FsImage file in order to bring the HDFS metadata up to date. So it becomes very important to store a copy of these two files, which is done by the Secondary NameNode. To keep the latest versions of these two files, the Secondary NameNode takes checkpoints on an hourly basis, which is the default time gap.
Checkpoint:
A checkpoint is nothing but the updating of the latest FsImage file by applying the latest Edits Log files to it. If the time gap between checkpoints is large, there will be too many Edits Log files generated, and it will be very cumbersome and time-consuming to apply them all at once to the latest FsImage file. This may lead to a very long start time for the primary NameNode after a reboot.
Note:
There are two options which can be used along with the secondarynamenode command:
1. -geteditsize: this option helps to find the current size of the in-progress edits file (edits_inprogress) present in the NameNode's current directory.
2. -checkpoint [force]: this option checkpoints the Secondary NameNode against the latest state of the primary NameNode. Normally a checkpoint is performed only if the size of the Edits Log file is greater than or equal to the configured checkpoint size; with "force", the checkpoint is performed regardless of the Edits Log size.
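Assuming a standard Hadoop 2.x installation, these options are invoked through the hdfs command, for example:
hdfs secondarynamenode -geteditsize
hdfs secondarynamenode -checkpoint force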
The Namenode holds the metadata for HDFS like namespace information, block information, etc. When in use, all this information is stored in main memory, but it is also stored on disk for persistent storage. Only on a restart of the namenode are the edit logs applied to the fsimage to get the latest snapshot of the file system. But namenode restarts are rare in production clusters, which means the edit logs can grow very large on clusters where the namenode runs for a long period of time: the edit log becomes very large, and the next namenode restart then takes a long time because all the changes have to be merged.
Things have changed over the years, especially with Hadoop 2.x. The Namenode is now highly available with a failover feature. The Secondary Namenode is optional now, and a Standby Namenode can be used for the failover process. The Standby NameNode stays up to date with all the file system changes the Active NameNode makes.
When the namenode is in the active state, the edit log size grows continuously, because the edit logs can only be applied to the fsimage at the time of a namenode restart to get the latest state of HDFS. If the edit logs grow significantly and the namenode tries to apply them to the fsimage at restart, the process can take very long; here the secondary namenode comes into play.
The secondary namenode keeps a checkpoint on the namenode: it reads the edit logs from the namenode continuously after a specific interval and applies them to its own copy of the fsimage. In this way, the fsimage file always has a recent state of HDFS. The secondary namenode then copies the new fsimage back to the primary, so the fsimage is updated.
Data Node
HDFS Data Node
1. Data Node is also known as Slave node.
2. In Hadoop HDFS Architecture, Data Node stores actual data in HDFS.
3. Data Nodes are responsible for serving read and write requests from the clients.
4. Data Nodes can deploy on commodity hardware.
5. Data Nodes send information to the Name Node about the files and blocks stored on that node and respond to the Name Node for all filesystem operations.
6. When a Data Node starts up, it announces itself to the Name Node along with the list of blocks it is responsible for.
7. A Data Node is usually configured with a lot of hard disk space, because the actual data is stored on the Data Node.
HDFS Architecture
NameNode:
Functions of NameNode:
It is the master daemon that maintains and manages the DataNodes (slave
nodes)
It records the metadata of all the files stored in the cluster, e.g.
The location of blocks stored, the size of the files, permissions, hierarchy,
etc. There are two files associated with the metadata:
o FsImage: It contains the complete state of the file system
namespace since the start of the NameNode.
o EditLogs: It contains all the recent modifications made to the file
system with respect to the most recent FsImage.
It records each change that takes place to the file system metadata. For
example, if a file is deleted in HDFS, the NameNode will immediately record
this in the EditLog.
It regularly receives a Heartbeat and a block report from all the DataNodes
in the cluster to ensure that the DataNodes are live.
It keeps a record of all the blocks in HDFS and in which nodes these blocks
are located.
The NameNode is also responsible for maintaining the replication factor of all the blocks, which is discussed in detail in the Replication Management section below.
DataNode:
DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode is commodity hardware, that is, an inexpensive system which is not of high quality or high availability. The DataNode is a block server that stores the data in the local file system (e.g., ext3 or ext4).
Functions of DataNode:
These are slave daemons or processes which run on each slave machine.
The actual data is stored on DataNodes.
The DataNodes perform the low-level read and write requests from the file
system’s clients.
They send heartbeats to the NameNode periodically to report the overall
health of HDFS, by default, this frequency is set to 3 seconds.
By now it should be clear that the NameNode is critical: if it fails, the file system becomes unusable. As discussed earlier, Hadoop 2.x addresses this single point of failure through NameNode High Availability with an active/standby configuration.
Secondary NameNode:
Apart from these two daemons, there is a third daemon or a process called
Secondary NameNode. The Secondary NameNode works concurrently with
the primary NameNode as a helper daemon. And don’t be confused about
the Secondary NameNode being a backup NameNode because it is not.
The Secondary NameNode constantly reads the file system state and metadata from the RAM of the NameNode and writes it to the hard disk or the file system.
It is responsible for combining the EditLogs with FsImage from the
NameNode.
It downloads the EditLogs from the NameNode at regular intervals and
applies to FsImage. The new FsImage is copied back to the NameNode,
which is used whenever the NameNode is started the next time.
Blocks:
Now, as we know that the data in HDFS is scattered across the DataNodes
as blocks. Let’s have a look at what is a block and how is it formed?
Blocks are nothing but the smallest contiguous locations on your hard drive where data is stored. In general, in any file system, you store the data as a collection of blocks. Similarly, HDFS stores each file as blocks which are scattered throughout the Apache Hadoop cluster. The default size of each block is 128 MB in Apache Hadoop 2.x (64 MB in Apache Hadoop 1.x), which you can configure as per your requirement.
It is not necessary that in HDFS each file is stored in an exact multiple of the configured block size (128 MB, 256 MB, etc.). Let's take an example of a file "example.txt" of size 514 MB: with the default 128 MB block size, it is split into five blocks, four of 128 MB and a last block of only 2 MB.
Well, whenever we talk about HDFS, we talk about huge data sets, i.e.
Terabytes and Petabytes of data. So, if we had a block size of let’s say of
4 KB, as in Linux file system, we would be having too many blocks and
therefore too much of the metadata. So, managing these no. of blocks and
metadata will create huge overhead, which is something, we don’t want.
Now that we understand what a block is, let us understand how the replication of these blocks takes place in the next section on Replication Management.
Replication Management:
HDFS provides a reliable way to store huge data in a distributed
environment as data blocks. The blocks are also replicated to provide fault
tolerance. The default replication factor is 3 which is again configurable.
Each block is replicated three times and stored on different DataNodes (considering the default replication factor).
Therefore, if you are storing a file of 128 MB in HDFS using the default
configuration, you will end up occupying a space of 384 MB (3*128 MB)
as the blocks will be replicated three times and each replica will be residing
on a different DataNode.
Rack Awareness:
Moving ahead, let's talk about how HDFS places replicas and what rack awareness is. The NameNode ensures that all the replicas of a block are not stored on the same rack or a single rack. It follows an in-built Rack Awareness algorithm to reduce latency as well as provide fault tolerance. In an actual Hadoop production cluster, there are multiple racks, each populated with DataNodes.
So, now you will be thinking why do we need a Rack Awareness algorithm?
The reasons are:
To prevent loss of data: We don’t have to worry about the data even if
an entire rack fails because of the switch failure or power failure. And if
you think about it, it will make sense, as it is said that never put all your
eggs in the same basket.
Assume that the block size is configured to 128 MB (default) and the client wants to write a 248 MB file, "example.txt". So, the client will divide the file into 2 blocks – one of 128 MB (Block A) and the other of 120 MB (Block B).
Now, the following protocol will be followed whenever the data is written
into HDFS:
At first, the HDFS client will reach out to the NameNode for a Write Request
against the two blocks, say, Block A & Block B.
The NameNode will then grant the client the write permission and will
provide the IP addresses of the DataNodes where the file blocks will be
copied eventually.
The selection of the IP addresses of the DataNodes is randomized, subject to availability, replication factor, and rack awareness, as discussed earlier.
Let’s say the replication factor is set to default i.e. 3. Therefore, for each
block the NameNode will be providing the client a list of (3) IP addresses
of DataNodes. The list will be unique for each block.
Suppose, the NameNode provided following lists of IP addresses to the
client:
o For Block A, list A = {IP of DataNode 1, IP of DataNode 4, IP of
DataNode 6}
o For Block B, list B = {IP of DataNode 3, IP of DataNode 7, IP of DataNode 9}
1. Set up of Pipeline
2. Data streaming and replication
3. Shutdown of Pipeline (Acknowledgement stage)
1. Set up of Pipeline:
Before writing the blocks, the client confirms whether the DataNodes,
present in each of the list of IPs, are ready to receive the data or not. In
doing so, the client creates a pipeline for each of the blocks by connecting the individual DataNodes in the respective list for that block. Let us consider Block A; the list of DataNodes provided by the NameNode is {DataNode 1, DataNode 4, DataNode 6}.
So, for block A, the client will be performing the following steps to create
a pipeline:
The client will choose the first Data Node in the list (Data Node IPs for
Block A) which is Data Node 1 and will establish a TCP/IP connection.
The client will inform Data Node 1 to be ready to receive the block. It will
also provide the IPs of next two Data Nodes (4 and 6) to the Data Node 1
where the block is supposed to be replicated.
The Data Node 1 will connect to Data Node 4. The DataNode 1 will inform
Data Node 4 to be ready to receive the block and will give it the IP of
DataNode 6. Then, Data Node 4 will tell Data Node 6 to be ready for
receiving the data.
Next, the acknowledgement of readiness will follow the reverse sequence,
i.e. From the DataNode 6 to 4 and then to 1.
At last DataNode 1 will inform the client that all the DataNodes are ready
and a pipeline will be formed between the client, DataNode 1, 4 and 6.
Now pipeline set up is complete and the client will finally begin the data
copy or streaming process.
2. Data Streaming:
As the pipeline has been created, the client will push the data into the
pipeline. Now, don’t forget that in HDFS, data is replicated based on
replication factor. So, here Block A will be stored to three DataNodes as
the assumed replication factor is 3. Moving ahead, the client will copy the block (A) to DataNode 1 only; the replication to the remaining DataNodes is then carried out by the DataNodes themselves, sequentially:
Once the block has been written to DataNode 1 by the client, DataNode 1
will connect to DataNode 4.
Then, DataNode 1 will push the block in the pipeline and data will be copied
to DataNode 4.
Again, DataNode 4 will connect to DataNode 6 and will copy the last replica
of the block.
Once the block has been copied into all the three DataNodes, a series of
acknowledgements will take place to ensure the client and NameNode that
the data has been written successfully. Then, the client will finally close
the pipeline to end the TCP session.
Similarly, Block B will also be copied into the DataNodes in parallel with
Block A. So, the following things are to be noticed here:
Two pipelines are formed, one for each block (A and B). Following is the flow of operations taking place for each block in its respective pipeline:
Now, following steps will be taking place while reading the file:
The client will reach out to NameNode asking for the block metadata for
the file “example.txt”.
The NameNode will return the list of DataNodes where each block (Block
A and B) are stored.
After that, the client will connect to the DataNodes where the blocks are stored.
The client starts reading data in parallel from the DataNodes (Block A from DataNode 1 and Block B from DataNode 3).
Once the client gets all the required file blocks, it will combine these blocks
to form a file.
While serving read request of the client, HDFS selects the replica which is
closest to the client. This reduces the read latency and the bandwidth
consumption. Therefore, that replica is selected which resides on the same
rack as the reader node, if possible.
With this, you should have a pretty good idea about the Apache Hadoop HDFS architecture.
Hadoop Configuration
core-site.xml contains the configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce. The hdfs-site.xml file contains the configuration settings for the HDFS daemons: the Name Node, the Secondary Name Node, and the Data Nodes. Here, we can configure hdfs-site.xml to specify the default block replication and other HDFS settings.
mapred-site.xml:
This file contains the configuration settings for the MapReduce framework and its daemons. Daemon heap sizes are tuned through environment variables (typically in hadoop-env.sh), for example:
export HADOOP_DATANODE_HEAPSIZE="128"
export HADOOP_TASKTRACKER_HEAPSIZE="512"
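For reference, a minimal mapred-site.xml on Hadoop 2.x usually just tells MapReduce to run on YARN (the property name below assumes Hadoop 2.x):

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
    <!-- run MapReduce jobs on the YARN resource manager -->
  </property>
</configuration>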
Facebook, for example, runs a Hadoop machine cluster with 2,400 cores and about 3 PB of raw storage; each commodity node has 8 cores and 12 TB of storage. Facebook uses the streaming and Java APIs a lot and has used Hive to build a higher-level data warehousing framework. They have also developed a FUSE application over HDFS.
The NameNode has a high memory requirement and will have a lot of RAM, but it does not need much hard disk space. The DataNodes, in contrast, are supposed to store the data, so they need large disks and usually have multiple drives.
Map Reduce Framework
Consider the following input data for a MapReduce (word count) program in Big Data:
Welcome to Hadoop Class
Hadoop is good
Hadoop is bad
The expected final output of the MapReduce word count is:
bad 1
Class 1
good 1
Hadoop 3
is 2
to 1
Welcome 1
The data goes through the following phases of MapReduce in Big Data
Input Splits:
Mapping
Shuffling
Reducing
In this phase, output values from the Shuffling phase are aggregated.
This phase combines values from Shuffling phase and returns a single
output value. In short, this phase summarizes the complete dataset.
In our example, this phase aggregates the values from Shuffling phase
i.e., calculates total occurrences of each word.
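A minimal sketch of the mapper and reducer for this word count example, written against the standard org.apache.hadoop.mapreduce API (the class names are assumptions for this sketch):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapping phase: emit (word, 1) for every word in each input line.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reducing phase: after shuffle and sort, sum the counts for each word.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}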
One map task is created for each split which then executes map
function for each record in the split.
It is always beneficial to have multiple splits, because the time taken to process a split is small compared to the time taken to process the whole input. When the splits are smaller, the processing is better load-balanced, since we process the splits in parallel.
However, it is also not desirable to have splits that are too small. When splits are too small, the overhead of managing the splits and of map task creation begins to dominate the total job execution time.
For most jobs, it is better to make the split size equal to the size of an HDFS block (64 MB by default in Hadoop 1.x, 128 MB in Hadoop 2.x).
Execution of map tasks results into writing output to a local disk
on the respective node and not to HDFS.
Reason for choosing local disk over HDFS is, to avoid replication
which takes place in case of HDFS store operation.
Map output is intermediate output which is processed by reduce
tasks to produce the final output.
Once the job is complete, the map output can be thrown away. So,
storing it in HDFS with replication becomes overkill.
In the event of node failure, before the map output is consumed
by the reduce task, Hadoop reruns the map task on another node
and re-creates the map output.
Reduce task doesn’t work on the concept of data locality. An
output of every map task is fed to the reduce task. Map output is
transferred to the machine where reduce task is running.
On this machine, the output is merged and then passed to the
user-defined reduce function.
Unlike the map output, the reduce output is stored in HDFS (the first replica is stored on the local node and the other replicas are stored on off-rack nodes). So, writing the reduce output does consume network bandwidth, but only as much as a normal HDFS write pipeline consumes.
Hadoop divides the job into tasks. There are two types of tasks: Map tasks (splits and mapping) and Reduce tasks (shuffling and reducing), as mentioned above.
A job is divided into multiple tasks which are then run onto
multiple data nodes in a cluster.
It is the responsibility of the job tracker to coordinate the activity by scheduling tasks to run on different data nodes.
Execution of an individual task is then looked after by the task tracker, which resides on every data node executing part of the job.
The task tracker's responsibility is to send the progress report to the job tracker. In addition, the task tracker periodically sends a 'heartbeat' signal to the Jobtracker so as to notify it of the current state of the system.
Thus job tracker keeps track of the overall progress of each job.
In the event of task failure, the job tracker can reschedule it on a
different task tracker.
Role of Hbase in Big Data Processing:
HBase provides low latency random read and write access to petabytes of data by
distributing requests from applications across a cluster of hosts. Each host has access to data
in HDFS and S3, and serves read and write requests in milliseconds.
Since 1970, the RDBMS has been the solution for data storage and maintenance related problems. After the advent of big data, companies realized the benefit of processing big data and started opting for solutions like Hadoop.
Hadoop uses a distributed file system for storing big data, and MapReduce to process it. Hadoop excels in storing and processing huge volumes of data in various formats: arbitrary, semi-structured, or even unstructured.
Limitations of Hadoop
Hadoop can perform only batch processing, and data can be accessed only in a sequential manner. That means the entire dataset has to be scanned even for the simplest of jobs.
A huge dataset, when processed, results in another huge dataset, which should also be processed sequentially. At this point, a new solution is needed to access any point of data in a single unit of time (random access).
What is HBase?
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is
an open-source project and is horizontally scalable.
HBase is a data model, similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop Distributed File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write access to data
in the Hadoop File System.
One can store the data in HDFS either directly or through HBase. Data consumer
reads/accesses the data in HDFS randomly using HBase. HBase sits on top of the Hadoop File
System and provides read and write access.
HDFS vs. HBase:
HDFS | HBase
HDFS is a distributed file system suitable for storing large files. | HBase is a database built on top of HDFS.
HDFS does not support fast individual record lookups. | HBase provides fast lookups for larger tables.
HDFS is designed for high-latency batch processing. | HBase provides low-latency access to single rows from billions of records (random access).
HDFS provides only sequential access to data. | HBase internally uses hash tables and provides random access; it stores the data in indexed HDFS files for faster lookups.
(Figure: row-oriented vs. column-oriented storage layout.)
Row-oriented databases are designed for a small number of rows and columns. | Column-oriented databases are designed for huge tables.
HBase is built for wide tables and is horizontally scalable. | An RDBMS is thin, built for small tables, and hard to scale.
Features of HBase
HBase is linearly scalable.
It has automatic failure support.
It provides consistent reads and writes.
It integrates with Hadoop, both as a source and a destination.
It has an easy-to-use Java API for clients (a minimal sketch follows this list).
It provides data replication across clusters.
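As a hedged illustration of the Java client API mentioned above, the sketch below writes one cell and reads it back. The table name "employee", the column family "personal", and the ZooKeeper quorum setting are assumptions made for the example, not something defined in these notes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        config.set("hbase.zookeeper.quorum", "localhost");  // assumed quorum

        try (Connection connection = ConnectionFactory.createConnection(config);
             Table table = connection.getTable(TableName.valueOf("employee"))) {

            // Random write: store one cell for row key "row1".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"),
                          Bytes.toBytes("Raju"));
            table.put(put);

            // Random read: fetch the same cell back by row key.
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("personal"),
                                           Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(value));
        }
    }
}

A Put is a random write addressed by row key, and a Get is a random read of the same row; this is exactly the low-latency access pattern that HBase adds on top of HDFS.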
Applications of HBase
It is used whenever there is a need for write-heavy applications.
HBase is used whenever we need to provide fast random access to available data.
Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
HBase - Architecture
In HBase, tables are split into regions and are served by the region servers. Regions are
vertically divided by column families into “Stores”. Stores are saved as files in HDFS. Shown
below is the architecture of HBase.
Note: The term ‘store’ is used for regions to explain the storage structure.
HBase has three major components: the client library, a master server, and region servers.
Region servers can be added or removed as per requirement.
MasterServer
The master server -
Assigns regions to the region servers and takes the help of Apache ZooKeeper
for this task.
Handles load balancing of the regions across region servers. It unloads the
busy servers and shifts the regions to less occupied servers.
Maintains the state of the cluster by negotiating the load balancing.
Regions
Regions are nothing but tables that are split up and spread across the region servers.
Region server
The region servers have regions that:
communicate with the client and handle data-related operations;
handle read and write requests for all the regions under them;
decide the size of the regions by following the region size thresholds.
When we take a deeper look into the region server, it contains regions and stores, as shown below:
The store contains the MemStore and HFiles. The MemStore is just like a cache memory: anything that is entered into HBase is stored here initially. Later, the data is transferred and saved in HFiles as blocks, and the MemStore is flushed.
ZooKeeper
ZooKeeper is an open-source project that provides services like maintaining configuration information, naming, and providing distributed synchronization.
ZooKeeper has ephemeral nodes representing the different region servers. Master servers use these nodes to discover available servers.
In addition to availability, the nodes are also used to track server failures and network partitions.
Clients discover the region servers they need to talk to via ZooKeeper.
In pseudo-distributed and standalone modes, HBase itself takes care of ZooKeeper.
Hive
Hive is a data warehouse system used to analyze structured data. It is built on top of Hadoop and was developed by Facebook.
Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage. It runs SQL-like queries, called HQL (Hive Query Language), which are internally converted into MapReduce jobs.
Using Hive, we can skip the traditional approach of writing complex MapReduce programs. Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and User-Defined Functions (UDFs).
Features of Hive
The following are the features of Hive:
o It is fast and scalable for batch analytics.
o It provides SQL-like queries (HQL) that are implicitly converted into MapReduce jobs.
o It is capable of analyzing large datasets stored in HDFS.
o It allows data to be organized into partitions and buckets.
o It supports user-defined functions (UDFs) for custom processing.
Limitations of Hive
o Hive is not capable of handling real-time data.
o It is not designed for online transaction processing.
o Hive queries have high latency.
Hive Architecture
The following architecture explains the flow of submission of query into Hive.
Hive Client
Hive allows writing applications in various languages, including Java, Python, and
C++. It supports different types of clients such as:-
o Thrift Server - It is a cross-language service provider platform that serves requests from all programming languages that support Thrift.
o JDBC Driver - It is used to establish a connection between Hive and Java applications. The JDBC driver is present in the class org.apache.hadoop.hive.jdbc.HiveDriver (a minimal usage sketch follows this list).
o ODBC Driver - It allows the applications that support the ODBC protocol to connect
to Hive.
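As a hedged sketch of the JDBC route, the example below connects to Hive from Java and runs a couple of HQL statements. It assumes a HiveServer2 endpoint at localhost:10000 and uses the newer org.apache.hive.jdbc.HiveDriver class; the table name and schema are purely illustrative and are not part of these notes.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; the class named in the notes targets the
        // original HiveServer.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement()) {

            // DDL: create a simple table (hypothetical schema).
            stmt.execute("CREATE TABLE IF NOT EXISTS employee (id INT, name STRING) "
                       + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");

            // HQL query: compiled by Hive into a MapReduce job behind the scenes.
            ResultSet rs = stmt.executeQuery("SELECT name FROM employee LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}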
Hive Services
The following are the services provided by Hive:-
o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries and commands.
o Hive Web User Interface - The Hive Web UI is an alternative to the Hive CLI. It provides a web-based GUI for executing Hive queries and commands.
o Hive MetaStore - It is a central repository that stores all the structural information of the various tables and partitions in the warehouse. It also includes metadata about columns and their types, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.
o Hive Server - It is also referred to as the Apache Thrift Server. It accepts requests from different clients and passes them to the Hive Driver.
o Hive Driver - It receives queries from different sources such as the Web UI, CLI, Thrift, and the JDBC/ODBC drivers. It transfers the queries to the compiler.
o Hive Compiler - The purpose of the compiler is to parse the query and perform semantic analysis on the different query blocks and expressions. It converts HiveQL statements into MapReduce jobs.
o Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG of MapReduce tasks and HDFS tasks. In the end, the execution engine executes the incoming tasks in the order of their dependencies.
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
The following steps describe how Hive interacts with the Hadoop framework:
1 Execute Query
The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2 Get Plan
The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan, or the requirement of the query.
3 Get Metadata
The compiler sends a metadata request to the Metastore.
4 Send Metadata
The Metastore sends the metadata as a response to the compiler.
5 Send Plan
The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
6 Execute Plan
The driver sends the execution plan to the execution engine.
7 Execute Job
Internally, executing the plan means running a MapReduce job: the execution engine submits the job to the JobTracker, which assigns it to TaskTrackers on the data nodes, where the MapReduce job runs.
8 Fetch Result
The execution engine receives the results from the data nodes.
9 Send Results
The execution engine sends those results to the driver.
10 Send Results
The driver sends the results to the Hive interfaces.
PIG
What is Apache Pig?
Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large sets of data by representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig.
To write data analysis programs, Pig provides a high-level language known as Pig Latin. This
language provides various operators using which programmers can develop their own
functions for reading, writing, and processing data.
To analyze data using Apache Pig, programmers need to write scripts using Pig Latin
language. All these scripts are internally converted to Map and Reduce tasks. Apache Pig has
a component known as Pig Engine that accepts the Pig Latin scripts as input and converts
those scripts into MapReduce jobs.
Apache Pig | MapReduce
Any novice programmer with a basic knowledge of SQL can work conveniently with Apache Pig. | Exposure to Java is a must to work with MapReduce.
Apache Pig uses a multi-query approach, thereby reducing the length of the code to a great extent. | MapReduce requires almost 20 times more lines of code to perform the same task.
Pig | SQL
In Apache Pig, a schema is optional; we can store data without designing a schema (fields are then referenced positionally as $0, $1, etc.). | A schema is mandatory in SQL.
The data model in Apache Pig is nested relational. | The data model used in SQL is flat relational.
Apache Pig provides limited opportunity for query optimization. | There is more opportunity for query optimization in SQL.
Apache Pig | Hive
Apache Pig uses a language called Pig Latin. It was originally created at Yahoo. | Hive uses a language called HiveQL. It was originally created at Facebook.
Apache Pig can handle structured, unstructured, and semi-structured data. | Hive is mostly used for structured data.
Parser
Initially, the Pig scripts are handled by the Parser. It checks the syntax of the script, performs type checking, and carries out other miscellaneous checks. The output of the parser is a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes and the data
flows are represented as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out the logical
optimizations such as projection and pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in a sorted order, where they are executed to produce the desired results.
Local Mode
o It executes in a single JVM and is used for development, experimentation, and prototyping.
o Here, files are installed and run using the localhost.
o The local mode works on the local file system. The input and output data are stored in the local file system.
$ pig -x local
MapReduce Mode
o The MapReduce mode is also known as Hadoop Mode.
o It is the default mode.
o In this mode, Pig translates Pig Latin into MapReduce jobs and executes them on the cluster.
o It can be executed against a pseudo-distributed or fully distributed Hadoop installation.
o Here, the input and output data are present on HDFS.
$ pig
Ways to execute Pig Program
These are the following ways of executing a Pig program in local and MapReduce mode:
o Interactive Mode - In this mode, Pig is executed in the Grunt shell. To invoke the Grunt shell, run the pig command. Once the Grunt shell starts, we can enter Pig Latin statements and commands interactively at the command line.
o Batch Mode - In this mode, we can run a script file having a .pig extension. These files contain Pig Latin commands.
o Embedded Mode - In this mode, we can define our own functions. These functions can be called UDFs (User Defined Functions). Here, we use programming languages like Java and Python (see the sketch after this list).
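As a hedged sketch of Embedded Mode, the PigServer class lets a Java program register and run Pig Latin statements directly. The input file name and relation names below are illustrative assumptions.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigExample {
    public static void main(String[] args) throws Exception {
        // Run Pig Latin from Java; use ExecType.MAPREDUCE to run on a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        // Write the relation "words" to the directory "words_out".
        pig.store("words", "words_out");
        pig.shutdown();
    }
}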
Pig Latin
Pig Latin is the data flow language used by Apache Pig to analyze data in Hadoop. It is a textual language that abstracts the programming from the Java MapReduce idiom into a higher-level notation.
() Parentheses can enclose one or more items. They can also be used to indicate the tuple data type.
Example - (10, xyz, (3,6,9))
[] Straight brackets can enclose one or more items. They can also be used to indicate the map data type.
Example - [INNER | OUTER]
{} Curly brackets enclose two or more items. They can also be used to indicate the bag data type.
Example - { block | nested_block }
... The horizontal ellipsis points indicate that you can repeat a portion of the code.
Example - cat path [path ...]
Pig Example
Use case: Using Pig, find the most frequently occurring start letter.
Solution: the steps are listed below; a consolidated sketch of the corresponding statements follows this list.
Case 1: Load the data into a bag named "lines". Each entire line is stored as a single element, line, of type chararray.
Case 2: Tokenize the text in the bag lines; this produces one word per row.
Case 3: Retain the first letter of each word, using the SUBSTRING function to take the first character.
Case 4: Create a bag for each unique character, where the grouped bag contains the same character for each occurrence of that character.
Case 6: Arrange the output according to count, in descending order.
Case 8: Store the result in HDFS. The result is saved in an output directory under the sonoo folder.
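The actual commands for these cases were shown as screenshots in the original source, so the following is only a hedged reconstruction. It runs the whole flow in embedded mode via PigServer; the same Pig Latin statements could equally be typed one by one at the Grunt shell. The input path, relation names, and the output directory "startletter_out" are illustrative assumptions.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class StartLetterExample {
    public static void main(String[] args) throws Exception {
        // Use ExecType.LOCAL to test against the local file system.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Case 1: load each line of the input as a single chararray field.
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");

        // Case 2: tokenize the lines so that we get one word per row.
        pig.registerQuery("tokens = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS token:chararray;");

        // Case 3: keep only the first letter of each word using SUBSTRING.
        pig.registerQuery("letters = FOREACH tokens GENERATE SUBSTRING(token, 0, 1) AS letter;");

        // Case 4: group identical letters together, one bag per distinct letter.
        pig.registerQuery("lettergrp = GROUP letters BY letter;");

        // Count the occurrences of each letter within its group.
        pig.registerQuery("countletter = FOREACH lettergrp GENERATE group AS letter, COUNT(letters) AS cnt;");

        // Case 6: order the counts in descending order.
        pig.registerQuery("orderedcnt = ORDER countletter BY cnt DESC;");

        // Keep the top entry, i.e., the most frequent start letter.
        pig.registerQuery("result = LIMIT orderedcnt 1;");

        // Case 8: store the result in HDFS (output directory is an assumption).
        pig.store("result", "startletter_out");
        pig.shutdown();
    }
}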