


Let us start with the history of HBase and see how HBase has evolved over a period of time.

 Apache HBase is modelled after Google’s BigTable, which is used to collect data and
serve requests for various Google services like Maps, Finance, Earth, etc.
 Apache HBase began as a project by the company Powerset for natural language
search, which was handling massive and sparse data sets.
 Apache HBase was first released in February 2007. Later, in January 2008, HBase
became a subproject of Apache Hadoop.
 In 2010, HBase became a top-level Apache project.

Introduction to HBase

HBase is an open source, multidimensional, distributed, scalable NoSQL database written
in Java. HBase runs on top of HDFS (Hadoop Distributed File System) and provides
BigTable-like capabilities to Hadoop. It is designed to provide a fault-tolerant way of
storing large collections of sparse data sets.

HBase achieves high throughput and low latency by providing fast read/write access to
huge data sets. Therefore, HBase is the choice for applications which require fast, random
access to large amounts of data.

It provides compression, in-memory operations and Bloom filters (a probabilistic data
structure which tells whether a value is possibly present in a set or definitely absent) to
fulfill the requirement of fast and random reads and writes.
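
To make the Bloom filter idea concrete, here is a minimal Python sketch. It is not HBase's implementation; the class name, the bit-array size and the number of hash functions are assumptions chosen only for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizes are illustrative assumptions, not HBase defaults.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bloom = BloomFilter()
for row_key in ("row-001", "row-002", "row-003"):
    bloom.add(row_key)

print(bloom.might_contain("row-002"))  # True: added keys are always reported
```

A "definitely absent" answer lets a read skip a file without touching the disk, which is where the speed-up comes from.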

NoSQL Databases

NoSQL means Not only SQL. NoSQL databases are modeled so that they can represent data
in formats other than tables, unlike relational databases. They use different formats to
represent data, and thus there are different types of NoSQL databases based
on their representation format. Most NoSQL databases favor availability and speed
over consistency. Now, let us move ahead and understand the different types of
NoSQL databases and their representation formats.

Key-Value stores:
It is a schema-less database which contains keys and values. Each key points to a value,
which is an array of bytes and can be a string, BLOB, XML, etc. For example, Lamborghini is
a key and can point to the values Gallardo, Aventador, Murciélago, Reventón, Diablo,
Huracán, Veneno, Centenario, etc.

Key-Value store databases: Aerospike, Couchbase, Dynamo, FairCom c-treeACE,
FoundationDB, HyperDex, MemcacheDB, MUMPS, Oracle NoSQL Database,
OrientDB, Redis, Riak, Berkeley DB.

Key-value stores handle size well and are good at processing a constant stream of
read/write operations with low latency. This makes them a good fit for user preference
and profile stores; product recommendations, where the latest items viewed on a retailer
website drive future customer product recommendations; and ad servicing, where
customer shopping habits result in customized ads, coupons, etc. for each customer
in real time.

Document Oriented:
It follows the same key-value model, but the value is semi-structured, like XML, JSON or
BSON. These structures are considered documents.

Document Based databases: Apache CouchDB, Clusterpoint, Couchbase,
DocumentDB, HyperDex, IBM Domino, MarkLogic, MongoDB, OrientDB, Qizx.

Document stores support a flexible schema, fast reads and writes, and partitioning, which
makes them suitable for creating user databases in various services like Twitter,
e-commerce websites, etc.

Column Oriented:
In this type of database, data is stored in cells grouped by column rather than by row.
Columns are logically grouped into column families, which can be created either during
schema definition or at runtime.
These databases store all the cells corresponding to a column as a contiguous disk
entry, thus making access and search much faster.

Column Based Databases: HBase, Accumulo, Cassandra, Druid, Vertica.

They support huge storage and allow fast read/write access over it. This makes
column-oriented databases suitable for storing customer behavior on e-commerce
websites, financial systems like Google Finance and stock market data, Google Maps, etc.

Graph Oriented:
It is a flexible graph representation, unlike the tabular model of SQL. These databases
address scalability problems easily, as they consist of nodes and edges which can be
extended according to the requirements.

Graph based databases: AllegroGraph, ArangoDB, InfiniteGraph, Apache Giraph,
MarkLogic, Neo4J, OrientDB, Virtuoso, Stardog.

These are mainly used in fraud detection, real-time recommendation engines (in most
cases e-commerce), master data management (MDM), network and IT operations,
identity and access management (IAM), etc.

HBase and Cassandra are the two famous column-oriented databases. So, now taking it to a
higher level, let us compare and understand the architectural and working differences
between HBase and Cassandra.

HBase Tutorial: HBase VS Cassandra

 HBase is modelled on BigTable (Google), while Cassandra is based on Dynamo
(Amazon) and was initially developed at Facebook.
 HBase leverages Hadoop infrastructure (HDFS, ZooKeeper), while Cassandra evolved
separately, but you can combine Hadoop and Cassandra as per your needs.
 HBase has several components which communicate with each other, like the HBase
HMaster, ZooKeeper, NameNode and Region Servers, while Cassandra has a single node
type, in which all nodes are equal and perform all functions. Any node can be the
coordinator; this removes the single point of failure.
 HBase is optimized for reads and supports single writes, which leads to strict
consistency. HBase supports range-based scans, which makes the scanning process
faster, whereas Cassandra supports single-row reads and maintains eventual
consistency.
 Cassandra does not support range-based row scans, which slows the scanning
process as compared to HBase.
 HBase supports ordered partitioning, in which rows of a column family are stored in
RowKey order, whereas in Cassandra ordered partitioning is a challenge. Due to
RowKey partitioning, the scanning process is faster in HBase as compared to
Cassandra.
 HBase does not support read load balancing: one Region Server serves the read
request, and the replicas are only used in case of failure. Cassandra, on the other hand,
supports read load balancing and can read the same data from various nodes, which can
compromise consistency.
 In terms of the CAP (Consistency, Availability & Partition Tolerance) theorem, HBase
favors Consistency and Partition Tolerance, while Cassandra focuses on Availability and
Partition Tolerance.

Features of HBase

 Atomic read and write: At the row level, HBase provides atomic reads and writes. That
is, a write to a row is applied either completely or not at all, and a concurrent read
of that row never observes a partially applied write.
 Consistent reads and writes: HBase provides consistent reads and writes due to the
above feature.
 Linear and modular scalability: As data sets are distributed over HDFS, HBase is
linearly scalable across various nodes, as well as modularly scalable, as the data is
divided across various nodes.
 Automatic and configurable sharding of tables: HBase tables are split into regions,
and these regions are distributed across the cluster. Regions split and are
redistributed as the data grows.
 Easy-to-use Java API for client access: It provides an easy-to-use Java API for
programmatic access.
 Thrift gateway and RESTful Web services: It also supports Thrift and REST APIs for
non-Java front-ends.
 Block Cache and Bloom Filters: HBase supports a Block Cache and Bloom Filters for
high-volume query optimization.
 Automatic failure support: HBase with HDFS provides a WAL (Write Ahead Log) across
clusters, which provides automatic failure support.
 Sorted rowkeys: As searching is done on ranges of rows, HBase stores rowkeys in
lexicographical order. Using these sorted rowkeys and timestamps, we can build an
optimized request.
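
Why sorted rowkeys help range scans can be sketched in a few lines of Python. This is only an illustration of the idea, not the HBase client API; the row keys and the helper `range_scan` are made up for the example.

```python
import bisect

# Row keys kept in lexicographical (byte-wise) order, as HBase stores them.
row_keys = sorted(["user#1003", "user#1001", "order#17", "user#1002", "order#05"])


def range_scan(sorted_keys, start_key, stop_key):
    """Return keys in [start_key, stop_key) using two binary searches."""
    lo = bisect.bisect_left(sorted_keys, start_key)
    hi = bisect.bisect_left(sorted_keys, stop_key)
    return sorted_keys[lo:hi]


# "$" is the character right after "#", so this range covers every "user#..." key.
users = range_scan(row_keys, "user#", "user$")
print(users)  # ['user#1001', 'user#1002', 'user#1003']

# Lexicographical order is not numeric order: "row10" sorts before "row2".
print(sorted(["row2", "row10"]))  # ['row10', 'row2']
```

Because the order is lexicographical, numeric parts of a row key are usually zero-padded so that the stored order matches the intended order.
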

Where can we use HBase?

 We should use HBase where we have large data sets (millions or billions of rows and
columns) and we require fast, random and real-time read and write access over the
data.
 The data sets are distributed across various clusters and we need high scalability to
handle the data.
 The data is gathered from various data sources and is semi-structured,
unstructured, or a combination of both. It can be handled easily with HBase.
 You want to store column-oriented data.


HDFS is a Java-based distributed file system that allows you to store large data across
multiple nodes in a Hadoop cluster. So, HDFS is the underlying storage system for storing
data in the distributed environment. HDFS is a file system, whereas HBase is a database
(similar to how NTFS is a file system and MySQL is a database).

As both HDFS and HBase store any kind of data (i.e. structured, semi-structured and
unstructured) in a distributed environment, let's look at the differences between the HDFS
file system and HBase, a NoSQL database.

 HBase provides low-latency access to small amounts of data within large data sets,
while HDFS provides high-latency operations.
 HBase supports random reads and writes, while HDFS supports WORM (Write Once,
Read Many).
 HDFS is primarily accessed through MapReduce jobs, while HBase is
accessed through shell commands, the Java API, REST, Avro or the Thrift API.

HDFS stores large data sets in a distributed environment and leverages batch processing on
that data. For example, it would help an e-commerce website store the data of millions of
customers, accumulated over a long period of time (4-5 years or more), in a distributed
environment. The website can then run batch processing over that data to analyze customer
behaviors, patterns and requirements, and find out what types of products customers
purchase in which months. HDFS helps to store archived data and execute batch processing
over it.

HBase, in contrast, stores data in a column-oriented manner, where the values of each
column are stored together so that reading becomes faster, enabling real-time processing.
For example, in a similar e-commerce environment, it stores the data of millions of
products. If you search for a product among millions of products, it optimizes the request
and search process, producing the result immediately (or, you can say, in real time).

HBase Data Model

As we know, HBase is a column-oriented NoSQL database. Although it looks similar to a
relational database, which contains rows and columns, it is not a relational database.
Relational databases are row-oriented while HBase is column-oriented. So, let us first
understand the difference between column-oriented and row-oriented databases:

Row-oriented vs column-oriented Databases:

 Row-oriented databases store table records in a sequence of rows, whereas
column-oriented databases store table records in a sequence of columns, i.e. the entries
in a column are stored in contiguous locations on disk.

To better understand it, let us take an example and consider the table below.

If this table is stored in a row-oriented database, it will store the records as shown below:

1, Paul Walker, US, 231, Gallardo

2, Vin Diesel, Brazil, 520, Mustang

In row-oriented databases, data is stored on the basis of rows or tuples, as you can see
above.

While the column-oriented databases store this data as:

1,2, Paul Walker, Vin Diesel, US, Brazil, 231, 520, Gallardo, Mustang

In a column-oriented database, all the values of a column are stored together: the first
column's values are stored together, then the second column's values are stored together,
and the data in the other columns is stored in a similar manner.
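
The two layouts can be reproduced with a short Python sketch, using the two example records from above. The variable names are made up for illustration, and flat lists stand in for bytes on disk.

```python
rows = [
    (1, "Paul Walker", "US", 231, "Gallardo"),
    (2, "Vin Diesel", "Brazil", 520, "Mustang"),
]

# Row-oriented layout: all values of one record sit next to each other.
row_layout = [value for record in rows for value in record]

# Column-oriented layout: all values of one column sit next to each other.
column_layout = [value for column in zip(*rows) for value in column]

print(row_layout)
print(column_layout)

# Reading the third column (index 2) is one contiguous slice in the column layout...
num_rows = len(rows)
third_column = column_layout[2 * num_rows:3 * num_rows]

# ...but a strided walk over every record in the row layout.
third_column_from_rows = row_layout[2::len(rows[0])]

print(third_column, third_column_from_rows)
```

Both reads return the same values; the difference is that the column layout finds them in one contiguous run, which is what makes scans over a single column cheap.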

 When the amount of data is very large, in terms of petabytes or exabytes, we
use the column-oriented approach, because the data of a single column is stored
together and can be accessed faster.
 The row-oriented approach, by comparison, efficiently handles a smaller number of rows
and columns, as a row-oriented database stores data in a structured format.
 When we need to process and analyze a large set of semi-structured or unstructured
data, we use the column-oriented approach. Examples are applications dealing with Online
Analytical Processing, like data mining, data warehousing, applications including
analytics, etc.
 Online Transactional Processing, on the other hand, such as in banking and finance
domains which handle structured data and require transactional properties (ACID
properties), uses the row-oriented approach.

HBase tables have the following components, shown in the image below:

 Tables: Data is stored in a table format in HBase, but here tables are in a
column-oriented format.
 Row Key: Row keys are used to search records, which makes searches fast. You may
be curious to know how. I will explain it in the architecture part later in this
blog.
 Column Families: Various columns are combined in a column family. The columns of a
family are stored together, which makes the searching process faster because data
belonging to the same column family can be accessed together in a single seek.
 Column Qualifiers: Each column’s name is known as its column qualifier.
 Cell: Data is stored in cells. Each cell is uniquely identified by its rowkey,
column family and column qualifier.
 Timestamp: A timestamp is a combination of date and time. Whenever data is stored,
it is stored with its timestamp. This makes it easy to search for a particular version of
the data.

In a simpler and more understandable way, we can say HBase consists of:

 A set of tables
 Each table with column families and rows
 The row key acts as a primary key in HBase.
 Any access to HBase tables uses this primary key.
 Each column qualifier present in HBase denotes an attribute of the
object which resides in the cell.
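
One way to picture this data model is as nested maps: row key, then column family and qualifier, then timestamp, then value. The Python sketch below is a toy in-memory model under that assumption; `ToyTable`, the column family `customer` and the timestamps are hypothetical and not part of any HBase API.

```python
from collections import defaultdict


class ToyTable:
    """Toy model: row key -> 'family:qualifier' -> {timestamp: value}."""

    def __init__(self, column_families):
        # Column families are fixed up front; qualifiers are free-form.
        self.column_families = set(column_families)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, qualifier, value, timestamp):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][f"{family}:{qualifier}"][timestamp] = value

    def get(self, row_key, family, qualifier, timestamp=None):
        versions = self.rows.get(row_key, {}).get(f"{family}:{qualifier}", {})
        if not versions:
            return None
        # Without an explicit timestamp, return the newest version.
        chosen = max(versions) if timestamp is None else timestamp
        return versions.get(chosen)


table = ToyTable(["customer"])
table.put("1", "customer", "name", "Paul Walker", timestamp=100)
table.put("1", "customer", "country", "US", timestamp=100)
table.put("1", "customer", "country", "USA", timestamp=200)

print(table.get("1", "customer", "country"))                 # USA (newest version)
print(table.get("1", "customer", "country", timestamp=100))  # US (older version)
```

Every lookup starts from the row key, and the timestamp selects one version of a cell, which mirrors the roles described in the list above.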

Components of HBase Architecture

HBase has three major components, i.e. the HMaster Server, the HBase Region Servers
(which host the Regions) and ZooKeeper.

The below figure explains the hierarchy of the HBase Architecture. We will talk about each
one of them individually.

Before going to the HMaster, we will understand Regions, as all these servers (HMaster,
Region Server, ZooKeeper) exist to coordinate and manage Regions and perform
various operations inside them. So you may be curious to know what regions are
and why they are so important.


A region contains all the rows between the start key and the end key assigned to that
region. HBase tables can be divided into a number of regions in such a way that all the
columns of a column family are stored in one region. Each region contains its rows in
sorted order.
Many regions are assigned to a Region Server, which is responsible for handling, managing
and executing read and write operations on that set of regions.

So, concluding in a simpler way:

 A table can be divided into a number of regions. A Region is a sorted range of rows
storing data between a start key and an end key.
 A Region has a default size of 256MB (the default in early HBase releases; later
releases raised it to 10GB), which can be configured according to the need.
 A group of regions is served to the clients by a Region Server.
 A Region Server can serve approximately 1000 regions to the clients.
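
Since every region covers a sorted range of row keys, finding the region for a given row key comes down to a binary search over the region start keys. Here is a minimal Python sketch of that idea; the start keys and region names are invented for the example.

```python
import bisect

# Each region covers [its start key, the next region's start key).
# The empty string marks the open start of the first region.
region_start_keys = ["", "g", "n", "t"]
region_names = ["region-1", "region-2", "region-3", "region-4"]


def find_region(row_key):
    """Return the region whose key range contains row_key."""
    index = bisect.bisect_right(region_start_keys, row_key) - 1
    return region_names[index]


print(find_region("apple"))   # region-1
print(find_region("g"))       # region-2 (a start key belongs to its own region)
print(find_region("monkey"))  # region-2
print(find_region("zebra"))   # region-4
```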


As you can see in the image below, the HMaster handles a collection of Region Servers,
which reside on DataNodes. Let us understand how the HMaster does that.

 HBase HMaster performs DDL operations (create and delete tables) and
assigns regions to the Region Servers, as you can see in the above image.
 It coordinates and manages the Region Servers (similar to how the NameNode manages
DataNodes in HDFS).
 It assigns regions to the Region Servers on startup and re-assigns regions to Region
Servers during recovery and load balancing.
 It monitors all the Region Server instances in the cluster (with the help of
ZooKeeper) and performs recovery activities whenever any Region Server is down.
 It provides an interface for creating, deleting and updating tables.

ZooKeeper – The Coordinator

The image below explains ZooKeeper’s coordination mechanism.

 ZooKeeper acts like a coordinator inside the HBase distributed environment. It helps in
maintaining server state inside the cluster by communicating through sessions.
 Every Region Server, along with the HMaster Server, sends a continuous heartbeat at
regular intervals to ZooKeeper, which checks which servers are alive and available, as
mentioned in the above image. It also provides server failure notifications so that
recovery measures can be executed.
 Referring to the above image, you can see there is an inactive server, which acts
as a backup for the active server. If the active server fails, it comes to the rescue.
 The active HMaster sends heartbeats to ZooKeeper while the inactive HMaster
listens for the notifications sent by the active HMaster. If the active HMaster fails to
send a heartbeat, its session is deleted and the inactive HMaster becomes active.
 If a Region Server fails to send a heartbeat, its session expires and all
listeners are notified about it. The HMaster then performs suitable recovery actions,
which we will discuss later in this blog.
 ZooKeeper also maintains the .META Server’s path, which helps any client
in searching for any region. The client first has to check with the .META Server to which
Region Server a region belongs, and it gets the path of that Region Server.
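
The heartbeat and failover behavior described above can be simulated with a small Python sketch. It is a toy model under loud assumptions: `ToyCoordinator` and `ToyMasters` are hypothetical stand-ins, not ZooKeeper or HBase classes, time is passed in as plain numbers, and the session timeout of 3 is arbitrary.

```python
class ToyCoordinator:
    """Tracks sessions by last heartbeat time; a stand-in for ZooKeeper."""

    def __init__(self, session_timeout):
        self.session_timeout = session_timeout
        self.last_heartbeat = {}
        self.listeners = []

    def heartbeat(self, server, now):
        self.last_heartbeat[server] = now

    def expire_sessions(self, now):
        """Drop servers whose heartbeat is too old and notify listeners."""
        expired = [server for server, seen in self.last_heartbeat.items()
                   if now - seen > self.session_timeout]
        for server in expired:
            del self.last_heartbeat[server]
            for listener in self.listeners:
                listener(server)
        return expired


class ToyMasters:
    """One active HMaster plus a standby that takes over on expiry."""

    def __init__(self, active, standby):
        self.active, self.standby = active, standby
        self.needs_recovery = []  # failed Region Servers to recover

    def on_expired(self, server):
        if server == self.active:
            # The inactive HMaster becomes the active one.
            self.active, self.standby = self.standby, None
        else:
            self.needs_recovery.append(server)


zk = ToyCoordinator(session_timeout=3)
masters = ToyMasters(active="hmaster-a", standby="hmaster-b")
zk.listeners.append(masters.on_expired)

for server in ("hmaster-a", "hmaster-b", "rs-1", "rs-2"):
    zk.heartbeat(server, now=0)

# At time 2 only hmaster-b and rs-1 are still sending heartbeats.
zk.heartbeat("hmaster-b", now=2)
zk.heartbeat("rs-1", now=2)

print(zk.expire_sessions(now=4))  # ['hmaster-a', 'rs-2']
print(masters.active)             # hmaster-b
print(masters.needs_recovery)     # ['rs-2']
```
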

Meta Table

 The META table is a special HBase catalog table. It maintains a list of all the
regions in the HBase storage system, together with the Region Servers hosting them, as you
can see in the above image.
 Looking at the figure, you can see that the .META file maintains the table in the form of
keys and values. The key represents the start key of the region and its id, whereas the
value contains the path of the Region Server.

Components of Region Server

The image below shows the components of a Region Server. Now, I will discuss them
separately.

A Region Server maintains various regions running on top of HDFS. The components of a
Region Server are:

 WAL: As you can conclude from the above image, the Write Ahead Log (WAL) is a file
attached to every Region Server inside the distributed environment. The WAL stores
the new data that hasn’t yet been persisted or committed to permanent storage. It
is used in case of failure to recover the data sets.
 Block Cache: From the above image, it is clearly visible that the Block Cache resides at
the top of the Region Server. It stores the frequently read data in memory. When
the cache is full, the least recently used data is removed from the
BlockCache.
 MemStore: It is the write cache. It stores all the incoming data before committing it
to the disk or permanent storage. There is one MemStore for each column family in
a region. As you can see in the image, there are multiple MemStores for a region
because each region contains multiple column families. The data is sorted in
lexicographical order before it is committed to the disk.
 HFile: From the above figure, you can see that HFiles are stored on HDFS. They store the
actual cells on disk. The MemStore commits its data to an HFile when the size of the
MemStore exceeds its configured threshold.
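
The interplay of WAL, MemStore and HFile on the write path can be sketched as a toy Python class. This is a simplification under stated assumptions: `ToyRegionStore` is hypothetical, it models a single column family, the flush threshold counts cells rather than bytes, and plain lists stand in for files on HDFS.

```python
class ToyRegionStore:
    """Toy write path for one column family: WAL append, MemStore, HFile flush."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # number of cells (assumption)
        self.wal = []       # append-only log of edits not yet flushed
        self.memstore = {}  # in-memory write buffer
        self.hfiles = []    # immutable, sorted "files" (lists of pairs here)

    def put(self, row_key, value):
        self.wal.append((row_key, value))  # 1. log first, for recovery
        self.memstore[row_key] = value     # 2. then buffer in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Cells are written out in sorted row-key order.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore.clear()
        self.wal.clear()  # flushed edits no longer need replaying

    def get(self, row_key):
        if row_key in self.memstore:
            return self.memstore[row_key]
        for hfile in reversed(self.hfiles):  # newest file first
            for key, value in hfile:
                if key == row_key:
                    return value
        return None

    def recover(self):
        """Rebuild the MemStore from the WAL after a simulated crash."""
        self.memstore = dict(self.wal)


store = ToyRegionStore(flush_threshold=3)
store.put("row-b", "2")
store.put("row-a", "1")
store.put("row-c", "3")   # third cell triggers a flush to a sorted HFile
store.put("row-d", "4")   # stays in the MemStore and the WAL

store.memstore.clear()    # simulate losing memory in a crash
store.recover()           # replay the WAL

print(store.hfiles)        # [[('row-a', '1'), ('row-b', '2'), ('row-c', '3')]]
print(store.get("row-d"))  # 4 (recovered from the WAL)
```

Writing to the log before the in-memory buffer is what allows unflushed data to be rebuilt after a failure.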
