HBase

HBase is a distributed, column-oriented database that provides random, real-time read/write access to large datasets. It runs on top of Hadoop and the Hadoop Distributed File System (HDFS): whereas Hadoop on its own supports only sequential batch processing, HBase enables fast random reads and writes through a scalable architecture built from a master server and region servers, with ZooKeeper handling coordination. HBase stores data in tables of rows and columns grouped into column families, and its key-value storage model provides fast lookups, easy scalability, and failover.

HBase

Topics at a glance
⮚Hadoop and its data access limitations
⮚Why HBase?
⮚HBase and its importance in the Hadoop framework
⮚History and architecture of HBase
⮚HBase components and their responsibilities
⮚HBase data storage model
⮚Advantages and disadvantages of HBase
⮚Conclusion of the session
Why HBase?

With the evolution of the internet, the scope of web applications increased:
⮚ Huge volumes of structured and semi-structured data started being generated.
⮚ Semi-structured data includes emails and JSON, XML, and .csv files.
⮚ Loads of semi-structured data were created across the globe.
⮚ Storing and processing this data became a major challenge.
Hadoop and its limitations
⮚Hadoop can perform only batch processing, and data can be accessed only in
a sequential manner.
⮚So if any data needs to be accessed randomly, a new access methodology is
required.

✔New database tools emerged to provide random access to users; within the
Hadoop framework this role is filled by HBase. Examples include:

▪ HBase
▪ Cassandra
▪ CouchDB
▪ Dynamo and MongoDB
HBase
• HBase is an open-source NoSQL database and part of the Hadoop ecosystem.
• It is similar to Google's Bigtable: HBase began as an open-source
implementation of the Bigtable design.
• HBase is written primarily in Java and is intended for real-time Big Data
applications.
• HBase is a distributed, column-oriented, non-relational database management
system that runs on top of the Hadoop Distributed File System (HDFS).
• HBase is a column-oriented database, and its tables are sorted by row key.
• The table schema defines only column families; within a family, data is
stored as key-value pairs.
• It uses log-structured storage with a Write-Ahead Log (WAL).
• It supports fast random access and heavy write workloads.
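The key-value model described above can be sketched in a few lines. This is an illustrative sketch only, not HBase's real API or internals: a row maps column family → qualifier → value, i.e. nested key-value pairs rather than a fixed set of columns. The family and qualifier names are invented for the example.

```python
# Illustrative sketch of HBase's storage model: one row, two column
# families, each holding its own key-value pairs.
row = {
    "personal": {"name": "Amit", "age": "25"},  # column family "personal"
    "job":      {"company": "HCL"},             # column family "job"
}

def get_cell(row_data, family, qualifier):
    """Look up one cell by its (column family, qualifier) coordinate."""
    return row_data[family][qualifier]

print(get_cell(row, "personal", "name"))  # -> Amit
```

Because the schema fixes only the families, a new qualifier such as `row["job"]["designation"]` can be added to any row at any time without altering the table definition.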

How is HBase different from other NoSQL models?

• HBase stores data as key/value pairs in a columnar model, in which all the
columns are grouped together into column families.
• Running HBase on top of Hadoop increases the throughput and performance of
a distributed cluster setup.
• It provides faster random read and write operations.
Features of HBase
• Horizontally scalable: capacity grows by adding nodes, and new columns can
be added to a table at any time.
• Automatic failover: data handling switches automatically to a standby
system in the event of system compromise or failure.
• Integration with the MapReduce framework: HBase tables can serve as the
source and sink of MapReduce jobs, and HBase is built over the Hadoop
Distributed File System.
• It doesn't enforce relationships within your data.
• It is designed to run on a cluster of computers built from commodity
hardware.
• HBase is built for low-latency operations.
History of HBase
• In Nov 2006, Google released the paper on BigTable.
• Feb 2007, Initial HBase prototype was created as a Hadoop
contribution.
• Oct 2007, The first usable HBase along with Hadoop 0.15.0 was
released.
• Jan 2008, HBase became a subproject of Hadoop.
• Oct 2008, HBase 0.18.1 was released.
• Jan 2009, HBase 0.19.0 was released.
• Sept 2009, HBase 0.20.0 was released.
• May 2010, HBase became Apache top-level project.
HBase existence in the Hadoop Ecosystem
HBase Table – To store data

HBase: Keys and Column Families

Each record is divided into Column Families

Each row has a Key

Each column family consists of one or more Columns

Example: Storage Mechanism in HBase

⮚HBase is a column-oriented database.
⮚Data is stored in the form of tables.
HBase Architecture

❖Apache ZooKeeper monitors the system.
❖The HBase Master assigns regions and performs load balancing.
❖Region Servers serve data for reads and writes.
❖Region Servers run on the different machines of the Hadoop cluster.
❖Each Region Server consists of Regions, an HLog, Stores, MemStores, and
various files.
❖All of this sits on top of the HDFS storage system.
HBase - Components

HBase has three major components:

1. Master server (HMaster)
2. Region Server
3. ZooKeeper
1. HMaster

• HMaster is the implementation of the Master server in HBase.
• It acts as a monitoring agent for all Region Server instances in the cluster
and serves as the interface for all metadata changes.
• In a distributed cluster environment, the Master runs on the NameNode.
Responsibilities of HMaster in the HBase architecture

a. Coordinating the Region Servers:
▪ Assigns regions on startup.
▪ Handles recovery and load balancing.
▪ Monitors all RegionServer instances in the HBase cluster.

b. Admin functions:
When a client wants to change a schema or perform metadata operations,
HMaster takes responsibility for them, for example:
• Table (createTable, removeTable, enable, disable)
• ColumnFamily (addColumn, modifyColumn)
• Region (move, assign)
2. Regions & Region Server

⮚A table is split horizontally into several regions according to row-key
ranges (start key to end key).
⮚Each such group of rows is called a region, and regions are assigned to
nodes in the cluster; the node managing them is called a Region Server.
• Region Servers are responsible for processing data read and write requests.
• Each Region Server can manage about 1,000 regions.
⮚HRegionServer is the Region Server implementation.
⮚It is responsible for serving and managing the regions (data) present in a
distributed cluster.
⮚Region Servers run on the DataNodes of the Hadoop cluster.
How regions split

Emp_id   Personal Details    Education_Details        Job Details
         Name     Age        Graduate   Percentage    Company Name   Designation
1        Amit     25         BSc        76            HCL            Project Lead
2        Sumit    30         BTech      80            TCS            Project Manager
3        Varsha   35         MTech      75            Wipro          Project Engineer

[Figure: the table above split by row key into regions, each served by a
Region Server on top of HDFS]
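The horizontal splitting described above can be sketched as a tiny recursive function. This is a hypothetical illustration, not HBase's split policy: the threshold and the midpoint split are invented for the example; real HBase splits a region at a middle row key once the region's store files exceed a configured size.

```python
# Toy sketch of horizontal region splitting: a region whose sorted row
# keys exceed a threshold is split at the middle key into two daughters.
def split_region(rows, max_rows=2):
    """rows: sorted list of row keys held by one region."""
    if len(rows) <= max_rows:
        return [rows]                      # small enough: one region
    mid = len(rows) // 2                   # middle row key = split point
    return split_region(rows[:mid], max_rows) + split_region(rows[mid:], max_rows)

regions = split_region(["1", "2", "3", "4"])
print(regions)  # -> [['1', '2'], ['3', '4']]
```

Each resulting list corresponds to one region's contiguous (start key, end key) range, which is what gets assigned to a Region Server.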
HBase Regions & Region Server (contd.)

⮚When an HBase Region Server receives write and read requests from a client,
it routes the request to the specific region where the relevant column family
resides.
⮚Clients can contact HRegion Servers directly; HMaster's permission is not
required for client communication with HRegion Servers.
⮚The client needs HMaster's help only for operations involving metadata and
schema changes.

❖HRegion Servers perform the following functions:

▪ Hosting and managing regions
▪ Splitting regions automatically
▪ Handling read and write requests
▪ Communicating with the client directly
Region Server
⮚A Region Server runs on an HDFS DataNode and is responsible for processing
data read and write requests.
⮚If a client needs data, it interacts directly with the Region Server.
⮚Regions are the pieces of tables that are split up and spread across the
Region Servers.

Components of a Region Server:

⮚WAL (Write-Ahead Log): a file on the distributed file system that records
new data before it is written, for recovery.
⮚BlockCache: the read cache. The most frequently accessed data is kept in
memory in an LRU (Least Recently Used) cache.
⮚MemStore: the write cache, held in memory.
⮚HFile: stores HBase data on hard disk (HDFS).
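The LRU behaviour of the BlockCache can be sketched with a small ordered map. This is a minimal illustration, not HBase's BlockCache implementation: the capacity of 2 blocks and the block names are invented for the example.

```python
from collections import OrderedDict

class BlockCache:
    """Tiny LRU read-cache sketch (illustrative, fixed capacity)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # oldest entry first

    def get(self, key):
        if key not in self.blocks:
            return None               # cache miss -> caller reads from HFile
        self.blocks.move_to_end(key)  # mark as most recently used
        return self.blocks[key]

    def put(self, key, block):
        self.blocks[key] = block
        self.blocks.move_to_end(key)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used

cache = BlockCache(capacity=2)
cache.put("blk1", b"...")
cache.put("blk2", b"...")
cache.get("blk1")           # touch blk1, so blk2 is now the LRU entry
cache.put("blk3", b"...")   # over capacity: evicts blk2
print(cache.get("blk2"))    # -> None
```

This is why "most frequently accessed" data survives in the cache: every hit refreshes an entry's position, and only the coldest block is evicted.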
3. HBase - ZooKeeper
• HBase uses ZooKeeper to coordinate shared state information among the
members of a distributed system.
• The active HMaster and the Region Servers each connect to ZooKeeper with a
session.
• For active sessions, ZooKeeper maintains ephemeral nodes using heartbeats.
• ZooKeeper keeps track of which servers are alive and available, and
provides notification when a server fails.
• Ephemeral nodes are znodes that exist as long as the session that created
them is active; the znode is deleted when the session ends.
• ZooKeeper uses a consensus protocol to ensure the consistency of the
distributed state.
• Each Region Server creates an ephemeral node; HMaster monitors these nodes
to discover available Region Servers.
• The active HMaster also sends heartbeats to ZooKeeper.
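The ephemeral-node mechanism can be modelled in a few lines. This is a toy model, not ZooKeeper's API: the znode paths, the 5-second timeout, and the explicit `now` clock are invented for the example; real ZooKeeper ties znode lifetime to the client session.

```python
# Toy model of ephemeral znodes: a node lives only while its owning
# session keeps heartbeating within the session timeout.
SESSION_TIMEOUT = 5.0
ephemeral = {}  # znode path -> time of last heartbeat

def heartbeat(path, now):
    """A server's session pings ZooKeeper, keeping its znode alive."""
    ephemeral[path] = now

def live_servers(now):
    """Servers whose last heartbeat is within the timeout window."""
    return [p for p, last in ephemeral.items() if now - last <= SESSION_TIMEOUT]

heartbeat("/rs/node1", now=0.0)
heartbeat("/rs/node2", now=4.0)
print(live_servers(now=6.0))  # node1's session has timed out
```

When a znode disappears this way, the watcher (HMaster) is notified and can reassign that server's regions, which is the failure-detection path the bullets above describe.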
Working Process of Zookeeper

1. Active HMaster
2. Inactive HMaster
❖ HBase META Table
• The META table is a special HBase catalog table. It holds the location of
the regions in the HBase cluster.
• It keeps a list of all regions in the system.
• The structure of the .META. table is as follows:
• Key: region start key, region id
• Values: RegionServer
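The catalog lookup that clients perform against .META. can be sketched as follows. This is an illustrative sketch, not the real catalog format: the table name, start keys, region ids, and server names are all made up, and the lookup rule is simply "the entry with the largest start key not greater than the row key".

```python
# Illustrative .META. contents: key = (table, region start key, region id),
# value = the Region Server hosting that region.
meta = {
    ("emp", "",      "r1"): "rs-node-1",   # "" start key = beginning of table
    ("emp", "emp_2", "r2"): "rs-node-2",
}

def find_region_server(table_name, row_key):
    """Pick the region whose start key is the largest one <= row_key."""
    candidates = [(start, rs) for (t, start, _rid), rs in meta.items()
                  if t == table_name and start <= row_key]
    return max(candidates)[1]

print(find_region_server("emp", "emp_1"))  # -> rs-node-1
```

A client caches this answer, so subsequent reads and writes for the same key range go straight to the Region Server without consulting the catalog again.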
HBase table parameters

• Tables: Data is stored in table format in HBase.
• Row Key: Row keys are used to search for records, which makes searches
fast.
• Column Families: Various columns are combined into a column family. Column
families are stored together, which makes the searching process faster
because data belonging to the same column family can be accessed together in
a single seek.
• Column Qualifiers: Each column's name is known as its column qualifier.
• Cell: Data is stored in cells.
• Timestamp: A timestamp is a combination of date and time. Whenever data is
stored, it is stored with its timestamp.
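The timestamp parameter makes every cell versioned, as the following sketch shows. This is illustrative only, not HBase's API: the integer timestamps and cell names are invented, and real HBase uses millisecond epoch times and keeps a configurable number of versions per cell.

```python
# Illustrative sketch: every write carries a timestamp, and a read
# returns the newest version of the cell by default.
cell_versions = {}  # (row, family, qualifier) -> {timestamp: value}

def put(row, family, qualifier, value, ts):
    cell_versions.setdefault((row, family, qualifier), {})[ts] = value

def get_latest(row, family, qualifier):
    versions = cell_versions[(row, family, qualifier)]
    return versions[max(versions)]  # highest timestamp = newest version

put("u1", "account", "last_login", "05.30 PM", ts=1)
put("u1", "account", "last_login", "08.30 PM", ts=2)
print(get_latest("u1", "account", "last_login"))  # -> 08.30 PM
```

Because old versions are kept alongside new ones, a write never overwrites data in place; it simply adds a newer timestamped value.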
HDFS vs. HBase

HDFS                                          HBase
HDFS is a Java-based file system used         HBase is a Java-based NoSQL
for storing large data sets.                  database.

HDFS has a rigid architecture that does       HBase allows dynamic changes and
not allow changes; it doesn't facilitate      can be used for standalone
dynamic storage.                              applications.

HDFS is ideally suited for write-once,        HBase is ideally suited for random
read-many-times use cases.                    writes and reads of data stored in
                                              HDFS.
HBase - Read
• A read in HBase must be reconciled across the HFiles, the MemStore, and
the BlockCache.
• The BlockCache is designed to keep frequently accessed data from the
HFiles in memory, so as to avoid disk reads.
• Each column family has its own BlockCache.
Block: the smallest indexed unit of data and the smallest unit of data that
can be read from disk; the default size is 64 KB.
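The reconciliation order of a read can be sketched as a simple fallthrough. This is a minimal illustration under assumed data, not HBase's real read path (which merges versions across all three sources rather than stopping at the first hit): the row keys and values below are invented.

```python
# Sketch of serving a read: check the in-memory MemStore first, then the
# BlockCache, and only fall back to the on-disk HFiles on a miss.
memstore    = {"row2": "fresh-value"}                       # unflushed writes
block_cache = {"row1": "cached-value"}                      # hot HFile blocks
hfiles      = {"row1": "old-value", "row3": "disk-value"}   # data on HDFS

def read(row_key):
    if row_key in memstore:
        return memstore[row_key]
    if row_key in block_cache:
        return block_cache[row_key]
    value = hfiles.get(row_key)       # disk read
    if value is not None:
        block_cache[row_key] = value  # cache the block for the next read
    return value

print(read("row3"))  # -> disk-value, and "row3" is now in the block cache
```

Only the last step touches disk, which is why a warm BlockCache makes repeated reads of the same rows cheap.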
HBase - Write

When a write is made, by default it goes into two places:
⮚the write-ahead log (WAL), also called the HLog
⮚the in-memory write buffer, the MemStore
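The two write destinations, plus the flush to disk, can be sketched as follows. This is illustrative only: the 2-entry flush threshold is an invented toy value (real MemStores flush when they reach a configured size, typically 128 MB), and the dict-of-sorted-rows stands in for an immutable HFile.

```python
# Sketch of the write path: append to the WAL for durability, buffer in
# the MemStore, and flush the MemStore to an immutable HFile when full.
wal = []        # write-ahead log: replayed after a crash
memstore = {}   # in-memory write buffer
hfiles = []     # immutable, sorted files on disk
FLUSH_SIZE = 2  # assumed tiny threshold for the example

def write(row_key, value):
    wal.append((row_key, value))   # 1. durable log entry first
    memstore[row_key] = value      # 2. then the in-memory buffer
    if len(memstore) >= FLUSH_SIZE:
        hfiles.append(dict(sorted(memstore.items())))  # flush, sorted by key
        memstore.clear()

write("row1", "a")
write("row2", "b")  # reaches the threshold and triggers a flush
print(hfiles)       # -> [{'row1': 'a', 'row2': 'b'}]
```

The WAL is what makes this safe: if the server crashes before a flush, the buffered writes are recovered by replaying the log.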
Advantages of HBase
❖ HBase is designed to store denormalized data.
❖ HBase supports automatic partitioning.
❖ Strong consistency model: once a write returns, all readers see the same
value.
❖ Scales automatically:
– When data grows too large, regions split automatically.
– It uses HDFS to spread and replicate data.
❖ Built-in recovery: it uses the Write-Ahead Log for recovery.
❖ Integrated with Hadoop.
❖ HBase is schema-less: no fixed data model needs to be defined.
❖ HBase can perform random read and write operations.
❖ HBase provides data replication across clusters for higher availability.
❖ It features random access (via internal indexes) to data stored in HDFS
files, for faster lookups and searching.
Disadvantages of HBase

• Single point of failure: if the HMaster goes down, the complete cluster
fails and no work/task can be performed.
• It cannot perform SQL-style functions and doesn't support SQL syntax.
• It does not contain a query optimizer.
• It does not support transactions.
• Business continuity reliability:
– Write-Ahead Log replay is very slow.
– Crash recovery is also slow and complex.
• Joins and normalization are very difficult to perform.
• Storing large binary data is very difficult.
Real-Time Example of HBase: Facebook

How Facebook uses HBase to store user data:

User Account ID | Account Type (Personal/Business) | Type of Contents Posted | Posted for (Public/Private) | Time Stamp | Violating Community Standards | Last Login Activity Time of Account
Rahul3@fb | Personal | Image + Text | Public | 20/08/2022 08.45.00 PM | No | 05.30 PM
ABZ@fb | Business | Text + Video | Public | 25/08/2022 06.45.00 PM | No | 08.30 PM
Tarun@fb | Personal | Text + Video | Public | 25/08/2022 06.45.00 PM | Yes | 08.30 PM

User Account ID | Account Type (Personal/Business) | Type of Contents Posted | Action Required | Account Suspended | Account Blocked | Remarks
Tarun@fb | Personal | Text + Video (hateful content) | Yes | For a period of 2 weeks, etc. | Yes | Violating community standards
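A table like this maps naturally onto the HBase model: the account ID is the row key, and the profile and moderation columns form separate column families. The sketch below is hypothetical; the family and qualifier names are invented for illustration and are not Facebook's actual schema.

```python
# Illustrative mapping of the table above onto HBase's row-key /
# column-family model: one row per account ID, two families.
users = {
    "Tarun@fb": {
        "profile":    {"type": "Personal", "content": "Text + Video"},
        "moderation": {"violating": "Yes", "blocked": "Yes"},
    },
    "Rahul3@fb": {
        "profile":    {"type": "Personal", "content": "Image + Text"},
        "moderation": {"violating": "No", "blocked": "No"},
    },
}

def flagged_accounts(table):
    """Scan for rows whose moderation family marks a violation."""
    return [row_key for row_key, families in table.items()
            if families["moderation"]["violating"] == "Yes"]

print(flagged_accounts(users))  # -> ['Tarun@fb']
```

Grouping the moderation columns into their own family means a scan like this reads only that family's data, which is the "single seek per column family" benefit described earlier.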
Any Queries?
