Lecture - 18
Design of HBase
1. What is HBase?
2. HBase architecture
3. HBase components
4. Data model
5. HBase storage hierarchy
6. Cross-data center replication
7. Auto sharding and distribution
8. Bloom filter and fold, store and shift.
Let us see some of the important aspects of the HBase architecture. So, HBase has region servers, and these region servers handle the regions; there is one HBase master, and ZooKeeper interacts with the HBase master and with the other components. HBase also deals with the data nodes of HDFS. So, the HBase master has to communicate with the region servers and with ZooKeeper; we will see this in more detail. In the HBase architecture, a table is split into regions and served by the region servers. Regions are vertically divided by column families into 'stores', which we will discuss later on, and stores are saved as files on HDFS. HBase utilizes ZooKeeper as its distributed coordination service.
So, this is the typical layout of the data model, which is also called the 'Column Families' view. Data stored in HBase is located by its "rowkey"; the rowkey is analogous to the primary key in a relational database management system. The records in HBase are stored in sorted order according to their rowkeys, and the data in a row are grouped together as column families, where each column family has one or more columns. The columns in a family are stored together in a low-level storage file called an 'HFile'.
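To make this data model concrete, here is a minimal sketch using the HBase 2.x Java client, creating a table with one column family; the table name "users" and the family name "demographic" are assumptions for illustration, not something fixed by the lecture.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;

    public class CreateTableSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {
                // One column family "demographic"; all columns in this family
                // are stored together in the same low-level HFiles.
                TableDescriptor table = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("users"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("demographic"))
                    .build();
                admin.createTable(table);
            }
        }
    }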
So, a table is divided into sequences of rows, by key range, called 'Regions'. Here we can see that one key range, R1, is nothing but a sequence of rows, and these rows are stored together in a region. These regions are then assigned to nodes in the cluster called 'Region Servers'. That is shown here: for example, the rowkey range R2 will hold another sequence of rows, and it will be stored in another region on another region server. So, the regions are managed by the data nodes, which are called the 'Region Servers'; the region servers act as the data nodes.
Now, the cell in HBase. A table's data is stored in HBase table cells. A cell is a combination of row, column family and column qualifier, and it contains a value and a timestamp. So, the key consists of the row key, column name and a timestamp, as shown here, and the entire cell, with this added structural information, is called a 'Key-Value Pair'.
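A small sketch can make the cell coordinates concrete (HBase 2.x Java client; the table, family and qualifier names are hypothetical): a Put addresses a cell by row key, column family and qualifier, and a Get reads it back together with the timestamp that completes the key.

    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.CellUtil;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CellSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("users"))) {
                byte[] row = Bytes.toBytes("row1");
                byte[] cf  = Bytes.toBytes("demographic");
                byte[] col = Bytes.toBytes("ethnicity");
                // Write one cell: (row, family, qualifier) -> value; HBase assigns the timestamp.
                table.put(new Put(row).addColumn(cf, col, Bytes.toBytes("some-value")));
                // Read the same cell back and inspect its timestamp.
                Result r = table.get(new Get(row).addColumn(cf, col));
                Cell cell = r.getColumnLatestCell(cf, col);
                System.out.println(Bytes.toString(CellUtil.cloneValue(cell))
                    + " @ ts=" + cell.getTimestamp());
            }
        }
    }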
So, the HBase data model consists of tables. HBase organizes data into tables, and table names are strings composed of characters that are safe for use in a file system path. Within a table, data is stored according to its row. Rows are identified by a row key, and row keys do not have a data type; they are always treated as a byte [ ] (byte array). This aspect we have already covered. Column family: the data within a row is grouped by column family. Every row in a table has the same set of column families, although a row need not store data in all of its families. Column family names are strings composed of characters that are safe for use in a file system path.
And timestamp: the values within a cell are versioned. Versions are identified by their version number, which by default is the timestamp. If the timestamp is not specified for a read, the latest version is returned. The number of cell-value versions retained by HBase is configured for each family; the default number of cell versions is three.
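As a sketch of how this per-family version count is configured through the HBase 2.x Admin API (table and family names are again hypothetical):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class VersionsSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {
                // Keep up to 3 timestamped versions of each cell in this family.
                ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("demographic"))
                    .setMaxVersions(3)
                    .build();
                admin.modifyColumnFamily(TableName.valueOf("users"), cf);
            }
        }
    }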
Let us see, in more detail once again, the HBase architecture. So, HBase has a client, and the client can access the HRegionServers. There are many HRegionServers; one such HRegionServer is shown here, which has an HLog, and each HRegionServer is further divided into different HRegions. One such HRegion is shown here, and an HRegion contains stores, each with a MemStore. Within the store there are StoreFiles, and a StoreFile contains the basic storage unit called an 'HFile', which is stored in HDFS. Now, there is one HMaster, and the HMaster communicates with ZooKeeper, with the HRegionServers and with HDFS. We have seen the HMaster; and what is ZooKeeper? It is a small group of servers which runs a consensus protocol such as Paxos. ZooKeeper is the coordination service for HBase and assigns the different nodes and servers to this service; if ZooKeeper is not there, then HBase will stop functioning.
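Because clients locate the HMaster and the regions through ZooKeeper, a client only needs the ZooKeeper quorum address to connect. A minimal sketch (the hostnames are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class ConnectSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // The client asks ZooKeeper where the HMaster and the regions live.
            conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
            conf.set("hbase.zookeeper.property.clientPort", "2181");
            try (Connection conn = ConnectionFactory.createConnection(conf)) {
                System.out.println("Connected: " + !conn.isClosed());
            }
        }
    }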
So, this is about the HBase HFile. An HFile comprises data blocks, meta blocks, indices and a trailer, where the data portion consists of a magic value and (key, value) pairs. A key-value pair carries the key length, value length, row length, row, column family length, column family, column qualifier, timestamp, key type and value. In the example here, the rows hold SSN numbers, the column family holds demographic information, and the column qualifier is ethnicity; together these make up the HBase key.
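These key components can be seen programmatically. Here is a sketch constructing a KeyValue with the lecture's example coordinates (an SSN-style row key, the demographic family and the ethnicity qualifier); the concrete values are placeholders:

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.util.Bytes;

    public class KeyValueSketch {
        public static void main(String[] args) {
            KeyValue kv = new KeyValue(
                Bytes.toBytes("123-45-6789"),      // row (SSN)
                Bytes.toBytes("demographic"),      // column family
                Bytes.toBytes("ethnicity"),        // column qualifier
                System.currentTimeMillis(),        // timestamp
                Bytes.toBytes("some-value"));      // value
            // The serialized key bundles row, family, qualifier, timestamp and key type.
            System.out.println("key length = " + kv.getKeyLength()
                + ", value length = " + kv.getValueLength()
                + ", timestamp = " + kv.getTimestamp());
        }
    }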
Now, HBase prefers strong consistency over availability. HBase uses a write-ahead log: whenever a client comes with keys (K1, K2, K3, K4) and gives them to the HRegionServer, then, let us say, (K1, K2) will map to one HRegion and (K3, K4) to another HRegion, and the data will be stored in the store, which has the MemStore and StoreFiles, internally kept as HFiles. Writing to the HLog before writing to the MemStore is there to ensure fault tolerance and recovery from failures; this helps recover from a failure by replaying the HLog.
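The write-ahead behavior can be made explicit per mutation. A sketch (HBase 2.x client, hypothetical names) forcing a synchronous WAL write before the Put is acknowledged:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WalSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("users"))) {
                Put put = new Put(Bytes.toBytes("K1"))
                    .addColumn(Bytes.toBytes("demographic"),
                               Bytes.toBytes("ethnicity"),
                               Bytes.toBytes("some-value"));
                // Sync the edit to the HLog (WAL) before it is applied to the MemStore,
                // so a region-server crash can be recovered by replaying the log.
                put.setDurability(Durability.SYNC_WAL);
                table.put(put);
            }
        }
    }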
Now, cross-datacenter replication: there is a single "master" cluster, and the other "slave" clusters replicate the same table. The master cluster ships its HLogs over to the slave clusters, and coordination among the clusters is done by ZooKeeper. ZooKeeper can be used like a file system to store the control information; this ZooKeeper uses different paths for the replication state, the peer cluster number and each log, and all this information is kept there.
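As a sketch of how a slave cluster is registered through the HBase 2.x Admin API (the peer id, ZooKeeper cluster key, table and family names are placeholders):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HConstants;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.replication.ReplicationPeerConfig;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ReplicationSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {
                // Register the slave cluster by its ZooKeeper cluster key (placeholder).
                ReplicationPeerConfig peer = ReplicationPeerConfig.newBuilder()
                    .setClusterKey("slave-zk1,slave-zk2,slave-zk3:2181:/hbase")
                    .build();
                admin.addReplicationPeer("1", peer);
                // Mark the family for replication so its HLog edits are shipped.
                ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("demographic"))
                    .setScope(HConstants.REPLICATION_SCOPE_GLOBAL)
                    .build();
                admin.modifyColumnFamily(TableName.valueOf("users"), cf);
            }
        }
    }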
Now, let us see how auto-sharding is done. Auto-sharding means that a table is divided into row ranges, that is, ranges of row keys, and these ranges are stored on the region servers, and the region servers then serve the clients.
Now, similarly, in the table's logical view we can see that the rows are split by row key into range keys, and these ranges are sharded across the region servers; here we have shown that the rows from A to Z are stored in three different region servers. This layout is done automatically, and that is what is called 'Auto Sharding' and 'Distribution'.
So, the unit of scalability in HBase is the region, which we have seen, managed by the region servers. Regions are sorted, contiguous ranges of rows, spread randomly across the RegionServers and moved around for load balancing and failover, as we have already seen in the previous slide. A region is split automatically, or manually, to scale with the growing data, and capacity is simply a factor of the number of cluster nodes versus the number of regions per node.
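Region splitting is normally automatic, but a table can also be created pre-split into regions by supplying boundary row keys. A sketch with hypothetical split points:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {
                TableDescriptor table = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("users_presplit"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("demographic"))
                    .build();
                // Three split points yield four regions:
                // (-inf,"H"), ["H","Q"), ["Q","Z"), ["Z",+inf)
                byte[][] splits = {
                    Bytes.toBytes("H"), Bytes.toBytes("Q"), Bytes.toBytes("Z")
                };
                admin.createTable(table, splits);
            }
        }
    }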
So, let us see the positioning of the bloom filter in the HFile. At the bottom of the HFile, the bloom filter information is stored, and this helps identify which of the different blocks actually contain the requested key. Using the bloom filter, at most two blocks need to be accessed, rather than reading the entire set of blocks sequentially. Hence, the access cost is reduced from 6 seeks to 2 seeks per lookup using the bloom filter, and the bloom filter itself is stored as a part of the HFile.
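A sketch of enabling a row-level bloom filter on a family, so lookups can skip HFile blocks that cannot contain the key (HBase 2.x API, hypothetical names):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.regionserver.BloomType;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BloomSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {
                // ROW blooms are written into each HFile and consulted before block reads.
                ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("demographic"))
                    .setBloomFilterType(BloomType.ROW)
                    .build();
                admin.modifyColumnFamily(TableName.valueOf("users"), cf);
            }
        }
    }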
Now, as far as fold, store and shift is concerned, you can see here that, for a particular row, the same data is stored in multiple instances, because time-series data is stored at different timestamps for a particular row key. That is why several values for a particular row key now appear; they vary in their timestamps T1, T2 and T3, and they are all stored into the HFile.
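A sketch of reading these time-series versions back: a Get can ask for several timestamped versions of the same cell (HBase 2.x API; the names are hypothetical):

    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.CellUtil;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimeSeriesSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("users"))) {
                byte[] cf  = Bytes.toBytes("demographic");
                byte[] col = Bytes.toBytes("ethnicity");
                // Ask for up to 3 versions (e.g. T1, T2, T3) of the same row/family/qualifier.
                Get get = new Get(Bytes.toBytes("row1"))
                    .addColumn(cf, col)
                    .readVersions(3);
                for (Cell cell : table.get(get).getColumnCells(cf, col)) {
                    System.out.println("ts=" + cell.getTimestamp()
                        + " value=" + Bytes.toString(CellUtil.cloneValue(cell)));
                }
            }
        }
    }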
Conclusion: Traditional databases (RDBMSs) work with strong consistency and offer the ACID properties, but many modern workloads do not need such strong guarantees; they do need fast response times, that is, availability. Unfortunately, CAP lets us have only two out of these three; so, given partition tolerance, HBase prefers consistency over availability. Key-value or NoSQL systems offer the BASE properties in these scenarios; Cassandra, for example, offers eventual consistency, and a variety of other consistency models are striving towards strong consistency. So, in this lecture we have covered the HBase architecture, its components, data model and storage hierarchy, cross-datacenter replication, auto-sharding and distribution, the bloom filter, and the fold, store and shift operation. Thank you.