Module 05 HBase - Distributed NoSQL Database

The document provides an overview of HBase, a distributed column-oriented storage system designed for high reliability, performance, and scalability. It covers HBase's architecture, key processes, and Huawei's enhanced features, highlighting its suitability for massive data storage and real-time access. Additionally, it contrasts HBase with traditional relational databases and details its data storage models, including KeyValue structures and the role of ZooKeeper in managing distributed operations.

Uploaded by

Lucas Oliveira

Technical Principles of HBase

www.huawei.com

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved.


Objectives
 Upon completion of this course, you will understand:
 System architecture of HBase
 Key features of HBase
 Basic functions of HBase
 Huawei enhanced features of HBase

Contents
1. Introduction to HBase

2. Functions and Architecture of HBase

3. Key Processes of HBase

4. Huawei Enhanced Features of HBase

HBase Overview
 HBase is a column-based distributed storage system that
features high reliability, performance, and scalability.
 HBase is suitable for storing big table data (which contains billions of rows
and millions of columns) and allows real-time data access.

 HBase uses HDFS as the file storage system to provide a distributed
column-oriented database system that allows real-time data reading and
writing.

 HBase uses ZooKeeper as the collaboration service.

HBase vs. RDB

HBase:
1. Distributed storage and column-oriented.
2. Dynamic extension of columns.
3. Supports commodity hardware, lowering the expansion cost.

RDB:
1. Fixed data structure.
2. Pre-defined data structure.
3. I/O-intensive and costly expansion.

Application Scenarios of HBase
 HBase applies to the following scenarios:
 Massive data (TB and PB)
 The Atomicity, Consistency, Isolation, Durability (ACID) feature supported
by traditional relational databases is not required.
 High throughput
 Efficient random reading of massive data
 High scalability
 Simultaneous processing of structured and unstructured data

Position of HBase in FusionInsight
[Figure: FusionInsight architecture diagram. Labels: application service layer (OpenAPI/SDK, REST/SNMP/Syslog); Data, Information, Knowledge, Wisdom; DataFarm (Porter, Miner, Farmer); Manager (system management, service governance, security management); Hadoop API and Plugin API; Hadoop components (Hive, M/R, Spark, Storm, Flink, YARN/ZooKeeper, HDFS/HBase) and LibrA.]

 HBase is a column-based distributed storage system that features high
reliability, performance, and scalability. It stores massive data and is designed
to eliminate limitations of relational databases in the processing of massive data.

Data Stored By Row

[Figure: a table with the columns ID, Name, Phone, and Address, stored row by row.]

 Data is stored by row in an underlying file system. Generally, a fixed amount
of space is allocated to each row.
 Advantages: Data can be added, modified, or read by row.
 Disadvantages: Some unnecessary data is obtained when data in a column is
queried.

Data Stored by Column

[Figure: a table with the columns ID, Name, Phone, and Address, stored column by column.]

 Data is stored by column in an underlying file system.
 Advantages: Data can be read or calculated by column.
 Disadvantages: When a row is read, multiple I/O operations may be
required.

KeyValue Storage Model (1)
[Figure: a row with the columns ID, Name, Phone, and Address stored as four KeyValues that share Key-01: Value-ID01, Value-Name01, Value-Phone01, and Value-Address01.]

 KeyValue has a specific structure. Key is used to quickly query a data record,
and Value is used to store user data.
 As a basic user data storage unit, KeyValue must store some description of
itself, such as timestamp and type information. This requires some structured
space.
 Data can be expanded dynamically, adapting to changes in data types and
structures. Data is read and written by block. Columns are not associated
with one another, and neither are tables.

KeyValue Storage Model (2)
 Partition mode of a KeyValue Database - based on continuous Key range.

[Figure: Region_01 through Region_12, each covering a continuous Key range, are distributed across Node1, Node2, and Node3.]

Data is partitioned into subregions by RowKey range, with RowKeys sorted in a
defined order (for example, alphabetical order). Each subregion is a basic
distributed storage unit.

KeyValue Storage Model (3)

 The underlying data of HBase exists in the form of KeyValue. KeyValue has a
specific format.
 KeyValue contains key information such as timestamp and type.
 The same key can be associated with multiple Values. Each KeyValue has a
qualifier.
 There can be multiple KeyValues associated with the same Key and Qualifier.
In this case, they are distinguished using timestamps. This is why there are
multiple versions of the same data record.
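
The versioning behavior described above can be sketched in a few lines of Python. This is a toy in-memory model, not HBase's storage format; the `KeyValue` tuple, the `Cells` class, and their methods are hypothetical names chosen for illustration.

```python
from collections import namedtuple

# Hypothetical, simplified KeyValue: real HBase KeyValues also carry a column
# family and a type.
KeyValue = namedtuple("KeyValue", ["row", "qualifier", "timestamp", "value"])


class Cells:
    """Toy store that keeps every version of a (row, qualifier) pair."""

    def __init__(self):
        self._versions = {}  # (row, qualifier) -> list of KeyValue

    def put(self, row, qualifier, timestamp, value):
        kv = KeyValue(row, qualifier, timestamp, value)
        self._versions.setdefault((row, qualifier), []).append(kv)

    def get(self, row, qualifier, max_versions=1):
        """Return up to max_versions values, newest timestamp first."""
        kvs = self._versions.get((row, qualifier), [])
        newest_first = sorted(kvs, key=lambda kv: kv.timestamp, reverse=True)
        return [kv.value for kv in newest_first[:max_versions]]


cells = Cells()
cells.put("Key-01", "Phone", 100, "6875349")
cells.put("Key-01", "Phone", 200, "6831475")  # same Key and Qualifier, newer timestamp
latest = cells.get("Key-01", "Phone")
history = cells.get("Key-01", "Phone", max_versions=2)
```

Here `latest` holds only the value with the newest timestamp, while `history` returns both versions of the same record.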

Contents
1. Introduction to HBase

2. Functions and Architecture of HBase

3. Key Processes of HBase

4. Huawei Enhanced Features of HBase

HBase Architecture (1)

HBase Architecture (2)
 Store: A Region consists of one or multiple Stores. Each Store corresponds
to a Column Family.

 MemStore: A Store contains one MemStore. Data inserted into a Region by a
client is cached in the MemStore.

 StoreFile: Data flushed to HDFS is stored there as a StoreFile.

 HFile: HFile defines the storage format of StoreFiles in a file system. HFile is the
underlying implementation of StoreFile.

 HLog: HLogs prevent data loss when a RegionServer is faulty. Multiple Regions in a
RegionServer share the same HLog.
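
The relationships among these components can be sketched as a small object model. This is a minimal illustration under assumptions: the class and attribute names are hypothetical, StoreFiles are modeled as plain dictionaries, and the ordering of "log first, then cache" is an assumption of the sketch rather than something the slide states.

```python
class Store:
    """One Store per column family: a MemStore plus flushed StoreFiles."""

    def __init__(self):
        self.memstore = {}
        self.store_files = []  # each flushed file is modeled as a sorted dict

    def flush(self):
        if self.memstore:
            self.store_files.append(dict(sorted(self.memstore.items())))
            self.memstore = {}


class Region:
    def __init__(self, column_families):
        self.stores = {cf: Store() for cf in column_families}


class RegionServer:
    def __init__(self):
        self.hlog = []      # one HLog shared by every Region on this server
        self.regions = {}

    def put(self, region_name, family, row, value):
        # Assumed ordering: record the edit in the HLog, then cache it.
        self.hlog.append((region_name, family, row, value))
        self.regions[region_name].stores[family].memstore[row] = value


server = RegionServer()
server.regions["Region-1"] = Region(["cf1", "cf2"])
server.regions["Region-2"] = Region(["cf1"])
server.put("Region-1", "cf1", "Row001", "a")
server.put("Region-2", "cf1", "Row011", "b")
server.regions["Region-1"].stores["cf1"].flush()
```

After the flush, the `cf1` Store of Region-1 has an empty MemStore and one StoreFile, while both writes remain in the shared HLog.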

HMaster (1)

"Hey, Region A, please move to RegionServer 1!"
"RegionServer 2 is gone! Let the others take over its Regions!"

RegionServer1 RegionServer2 RegionServer3

HMaster (2)
 The HMaster process manages all the RegionServers.
 Handles RegionServer failovers.

 The HMaster process performs cluster operations including creating,
modifying, and deleting tables.

 The HMaster process migrates Regions.
 Allocates Regions when a new table is created.
 Ensures load balancing during operation.
 Takes over Regions after a RegionServer failover occurs.

RegionServer
 RegionServer is the data service process of HBase and is responsible
for processing read and write requests for user data.

 RegionServer manages Regions. All read and write requests for user
data are handled through interaction with the Regions on RegionServers.

 Regions can be migrated between RegionServers.

[Figure: a RegionServer hosting several Regions.]

Region (1)
 A data table is divided horizontally into subtables based on RowKey
ranges to implement distributed storage. A subtable is called
a Region in HBase.
 Each Region is associated with a RowKey range, which is described
using a StartKey and an EndKey.
 Each Region only needs to record a StartKey, because its EndKey serves as
the StartKey of the next Region.

 Region is the most basic distributed storage unit of HBase.
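
Because only StartKeys need to be recorded, finding the Region that owns a key reduces to a binary search over the sorted StartKeys. The following is a minimal sketch, assuming string keys compared in lexicographic order; the Region names, the key values, and the use of an empty string for "start of table" are illustrative assumptions.

```python
import bisect

# Hypothetical Region table: only StartKeys are stored, in sorted order.
# The empty string stands for "start of table" (an assumption of this sketch).
start_keys = ["", "Row011", "Row021", "Row031"]
region_names = ["Region-1", "Region-2", "Region-3", "Region-4"]


def find_region(row_key):
    """Return the Region whose [StartKey, next StartKey) range holds row_key."""
    # bisect_right counts the StartKeys <= row_key; the owner is the last of them.
    index = bisect.bisect_right(start_keys, row_key) - 1
    return region_names[index]


def end_key(index):
    """A Region's EndKey is the next Region's StartKey (None for the last Region)."""
    return start_keys[index + 1] if index + 1 < len(start_keys) else None
```

For example, `find_region("Row011")` returns `"Region-2"`, and `end_key(0)` returns `"Row011"`, the StartKey of the next Region.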

Region (2)
[Figure: a table whose rows are divided into Region-1 (Row001 to Row010), Region-2 (Row011 to Row020), Region-3 (Row021 to Row030), and Region-4 (Row031 onward). Each Region is described by a StartKey and an EndKey.]

Region (3)
[Figure: a META Region pointing to multiple User Regions.]

 Regions are categorized as Meta Region and User Region.
 Meta Region records routing information of User Regions.
 Perform the following steps to access data in a Region:
 Search for the address of the Meta Region.
 Search for the address of the User Regions in the Meta Region.
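
The two lookup steps can be sketched with plain dictionaries. Everything here is a hypothetical stand-in: the slide does not say where the Meta Region's address is kept, so this sketch assumes a small coordination store, and the server names and keys are invented for illustration.

```python
# Hypothetical, in-memory stand-ins for the real services.
coordination_store = {"meta-region-server": "RegionServer1"}  # assumed location record

# Contents of the Meta Region, keyed by the server that hosts it:
# each entry maps a User Region's StartKey to the RegionServer hosting it.
meta_regions = {
    "RegionServer1": [("", "RegionServer2"), ("Row011", "RegionServer3")],
}


def locate(row_key):
    # Step 1: find the address of the Meta Region.
    meta_server = coordination_store["meta-region-server"]
    # Step 2: look up the User Region (and its server) in the Meta Region.
    owner = None
    for start_key, server in meta_regions[meta_server]:
        if start_key <= row_key:
            owner = server  # entries are sorted, so the last match wins
    return owner
```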

Column Family
[Figure: the Regions of a table and their directory layout in HDFS.]

/HBase/table
    /region-1/ColumnFamily-1
    /region-1/ColumnFamily-2
    /region-2/ColumnFamily-1
    /region-2/ColumnFamily-2
    /region-3/ColumnFamily-1
    /region-3/ColumnFamily-2

 A ColumnFamily is a physical storage unit of a Region. Multiple column families of the
same Region have different paths in HDFS.
 ColumnFamily information is table-level configuration information. That is, multiple
Regions of the same table have the same column family information. (For example,
each Region has two column families and the configuration information of the same
column family of different Regions is the same.)

ZooKeeper
ZooKeeper provides the following functions for HBase:
 Distributed lock service
 Multiple HMaster processes will try registering a node in ZooKeeper and the node can be
registered only by one HMaster process. The process that successfully registers the node
becomes the active HMaster process.

 Event listening mechanism
 The active HMaster's record is deleted after the active process fails, and the standby
processes receive an update message indicating that the active HMaster is down.

 Micro database role
 ZooKeeper stores the addresses of RegionServers. In this case, ZooKeeper can be regarded
as a micro database.
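
The lock and event-listening roles can be illustrated with a toy, single-process stand-in for ZooKeeper. This is only a sketch under assumptions: `ToyZooKeeper`, its methods, and the `/hbase/master` path are hypothetical names, and real ZooKeeper clients, sessions, and ephemeral nodes behave differently in detail.

```python
class ToyZooKeeper:
    """Minimal in-memory stand-in for ZooKeeper (hypothetical, single-process)."""

    def __init__(self):
        self._nodes = {}      # path -> owner
        self._watchers = {}   # path -> list of callbacks

    def create(self, path, owner):
        """Register a node; only the first caller succeeds (acts as a lock)."""
        if path in self._nodes:
            return False
        self._nodes[path] = owner
        return True

    def watch(self, path, callback):
        self._watchers.setdefault(path, []).append(callback)

    def delete(self, path):
        """Remove a node and notify every watcher (event listening)."""
        self._nodes.pop(path, None)
        for callback in self._watchers.pop(path, []):
            callback(path)

    def get(self, path):
        return self._nodes.get(path)


zk = ToyZooKeeper()
MASTER_PATH = "/hbase/master"  # path name is an assumption for this sketch

first = zk.create(MASTER_PATH, "HMaster-A")    # succeeds: becomes active
second = zk.create(MASTER_PATH, "HMaster-B")   # fails: stays standby

# The standby watches the node and tries to register when it disappears.
zk.watch(MASTER_PATH, lambda path: zk.create(path, "HMaster-B"))
zk.delete(MASTER_PATH)  # simulates the active HMaster failing
active = zk.get(MASTER_PATH)
```

After the simulated failure, the standby's callback registers the node, so `active` is `"HMaster-B"`.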

MetaData Table
 The MetaData Table (hbase:meta) stores information about Regions so that a
client can locate a specific Region.

 The MetaData Table is split into multiple Regions, and the metadata of
those Regions is stored in ZooKeeper.

[Figure: mapping relation between the Metadata Table and User Table 1 through User Table N.]

Contents
1. Introduction to HBase

2. Functions and Architecture of HBase

3. Key Processes of HBase

4. Huawei Enhanced Features of HBase

Writing Process

Client Initiating a Data Writing Request

 A client initiating a write request is like a book supplier delivering
books to a library. The supplier must determine which building and
floor the books should be sent to.

Writing Process - Locating a Region

Writing Process - Grouping Data (1)

Writing Process - Grouping Data (2)
 Data grouping includes two steps:
 Find the Region and RegionServer information of the table based on the
meta table.
 Assign each piece of data to a specific Region according to its RowKey.

 The data destined for each RegionServer is sent together. At this point, the
data has already been grouped by Region.
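
The two grouping steps can be sketched as follows. This is a minimal illustration, assuming the routing information has already been read from the meta table; the `routing` list, the Region and server names, and both function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical routing data, as it might be read from the meta table:
# (StartKey, Region name, RegionServer), sorted by StartKey.
routing = [
    ("", "Region-1", "RegionServer1"),
    ("Row011", "Region-2", "RegionServer2"),
    ("Row021", "Region-3", "RegionServer1"),
]


def route(row_key):
    """Step 1: find the Region and RegionServer for a RowKey."""
    match = routing[0]
    for entry in routing:
        if entry[0] <= row_key:
            match = entry
    return match[1], match[2]


def group_puts(puts):
    """Step 2: group (row_key, value) pairs by RegionServer, then by Region."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row_key, value in puts:
        region, server = route(row_key)
        grouped[server][region].append((row_key, value))
    return grouped


batches = group_puts([("Row001", "a"), ("Row012", "b"), ("Row025", "c")])
```

The result holds one batch per RegionServer, with the data inside each batch already divided by Region.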

Writing Process - Sending a Request to a RegionServer
 Data is sent using the encapsulated RPC framework of HBase.

 Operations of sending requests to multiple RegionServers are implemented
concurrently.

 After sending a data writing request, a client waits for the request
processing result.

 If the client does not capture any exception, it deems that all data has been
written successfully. If writing the data fails completely or partially, the
client can obtain a detailed KeyValue list relevant to the failure.
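
The concurrent send and the failure list can be sketched with a thread pool. This is an illustration only: `send_batch` is a hypothetical stub that stands in for an RPC, and the rule that one server rejects every write exists purely to produce a failure list.

```python
from concurrent.futures import ThreadPoolExecutor


def send_batch(server, key_values):
    """Hypothetical RPC stub: returns the KeyValues that failed to write."""
    # For illustration only, pretend that RegionServer2 rejects every write.
    return list(key_values) if server == "RegionServer2" else []


def write_all(batches):
    """Send one request per RegionServer concurrently and collect failures."""
    with ThreadPoolExecutor(max_workers=max(len(batches), 1)) as pool:
        futures = {
            server: pool.submit(send_batch, server, kvs)
            for server, kvs in batches.items()
        }
        # result() blocks, so the client waits for every request to finish.
        failed = {server: f.result() for server, f in futures.items()}
    return {server: kvs for server, kvs in failed.items() if kvs}


failures = write_all({
    "RegionServer1": [("Row001", "a")],
    "RegionServer2": [("Row012", "b")],
})
```

An empty `failures` dictionary corresponds to the case where all data was written successfully; otherwise it lists the KeyValues that need attention.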

Writing Process - Process of Writing Data to a Region

Writing Process - Flush
[Figure: a Region whose MemStore-1 (ColumnFamily-1) and MemStore-2 (ColumnFamily-2) are each flushed to an HFile.]

 A Flush operation of the MemStore is triggered in any of the following
scenarios:
 The total MemStore usage of a Region reaches the predefined Flush Size
threshold.
 The ratio of occupied memory to total memory of the RegionServer reaches the
threshold.
 The number of WALs reaches the threshold.
 By default, HBase flushes the MemStore periodically, once every hour.
 Users can flush a table or Region separately by a shell command.
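
The automatic triggers can be expressed as a simple check. This is a sketch under stated assumptions: the threshold values in `FlushThresholds` are illustrative placeholders rather than recommended settings, and the function and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FlushThresholds:
    # All values below are illustrative assumptions, not recommended settings.
    region_flush_size: int = 128 * 1024 * 1024   # bytes of MemStore per Region
    server_memory_ratio: float = 0.4             # MemStore share of RegionServer memory
    max_wals: int = 32                           # number of WAL files
    periodic_interval_s: int = 3600              # periodic flush, 1 hour


def flush_reasons(region_memstore_bytes, server_memstore_ratio, wal_count,
                  seconds_since_flush, thresholds=FlushThresholds()):
    """Return the list of triggers that currently call for a flush."""
    reasons = []
    if region_memstore_bytes >= thresholds.region_flush_size:
        reasons.append("region flush size")
    if server_memstore_ratio >= thresholds.server_memory_ratio:
        reasons.append("RegionServer memory ratio")
    if wal_count >= thresholds.max_wals:
        reasons.append("WAL count")
    if seconds_since_flush >= thresholds.periodic_interval_s:
        reasons.append("periodic flush")
    return reasons
```

A non-empty result means at least one trigger has fired; a manual flush from the shell is a separate path that does not depend on these checks.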

Impacts of Multiple HFiles

As time passes by, the number of HFiles increases and a query request
will take much more time.

Compaction (1)
 Compaction aims to reduce the number of small files in a column family in a
Region, thereby increasing reading performance.
 There are two kinds of compaction: major and minor.
 Minor: compaction covering a small range. Minimum and maximum numbers of
files are specified. Small files from a consecutive time period are merged.
 Major: compaction covering the HFiles in a column family in a Region. During
major compaction, deleted data is cleared.

 Files are selected based on a certain algorithm during minor compaction.
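
The difference between the two kinds of compaction can be sketched with HFiles modeled as dictionaries. This is a simplified illustration: the file-selection algorithm is not modeled (the caller picks the files), `TOMBSTONE` is a hypothetical deletion marker, and timestamps and versions are ignored.

```python
TOMBSTONE = None  # hypothetical marker for a deleted cell


def merge_files(hfiles):
    """Merge HFiles (oldest first); newer entries overwrite older ones."""
    merged = {}
    for hfile in hfiles:
        merged.update(hfile)
    return dict(sorted(merged.items()))


def minor_compact(hfiles, start, count):
    """Merge `count` consecutive HFiles starting at index `start`."""
    merged = merge_files(hfiles[start:start + count])
    return hfiles[:start] + [merged] + hfiles[start + count:]


def major_compact(hfiles):
    """Merge every HFile of the column family and drop deleted data."""
    merged = merge_files(hfiles)
    return [{k: v for k, v in merged.items() if v is not TOMBSTONE}]


# Each HFile is modeled as a dict of row key -> value; the list is oldest first.
hfiles = [{"r1": "a"}, {"r2": "b"}, {"r1": TOMBSTONE}, {"r3": "c"}]
after_minor = minor_compact(hfiles, 0, 2)
after_major = major_compact(hfiles)
```

Minor compaction only reduces the file count and keeps deletion markers, whereas major compaction produces a single file from which the deleted row `r1` has been cleared.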

Compaction (2)

[Figure: a put writes to the MemStore; Flush produces multiple HFiles; Minor Compaction merges them into fewer HFiles; Major Compaction merges those into a single HFile.]

Region Split
 A Region is split into two subregions when its data size exceeds the
predefined threshold.

 During splitting, the Region being split suspends its reading and writing
services. The data files of the parent Region are not split and rewritten into
the two subregions. Instead, reference files are created in the new Regions
to achieve quick splitting. Therefore, services of the Region are suspended
only for a short time.

 Routing information of the parent Region cached in clients must be updated.

[Figure: a Parent Region split into DaughterRegion-1 and DaughterRegion-2.]
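
Why reference files make a split fast can be seen in a small sketch: the daughters record which parent files to read instead of copying data. The dictionary layout, the `"bottom"`/`"top"` labels, and the file names are assumptions made for illustration, not HBase's on-disk format.

```python
def split_region(parent, split_key):
    """Split a parent Region into two daughters without copying its files.

    `parent` is a hypothetical dict with "start", "end", and "files" keys;
    an "end" of None means the Region extends to the end of the table.
    """
    def daughter(start, end, half):
        # A reference records which parent file to read and which half applies.
        refs = [{"parent_file": f, "half": half, "split_key": split_key}
                for f in parent["files"]]
        return {"start": start, "end": end, "files": refs}

    return (daughter(parent["start"], split_key, "bottom"),
            daughter(split_key, parent["end"], "top"))


parent = {"start": "Row001", "end": None, "files": ["hfile-1", "hfile-2"]}
d1, d2 = split_region(parent, "Row050")
```

The work done is proportional to the number of files rather than the amount of data, which is consistent with the short service interruption described above.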

Reading Process

Client Initiating a Data Reading Request
Get  When a precise key is provided, the
Get operation is performed to read a
single row of user data.

Scan  The Scan operation is to batch scan


user data of a specified Key range.
Client
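
The two request types can be contrasted on a small sorted table. This is a toy model with hypothetical function names; it assumes a scan range that includes the start key and excludes the stop key.

```python
import bisect

# Hypothetical table: rows kept sorted by RowKey, as in a Region.
rows = [("Row001", "a"), ("Row002", "b"), ("Row010", "c"), ("Row011", "d")]
keys = [k for k, _ in rows]


def get(row_key):
    """Get: read the single row that matches a precise key, or None."""
    i = bisect.bisect_left(keys, row_key)
    if i < len(keys) and keys[i] == row_key:
        return rows[i][1]
    return None


def scan(start_key, stop_key):
    """Scan: return every row with start_key <= RowKey < stop_key."""
    lo = bisect.bisect_left(keys, start_key)
    hi = bisect.bisect_left(keys, stop_key)
    return rows[lo:hi]
```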

Locating a Region

"Hi, META, I want to look for books whose codes range from xxx to xxx.
Please find the bookshelf number and the floor information for that code
range."

OpenScanner
[Figure: a Region with ColumnFamily-1 (MemStore, HFile-11, HFile-12) and ColumnFamily-2 (MemStore, HFile-21, HFile-22).]

 During the OpenScanner process, scanners corresponding to MemStore and
each HFile are created:
 The scanner corresponding to HFile is StoreFileScanner.
 The scanner corresponding to MemStore is MemStoreScanner.
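
One way to picture how several scanners serve a single read is a k-way merge over sorted sources. The slide only says that the scanners are created; merging them with a heap is an assumption of this sketch, as are the tuple layout and the function name.

```python
import heapq


def open_scanner(memstore, hfiles):
    """Create one scanner per source and merge them into a single sorted stream.

    Each source is a list of (row_key, timestamp, value) tuples already sorted
    by row_key; this layout is an assumption of the sketch.
    """
    memstore_scanner = iter(memstore)                  # stands in for MemStoreScanner
    store_file_scanners = [iter(f) for f in hfiles]    # stand in for StoreFileScanner
    # Order by row key ascending, then newest timestamp first.
    return heapq.merge(memstore_scanner, *store_file_scanners,
                       key=lambda kv: (kv[0], -kv[1]))


memstore = [("Row002", 30, "new-b")]
hfile_11 = [("Row001", 10, "a"), ("Row002", 10, "old-b")]
hfile_12 = [("Row003", 20, "c")]
merged = list(open_scanner(memstore, [hfile_11, hfile_12]))
```

The merged stream yields rows in key order, and for `Row002` the newer MemStore value comes before the older HFile value.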

Filter
 Filter allows users to set filtering criteria during the Scan
operation. Only user data that meets the criteria is returned.
 Typical Filter types include:
 RowFilter
 SingleColumnValueFilter
 KeyOnlyFilter
 FilterList

[Figure: a scan over a table in which only the satisfied rows are returned.]
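
The idea behind these filter types can be sketched with plain Python predicates. These functions are hypothetical analogues written for illustration and are not the HBase filter classes themselves; in particular, the combined filter here requires every condition to pass.

```python
def row_filter(predicate):
    """Analogue of RowFilter: keep rows whose key satisfies the predicate."""
    return lambda key, columns: predicate(key)


def single_column_value_filter(column, expected):
    """Analogue of SingleColumnValueFilter: keep rows where column == expected."""
    return lambda key, columns: columns.get(column) == expected


def filter_list(*filters):
    """Analogue of FilterList: here, a row must pass every filter."""
    return lambda key, columns: all(f(key, columns) for f in filters)


def scan(table, row_filter_fn, keys_only=False):
    """Scan the table in key order and return only rows that pass the filter."""
    result = []
    for key in sorted(table):
        if row_filter_fn(key, table[key]):
            # keys_only mimics KeyOnlyFilter by dropping the values.
            result.append(key if keys_only else (key, table[key]))
    return result


table = {
    "01": {"A:Addr": "Beijing", "A:Age": 23},
    "02": {"A:Addr": "Hangzhou", "A:Age": 43},
    "03": {"A:Addr": "Shenzhen", "A:Age": 35},
}
wanted = filter_list(row_filter(lambda k: k >= "02"),
                     single_column_value_filter("A:Age", 35))
matches = scan(table, wanted, keys_only=True)
```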

BloomFilter
 BloomFilter is used to optimize scenarios where data is randomly read, that is,
scenarios where the Get operation is performed. It can be used to quickly
check whether a piece of user data exists in a large dataset (most data in the
dataset cannot be loaded to the memory).

 BloomFilter has a certain false-positive rate when it checks whether a piece of
data exists. Nevertheless, a result of "User data XXXX does not exist" is
always accurate.

 The data relevant to BloomFilter of HBase is stored in HFiles.
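
The one-sided error described above follows directly from how a Bloom filter works, as the following toy implementation shows. It is a sketch only: the bit-array size, the number of hashes, and the use of salted SHA-256 are illustrative assumptions and do not reflect how HBase builds the Bloom data stored in HFiles.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter; size and hash count are illustrative assumptions."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos] for pos in self._positions(key))


bloom = BloomFilter()
for row_key in ("Row001", "Row002", "Row003"):
    bloom.add(row_key)
```

Every added key always reports `True`, so a `False` answer lets a Get skip a file safely; a `True` answer may occasionally be a false positive and still requires reading the file.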

Contents
1. Introduction to HBase

2. Functions and Architecture of HBase

3. Key Processes of HBase

4. Huawei Enhanced Features of HBase

Supporting Secondary Index
 The secondary index enables HBase to query data based on specific column
values.
         Column Family A                 Column Family B
RowKey   A:Name     A:Addr.     A:Age    B:Mobile   B:Email
01       ZhangSan   Beijing     23       6875349    ……
02       LiLei      Hangzhou    43       6831475    ……
03       WangWu     Shenzhen    35       6809568    ……
04       ……         Wuhan       28       6812645    ……
05       ……         Changsha    26       6889763    ……
06       ……         Jinan       35       6854912    ……

Without a secondary index, a query for a specific mobile number (for example, '68XXX') must match
the Mobile field row by row across the entire table, which results in high latency.
With a secondary index, the index table is searched first to locate the mobile number,
which narrows down the search scope and reduces latency.
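
The difference between the two query paths can be sketched with two dictionaries. This is a toy model: the layout of `index_table` and both function names are hypothetical, and the sample rows reuse values from the example table.

```python
# Data table: RowKey -> columns (sample values from the example table).
data_table = {
    "01": {"A:Name": "ZhangSan", "B:Mobile": "6875349"},
    "02": {"A:Name": "LiLei", "B:Mobile": "6831475"},
    "03": {"A:Name": "WangWu", "B:Mobile": "6809568"},
}


def full_scan(mobile):
    """Without an index: compare the Mobile column of every row."""
    return [key for key, cols in data_table.items() if cols["B:Mobile"] == mobile]


# Hypothetical index table: indexed column value -> RowKeys in the data table.
index_table = {}
for key, cols in data_table.items():
    index_table.setdefault(cols["B:Mobile"], []).append(key)


def indexed_lookup(mobile):
    """With an index: one lookup in the index table, then direct row reads."""
    return [(key, data_table[key]["A:Name"]) for key in index_table.get(mobile, [])]
```

`full_scan` touches every row, while `indexed_lookup` reads only the rows that the index table points to.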

HFS
 HBase FileStream (HFS) is a separate module of HBase. As an
encapsulation of HBase and HDFS interfaces, HFS provides
capabilities such as storing, reading, and deleting files for
upper-level applications.

 HFS can store massive numbers of small files as well as large
files in HDFS. That is, massive numbers of small files (smaller than 10 MB) and
some large files (larger than 10 MB) can be stored in HBase.

HBase MOB (1)
 MOB data (100 KB to 10 MB) is stored directly in the file
system (for example, HDFS) as HFiles, and the address and size of
each file are stored in HBase as a value. Tools that manage these
files greatly reduce the frequency of compaction and splitting,
which improves performance.

HBase MOB (2)

Summary
 This module describes the following information about HBase:
KeyValue Storage Model, technical architecture, reading and
writing process and enhanced features of FusionInsight HBase.

Quiz
1. Can a Region in HBase provide services while it is being split?

2. What are the advantages of the Region splitting of HBase?

Quiz
1. What is Compaction used for? ( )

A. Reducing the number of files in a column family and Region

B. Improving data reading performance

C. Reducing the number of files in a column family

D. Reducing the number of files in a Region

Quiz
1. What is the physical storage unit of HBase? ( )
A. Region

B. Column Family

C. Column

D. Cell

More Information
 Training materials:
 http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796
 Exam outline:
 http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797
 Mock exam:
 http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798
 Authentication process:
 http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40
