HBase

Uploaded by

yashitgupta22
HBase

HBase: Part of Hadoop’s Ecosystem

HBase is built on top of HDFS

HBase files are internally stored in HDFS

HBase: Overview
• HBase is a distributed, column-oriented data store built on top of HDFS
• HBase is an Apache open-source project whose goal is to provide storage
for Hadoop distributed computing
• Data is logically organized into tables, rows and columns

Example Schema of Table in HBase

• Table is a collection of rows.


• Row is a collection of column families.
• Column family is a collection of columns.
• Column is a collection of key value pairs.
HBase vs. HDFS
• Both are distributed systems that scale to hundreds or
thousands of nodes

• HDFS is good for batch processing (scans over big files)


• Not good for record lookup
• Not good for incremental addition of small batches
• Not good for updates
• It provides only sequential access of data.

HBase vs. HDFS (Cont’d)
• HBase is designed to efficiently address the above points
• Fast record lookup
• Support for record-level insertion
• Support for updates (not in place)
• HBase provides random access: it keeps data sorted by row key
in indexed files on HDFS, which makes lookups fast.

HBase vs. HDFS (Cont’d)

If the application needs neither random reads nor random writes, stick to HDFS

HBase Data Model

HBase Data Model
• A column-oriented database stores data in cells grouped into columns,
not rows

• HBase is based on Google’s Bigtable model


• Key-Value pairs

HBase Data Model
1. Table & 2. Row
• An HBase table consists of multiple rows. Each row has a row key and one or more columns with values assigned to them. HBase sorts rows alphabetically by row key.
• The main goal is to store data so that related rows are close together. A common row-key pattern is a website domain. For example, if our row keys are domains, we should store them in reverse, i.e. org.apache.www, org.apache.mail, or org.apache.jira. This way, all Apache domains are close to each other in the HBase table.
3. Column
• An HBase column consists of a column family and a column qualifier separated by the : (colon) character.
• a. Column family: Column families physically house a set of columns and their values. Each column family has a set of storage properties, such as how its data is compressed, whether its values should be cached, how its row keys are encoded, and more. Each row in an HBase table has the same column families.
• b. Column qualifier: A column qualifier is added to the column family to provide an index for a given piece of data. Example: if the column family is content, the column qualifier can be content:html or content:pdf. Column families are fixed at table creation, but column qualifiers are mutable and can vary widely between rows.
4. Cell
• A cell is a combination of a row, a column family, and a column qualifier. It contains a value and a timestamp that represents the version of the value.
5. Timestamp
• A timestamp is an identifier for a given value version and is written next to each value. By default, the timestamp represents the time on the RegionServer when the data was written. However, we can specify a different timestamp value when inserting data into a cell.
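
The reversed-domain row-key pattern can be illustrated with a small sketch. It uses only the Java standard library; the class `RowKeys` and its methods are hypothetical helpers, not part of the HBase API, and a sorted set of strings stands in for HBase's ordering of rows by key (for plain ASCII keys the two orderings agree).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not part of the HBase API.
class RowKeys {
    // Reverses the dot-separated labels of a domain: "www.apache.org" -> "org.apache.www".
    static String reverseDomain(String domain) {
        List<String> labels = new ArrayList<>(Arrays.asList(domain.split("\\.")));
        Collections.reverse(labels);
        return String.join(".", labels);
    }

    // Returns the reversed keys in sorted order, mimicking how rows are ordered by row key.
    static List<String> sortedRowKeys(List<String> domains) {
        TreeSet<String> keys = new TreeSet<>();
        for (String domain : domains) {
            keys.add(reverseDomain(domain));
        }
        return new ArrayList<>(keys);
    }

    public static void main(String[] args) {
        List<String> domains = Arrays.asList(
                "www.apache.org", "www.cnn.com", "mail.apache.org", "jira.apache.org");
        // All org.apache.* keys end up next to each other in the sorted output.
        System.out.println(sortedRowKeys(domains));
    }
}
```

Because the shared suffix of the domain becomes a shared prefix of the key, related rows sort together and can be read with one short range scan.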
HBase: Keys and Column Families
Each record is divided into Column Families

Each row has a Key

Each column family consists of one or more Columns

Figure labels: column family named “anchor”; column family named “Contents”; column named “apache.com”

• Key
• Byte array
• Serves as the primary key for the table
• Indexed for fast lookup

• Column Family
• Has a name (string)
• Contains one or more related columns

• Column
• Belongs to one column family
• Included inside the row
• familyName:columnName

Version number for each row

• Version Number
• Unique within each key value
• By default → System’s timestamp
• Data type is Long

• Value (Cell)
• Byte array

Notes on Data Model
• HBase schema consists of several Tables
• Each table consists of a set of Column Families
• Columns are not part of the schema

• HBase has Dynamic Columns
• Because column names are encoded inside the cells
• Different cells can have different columns

Figure label: the “Roles” column family has different columns in different cells

Notes on Data Model (Cont’d)
• The version number can be user-supplied
• It does not even have to be inserted in increasing order
• Version numbers are unique within each key

• Table can be very sparse
• Many cells are empty

• Keys are indexed as the primary key

Figure label: has two columns [cnnsi.com & my.look.ca]
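
The notes above (user-supplied versions, out-of-order inserts, sparse tables) can be pictured as nested sorted maps. The sketch below is a hypothetical in-memory model of the logical layout only, not how HBase is implemented; the class `SparseVersionedTable` and its method names are invented for illustration.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory model of the logical data model; not HBase's implementation.
class SparseVersionedTable {
    // row key -> "family:qualifier" -> version (newest first) -> value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    // Stores a value under an explicit, user-supplied version number.
    void put(String rowKey, String column, long version, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(column, k -> new TreeMap<Long, String>(Comparator.reverseOrder()))
            .put(version, value);
    }

    // Returns the value with the highest version, or null if the cell was never written.
    String latest(String rowKey, String column) {
        NavigableMap<String, NavigableMap<Long, String>> row = rows.get(rowKey);
        if (row == null) {
            return null;
        }
        NavigableMap<Long, String> versions = row.get(column);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    // Counts stored versions; empty cells cost nothing because no entry exists for them.
    int storedCells() {
        int count = 0;
        for (NavigableMap<String, NavigableMap<Long, String>> row : rows.values()) {
            for (NavigableMap<Long, String> versions : row.values()) {
                count += versions.size();
            }
        }
        return count;
    }

    public static void main(String[] args) {
        SparseVersionedTable table = new SparseVersionedTable();
        table.put("com.cnn.www", "anchor:cnnsi.com", 9L, "CNN");
        table.put("com.cnn.www", "anchor:my.look.ca", 8L, "CNN.com");
        // Versions may arrive out of order; the highest one still wins on read.
        table.put("com.cnn.www", "anchor:cnnsi.com", 3L, "older value");
        System.out.println(table.latest("com.cnn.www", "anchor:cnnsi.com"));
        System.out.println(table.storedCells());
    }
}
```

Since columns are just map keys inside each row, different rows can carry different columns without any schema change.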
HBase Physical Model

HBase Physical Model
• Each column family is stored in separate files (called HFiles)

• Key & Version numbers are replicated with each column family

• Empty cells are not stored

HBase maintains a multi-level index on values:
<key, column family, column name, timestamp>

Example

Column Families

HBase Regions
• Each table is partitioned horizontally, by row-key range, into regions
• Regions are the counterpart of HDFS blocks

Each will be one region
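
Horizontal partitioning into regions can be sketched as a lookup over sorted start keys. This is a minimal illustration using only the Java standard library; the class `RegionLookup`, the region names, and the split points are hypothetical and do not reflect how HBase actually locates regions.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch of range partitioning; names and split points are made up.
class RegionLookup {
    // start row key (inclusive) -> region name; the first region starts at the empty key.
    private final NavigableMap<String, String> regionsByStartKey = new TreeMap<>();

    void addRegion(String startKey, String regionName) {
        regionsByStartKey.put(startKey, regionName);
    }

    // The region holding a row is the one with the greatest start key <= the row key.
    String regionFor(String rowKey) {
        Map.Entry<String, String> entry = regionsByStartKey.floorEntry(rowKey);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        RegionLookup lookup = new RegionLookup();
        lookup.addRegion("", "region-1");
        lookup.addRegion("com.m", "region-2");
        lookup.addRegion("org", "region-3");
        System.out.println(lookup.regionFor("com.cnn.www"));
        System.out.println(lookup.regionFor("org.apache.www"));
    }
}
```

Because rows are sorted by key, each region covers one contiguous key range, so a single ordered lookup is enough to find the region for any row.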

HBase Architecture

Three Major Components
• The HBaseMaster
• One master

• The HRegionServer
• Many region servers

• The HBase client

HBase Components
• Region
• A subset of a table’s rows, like horizontal range partitioning
• Automatically done

• RegionServer (many slaves)


• Manages data regions
• Serves data for reads and writes (using a log)

• Master
• Responsible for coordinating the slaves
• Assigns regions, detects failures
• Admin functions

https://www.analyticsvidhya.com/blog/2022/10/a-brief-introduction-to-apache-hbase-and-its-architecture/
Big Picture

ZooKeeper
• HBase depends on ZooKeeper

• By default HBase manages the ZooKeeper instance
• E.g., starts and stops ZooKeeper

• HMaster and HRegionServers register themselves with ZooKeeper

Creating a Table
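// Older client API (HBaseAdmin constructor); later releases obtain an Admin from a Connection.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;
Configuration config = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(config);
// Column families are declared up front; a family name must not contain a colon.
HColumnDescriptor[] column = new HColumnDescriptor[2];
column[0] = new HColumnDescriptor("columnFamily1");
column[1] = new HColumnDescriptor("columnFamily2");
HTableDescriptor desc = new HTableDescriptor(Bytes.toBytes("MyTable"));
desc.addFamily(column[0]);
desc.addFamily(column[1]);
admin.createTable(desc);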

Operations On Regions: Get()
• Given a key → return corresponding record

• For each value return the highest version

• Can control the number of versions you want

Operations On Regions: Scan()

Get(): Select value from table where key=‘com.apache.www’ AND label=‘anchor:apache.com’

Row key             Time Stamp   Column “anchor:”
“com.apache.www”    t12
                    t11
                    t10          “anchor:apache.com” = “APACHE”
“com.cnn.www”       t9           “anchor:cnnsi.com” = “CNN”
                    t8           “anchor:my.look.ca” = “CNN.com”
                    t6
                    t5
                    t3
Scan(): Select value from table where anchor=‘cnnsi.com’

Row key             Time Stamp   Column “anchor:”
“com.apache.www”    t12
                    t11
                    t10          “anchor:apache.com” = “APACHE”
“com.cnn.www”       t9           “anchor:cnnsi.com” = “CNN”
                    t8           “anchor:my.look.ca” = “CNN.com”
                    t6
                    t5
                    t3
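
The difference between the two access patterns can be sketched with the sample rows from these slides. The class `GetVsScan` is a hypothetical in-memory stand-in written with the Java standard library only; it does not call the HBase client, and timestamps are left out to keep the focus on lookup by key versus walking all rows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for a table; only the two access patterns matter here.
class GetVsScan {
    // Builds row key -> column ("family:qualifier") -> value, with rows sorted by key.
    static NavigableMap<String, Map<String, String>> sampleTable() {
        NavigableMap<String, Map<String, String>> table = new TreeMap<>();
        Map<String, String> apache = new TreeMap<>();
        apache.put("anchor:apache.com", "APACHE");
        table.put("com.apache.www", apache);
        Map<String, String> cnn = new TreeMap<>();
        cnn.put("anchor:cnnsi.com", "CNN");
        cnn.put("anchor:my.look.ca", "CNN.com");
        table.put("com.cnn.www", cnn);
        return table;
    }

    // Get: go straight to one row by its key, then pick the requested column.
    static String get(NavigableMap<String, Map<String, String>> table,
                      String rowKey, String column) {
        Map<String, String> row = table.get(rowKey);
        return row == null ? null : row.get(column);
    }

    // Scan: walk the rows in key order and collect every value stored under the column.
    static List<String> scan(NavigableMap<String, Map<String, String>> table, String column) {
        List<String> values = new ArrayList<>();
        for (Map<String, String> row : table.values()) {
            String value = row.get(column);
            if (value != null) {
                values.add(value);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        NavigableMap<String, Map<String, String>> table = sampleTable();
        System.out.println(get(table, "com.apache.www", "anchor:apache.com"));
        System.out.println(scan(table, "anchor:cnnsi.com"));
    }
}
```

A get touches one row because the key is known, while a scan without a key has to visit every row in the range it covers.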
Operations On Regions: Put()
• Insert a new record (with a new key), or
• Insert a record for an existing key

Figure labels: implicit version number (timestamp); explicit version number

Operations On Regions: Delete()

• Marking table cells as deleted

• Multiple levels
• Can mark an entire column family as deleted
• Can mark all column families of a given row as deleted

• All operations are logged by the RegionServers


• The log is flushed periodically
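
Marking cells as deleted, rather than erasing them immediately, can be sketched as follows. The class `TombstoneStore` is hypothetical and deliberately simplified: it ignores timestamps (in HBase, delete markers carry timestamps), so a later put to a deleted family or row stays hidden in this toy model.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch of delete markers at two levels: column family and whole row.
class TombstoneStore {
    private final Map<String, String> cells = new TreeMap<>();   // "row/family:qualifier" -> value
    private final Set<String> deletedFamilies = new HashSet<>(); // "row/family"
    private final Set<String> deletedRows = new HashSet<>();     // "row"

    void put(String row, String family, String qualifier, String value) {
        cells.put(row + "/" + family + ":" + qualifier, value);
    }

    // Marks one column family of a row as deleted; the stored cells are left untouched.
    void deleteFamily(String row, String family) {
        deletedFamilies.add(row + "/" + family);
    }

    // Marks all column families of a row as deleted.
    void deleteRow(String row) {
        deletedRows.add(row);
    }

    // Reads hide anything covered by a delete marker.
    String get(String row, String family, String qualifier) {
        if (deletedRows.contains(row) || deletedFamilies.contains(row + "/" + family)) {
            return null;
        }
        return cells.get(row + "/" + family + ":" + qualifier);
    }

    // Number of cells still physically present, including those hidden by markers.
    int physicalCellCount() {
        return cells.size();
    }

    public static void main(String[] args) {
        TombstoneStore store = new TombstoneStore();
        store.put("com.cnn.www", "anchor", "cnnsi.com", "CNN");
        store.put("com.cnn.www", "contents", "html", "<html>");
        store.deleteFamily("com.cnn.www", "anchor");
        System.out.println(store.get("com.cnn.www", "anchor", "cnnsi.com"));
        System.out.println(store.get("com.cnn.www", "contents", "html"));
        System.out.println(store.physicalCellCount());
    }
}
```

Keeping the marker separate from the data means a delete is just another write, which fits a store whose updates are not done in place.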

HBase: Joins
• HBase does not support joins

• Can be done in the application layer


• Using scan() and get() operations
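
An application-layer join might look like the sketch below: scan one table and, for each row, do a point lookup in the other. The class `ClientSideJoin`, the two in-memory maps, and the order and customer keys are all hypothetical; a real client would issue scan() and get() calls against HBase tables instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical tables held in memory; only the join pattern is the point.
class ClientSideJoin {
    // Pairs every order with its customer's name.
    // orders: order row key -> customer row key; customers: customer row key -> name.
    static List<String> join(NavigableMap<String, String> orders, Map<String, String> customers) {
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, String> order : orders.entrySet()) { // plays the role of scan()
            String name = customers.get(order.getValue());          // plays the role of get()
            if (name != null) {
                joined.add(order.getKey() + " -> " + name);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> orders = new TreeMap<>();
        orders.put("order-001", "cust-b");
        orders.put("order-002", "cust-a");
        Map<String, String> customers = new TreeMap<>();
        customers.put("cust-a", "Alice");
        customers.put("cust-b", "Bob");
        System.out.println(join(orders, customers));
    }
}
```

The cost is one lookup per scanned row, which is why schemas for such stores are often denormalized so that joins are rarely needed.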

Altering a Table

Disable the table before changing the schema

Logging Operations

HBase Deployment

Master node

Slave nodes

HBase vs. HDFS

HBase vs. RDBMS

When to use HBase

References
• https://www.bmc.com/blogs/hadoop-hbase/
• https://towardsdatascience.com/hbase-working-principle-a-part-of-hadoop-architecture-fbe0453a031b
• https://medium.com/hands-on-apache-hbase/an-introduction-to-apache-hbase-2cdd1d9ff13
• https://builtin.com/data-science/hbase
• https://www.analyticsvidhya.com/blog/2022/10/a-brief-introduction-to-apache-hbase-and-its-architecture/
• https://www.tutorialspoint.com/hbase/hbase_overview.htm
• https://www.geeksforgeeks.org/apache-hbase/
• https://www.simplilearn.com/tutorials/hadoop-tutorial/hbase
