HBase

HBase is a distributed column-oriented database built on HDFS. It provides Bigtable-style capabilities for Hadoop, including fast random reads and writes and incremental data loading. HBase partitions tables into regions that are distributed across region servers for scalability. The HBase master coordinates region assignments and failures across the region servers.

Uploaded by

Chris Harris

CS525: Special Topics in DBs

Large-Scale Data Management

HBase

Spring 2013
WPI, Mohamed Eltabakh

HBase: Overview
• HBase is a distributed column-oriented data store built on top of HDFS

• HBase is an Apache open source project whose goal is to provide storage for the Hadoop distributed computing environment

• Data is logically organized into tables, rows and columns

HBase: Part of Hadoop’s Ecosystem

HBase is built on top of HDFS

HBase files are internally stored in HDFS

HBase vs. HDFS
• Both are distributed systems that scale to hundreds or thousands of nodes

• HDFS is good for batch processing (scans over big files)
  • Not good for record lookup
  • Not good for incremental addition of small batches
  • Not good for updates

HBase vs. HDFS (Cont’d)
• HBase is designed to efficiently address the above points
  • Fast record lookup
  • Support for record-level insertion
  • Support for updates (not in place)

• HBase updates are done by creating new versions of values

HBase vs. HDFS (Cont’d)

If the application needs neither random reads nor random writes → stick to HDFS

HBase Data Model
• HBase is based on Google’s Bigtable model
• Key-Value pairs

[Figure: structure of a key-value pair; labels: Row key, Column Family, TimeStamp, value]

HBase Logical View

HBase: Keys and Column Families
Each record is divided into Column Families

Each row has a Key

Each column family consists of one or more Columns

Column family named “anchor”
Column family named “Contents”
Column named “apache.com”

• Key
  • Byte array
  • Serves as the primary key for the table
  • Indexed for fast lookup
• Column Family
  • Has a name (string)
  • Contains one or more related columns
• Column
  • Belongs to one column family
  • Included inside the row
  • familyName:columnName

Row key            Time Stamp   Column “contents:”   Column “anchor:”
“com.apache.www”   t12          “<html>…”
                   t11          “<html>…”
                   t10                               “anchor:apache.com” = “APACHE”
“com.cnn.www”      t15                               “anchor:cnnsi.com” = “CNN”
                   t13                               “anchor:my.look.ca” = “CNN.com”
                   t6           “<html>…”
                   t5           “<html>…”
                   t3           “<html>…”

Version number for each row

• Version Number
  • Unique within each key
  • By default, the system’s timestamp
  • Data type is Long
• Value (Cell)
  • Byte array

Row key            Time Stamp   Column “contents:”   Column “anchor:”
“com.apache.www”   t12          “<html>…”
                   t11          “<html>…”
                   t10                               “anchor:apache.com” = “APACHE”
“com.cnn.www”      t15                               “anchor:cnnsi.com” = “CNN”
                   t13                               “anchor:my.look.ca” = “CNN.com”
                   t6           “<html>…”
                   t5           “<html>…”
                   t3           “<html>…”

Notes on Data Model
• HBase schema consists of several Tables
• Each table consists of a set of Column Families
  • Columns are not part of the schema

• HBase has Dynamic Columns
  • Because column names are encoded inside the cells
  • Different cells can have different columns

“Roles” column family has different columns in different cells

Notes on Data Model (Cont’d)
• The version number can be user-supplied
  • It does not even have to be inserted in increasing order
  • Version numbers are unique within each key

• Table can be very sparse
  • Many cells are empty

Has two columns [cnnsi.com & my.look.ca]

• Keys are indexed as the primary key
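
One way to picture the logical model above is as nested sorted maps: row key, then column name, then version, then value. The sketch below is a plain-Java illustration, not the HBase client API; the class name `LogicalTableSketch` and its methods are hypothetical, and it uses strings where HBase uses byte arrays.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory model of the logical view; this is not the HBase client API.
class LogicalTableSketch {
    // row key -> "family:column" -> version (timestamp, newest first) -> value
    final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows = new TreeMap<>();

    void put(String rowKey, String column, long version, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.<Long>reverseOrder()))
            .put(version, value);
    }

    // Counts the cells that actually exist; empty cells are simply absent from the maps.
    int storedCells() {
        int n = 0;
        for (NavigableMap<String, NavigableMap<Long, String>> columns : rows.values()) {
            for (NavigableMap<Long, String> versions : columns.values()) {
                n += versions.size();
            }
        }
        return n;
    }

    public static void main(String[] args) {
        LogicalTableSketch t = new LogicalTableSketch();
        // Columns are dynamic: each row may carry a different set of "anchor:" columns.
        t.put("com.apache.www", "anchor:apache.com", 10L, "APACHE");
        t.put("com.cnn.www", "anchor:cnnsi.com", 15L, "CNN");
        t.put("com.cnn.www", "anchor:my.look.ca", 13L, "CNN.com");
        System.out.println(t.rows.get("com.cnn.www").keySet());
        System.out.println(t.storedCells());
    }
}
```

Because a column only exists in the rows that wrote it, sparseness costs nothing in this model.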


HBase Physical Model
• Each column family is stored in separate files (called HFiles)

• Key & Version numbers are replicated with each column family

• Empty cells are not stored
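
The three points above can be sketched in plain Java, assuming one independent list of cells per column family. The names `FamilyStoreSketch` and `Cell` are hypothetical and the on-disk format is not modelled; the sketch only shows that every stored cell repeats its row key and version, and that an empty cell is never written.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of per-family storage; not the HBase file format.
class FamilyStoreSketch {
    static final class Cell {
        final String rowKey;     // replicated in every cell of every family
        final String qualifier;
        final long version;      // replicated as well
        final String value;

        Cell(String rowKey, String qualifier, long version, String value) {
            this.rowKey = rowKey;
            this.qualifier = qualifier;
            this.version = version;
            this.value = value;
        }
    }

    // column family name -> cells stored for that family only
    final Map<String, List<Cell>> stores = new LinkedHashMap<>();

    void put(String rowKey, String family, String qualifier, long version, String value) {
        if (value == null) {
            return; // an empty cell is not stored at all
        }
        stores.computeIfAbsent(family, f -> new ArrayList<>())
              .add(new Cell(rowKey, qualifier, version, value));
    }

    public static void main(String[] args) {
        FamilyStoreSketch s = new FamilyStoreSketch();
        s.put("com.cnn.www", "contents", "html", 6L, "page-v6"); // "page-v6" is a placeholder value
        s.put("com.cnn.www", "anchor", "cnnsi.com", 15L, "CNN");
        s.put("com.cnn.www", "anchor", "apache.com", 15L, null); // empty cell, skipped
        System.out.println(s.stores.keySet());
    }
}
```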

Example

Column Families

HBase Regions
• Each table is partitioned horizontally (by row-key range) into regions
• Regions are the counterpart of HDFS blocks

Each will be one region

HBase Architecture

Three Major Components
• The HBaseMaster
  • One master

• The HRegionServer
  • Many region servers

• The HBase client

HBase Components
• Region
  • A subset of a table’s rows, like horizontal range partitioning
  • Automatically done

• RegionServer (many slaves)
  • Manages data regions
  • Serves data for reads and writes (using a log)

• Master
  • Responsible for coordinating the slaves
  • Assigns regions, detects failures
  • Admin functions
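
Since a region is a contiguous range of row keys, finding the server that holds a given row amounts to a sorted-map lookup on the region start keys. The sketch below illustrates that idea in plain Java. `RegionLookupSketch`, the start keys and the server names `rs1` to `rs3` are all hypothetical, and HBase's actual lookup mechanism is not shown.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical range-partitioning lookup; not HBase's actual region lookup code.
class RegionLookupSketch {
    // start row key of each region -> name of the region server hosting it
    private final TreeMap<String, String> regionStartToServer = new TreeMap<>();

    void assign(String startKey, String server) {
        regionStartToServer.put(startKey, server);
    }

    // The region responsible for a row is the one with the greatest start key <= row key.
    String serverFor(String rowKey) {
        Map.Entry<String, String> region = regionStartToServer.floorEntry(rowKey);
        if (region == null) {
            throw new IllegalStateException("no region covers row key " + rowKey);
        }
        return region.getValue();
    }

    public static void main(String[] args) {
        RegionLookupSketch master = new RegionLookupSketch();
        master.assign("", "rs1");        // first region starts at the empty key
        master.assign("com.cnn", "rs2");
        master.assign("org", "rs3");
        System.out.println(master.serverFor("com.apache.www"));
        System.out.println(master.serverFor("com.cnn.www"));
    }
}
```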

Big Picture

ZooKeeper
• HBase depends on ZooKeeper

• By default HBase manages the ZooKeeper instance
  • E.g., starts and stops ZooKeeper

• HMaster and HRegionServers register themselves with ZooKeeper

Creating a Table
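import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;
// Load the cluster settings (hbase-site.xml) from the classpath
Configuration config = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(config);
// Only column families are declared; names carry no trailing ':' (newer HBase versions reject it)
HColumnDescriptor[] column = new HColumnDescriptor[2];
column[0] = new HColumnDescriptor("columnFamily1");
column[1] = new HColumnDescriptor("columnFamily2");
HTableDescriptor desc = new HTableDescriptor(Bytes.toBytes("MyTable"));
desc.addFamily(column[0]);
desc.addFamily(column[1]);
admin.createTable(desc);
admin.close();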

Operations On Regions: Get()
• Given a key → return the corresponding record

• For each value, return the highest version

• Can control the number of versions you want
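
The semantics of Get() described above can be sketched in plain Java, assuming each cell keeps its versions sorted newest first. `GetSketch` and its `get(rowKey, maxVersions)` method are hypothetical names, not the HBase client API, and the `page-v…` values are placeholders.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical model of Get(); not the HBase client API.
class GetSketch {
    // row key -> column -> versions sorted newest first
    final Map<String, Map<String, NavigableMap<Long, String>>> rows = new TreeMap<>();

    void put(String rowKey, String column, long version, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.<Long>reverseOrder()))
            .put(version, value);
    }

    // For every column of the row, returns at most maxVersions values, newest first.
    Map<String, List<String>> get(String rowKey, int maxVersions) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        Map<String, NavigableMap<Long, String>> columns = rows.get(rowKey);
        if (columns == null) {
            return result; // unknown key: empty record
        }
        for (Map.Entry<String, NavigableMap<Long, String>> e : columns.entrySet()) {
            List<String> newest = new ArrayList<>();
            for (String value : e.getValue().values()) {
                if (newest.size() == maxVersions) {
                    break;
                }
                newest.add(value);
            }
            result.put(e.getKey(), newest);
        }
        return result;
    }

    public static void main(String[] args) {
        GetSketch table = new GetSketch();
        table.put("com.cnn.www", "contents:html", 3L, "page-v3");
        table.put("com.cnn.www", "contents:html", 5L, "page-v5");
        table.put("com.cnn.www", "contents:html", 6L, "page-v6");
        System.out.println(table.get("com.cnn.www", 1)); // highest version only
        System.out.println(table.get("com.cnn.www", 2)); // two newest versions
    }
}
```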

Operations On Regions: Scan()

Get()
Select value from table where key=‘com.apache.www’ AND label=‘anchor:apache.com’

Row key            Time Stamp   Column “anchor:”
“com.apache.www”   t12
                   t11
                   t10          “anchor:apache.com” = “APACHE”
“com.cnn.www”      t9           “anchor:cnnsi.com” = “CNN”
                   t8           “anchor:my.look.ca” = “CNN.com”
                   t6
                   t5
                   t3
Scan()
Select value from table where anchor=‘cnnsi.com’

Row key            Time Stamp   Column “anchor:”
“com.apache.www”   t12
                   t11
                   t10          “anchor:apache.com” = “APACHE”
“com.cnn.www”      t9           “anchor:cnnsi.com” = “CNN”
                   t8           “anchor:my.look.ca” = “CNN.com”
                   t6
                   t5
                   t3
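
Unlike Get(), a scan does not start from one known key: it walks rows in key order and keeps those that match. A minimal plain-Java sketch of the query above follows, assuming rows are held in a sorted map. `ScanSketch` and the optional start/stop row bounds are hypothetical illustrations, not the HBase client API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical model of Scan(); not the HBase client API.
class ScanSketch {
    // row key (sorted) -> column -> versions sorted newest first
    final NavigableMap<String, Map<String, NavigableMap<Long, String>>> rows = new TreeMap<>();

    void put(String rowKey, String column, long version, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.<Long>reverseOrder()))
            .put(version, value);
    }

    // Visits rows with startRow <= key < stopRow in key order (stopRow == null means "to the end")
    // and collects the newest value of the requested column wherever that column exists.
    List<String> scan(String startRow, String stopRow, String column) {
        NavigableMap<String, Map<String, NavigableMap<Long, String>>> range =
            (stopRow == null) ? rows.tailMap(startRow, true)
                              : rows.subMap(startRow, true, stopRow, false);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Map<String, NavigableMap<Long, String>>> row : range.entrySet()) {
            NavigableMap<Long, String> versions = row.getValue().get(column);
            if (versions != null && !versions.isEmpty()) {
                hits.add(row.getKey() + " -> " + versions.firstEntry().getValue());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        ScanSketch table = new ScanSketch();
        table.put("com.apache.www", "anchor:apache.com", 10L, "APACHE");
        table.put("com.cnn.www", "anchor:cnnsi.com", 9L, "CNN");
        table.put("com.cnn.www", "anchor:my.look.ca", 8L, "CNN.com");
        System.out.println(table.scan("", null, "anchor:cnnsi.com"));
    }
}
```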
Operations On Regions: Put()
• Insert a new record (with a new key), or

• Insert a record for an existing key

Implicit version number (timestamp)

Explicit version number
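
The two flavours of Put() can be sketched in plain Java: one overload stamps the value with the current system time, the other takes a user-supplied version. `PutSketch` is a hypothetical in-memory model, not the HBase client API. Note that a put on an existing key adds a version and does not overwrite the old value.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical model of Put(); not the HBase client API.
class PutSketch {
    // row key -> column -> versions sorted newest first
    final Map<String, Map<String, NavigableMap<Long, String>>> rows = new TreeMap<>();

    // Implicit version number: the current system timestamp.
    void put(String rowKey, String column, String value) {
        put(rowKey, column, System.currentTimeMillis(), value);
    }

    // Explicit version number: supplied by the caller, in any order.
    void put(String rowKey, String column, long version, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.<Long>reverseOrder()))
            .put(version, value);
    }

    // Value with the highest version, or null if the cell does not exist.
    String latest(String rowKey, String column) {
        Map<String, NavigableMap<Long, String>> columns = rows.get(rowKey);
        if (columns == null || !columns.containsKey(column)) {
            return null;
        }
        return columns.get(column).firstEntry().getValue();
    }

    public static void main(String[] args) {
        PutSketch table = new PutSketch();
        table.put("com.cnn.www", "anchor:cnnsi.com", 5L, "five");
        table.put("com.cnn.www", "anchor:cnnsi.com", 3L, "three"); // older version inserted later
        System.out.println(table.latest("com.cnn.www", "anchor:cnnsi.com"));
        table.put("com.cnn.www", "anchor:cnnsi.com", "now");       // implicit timestamp
        System.out.println(table.latest("com.cnn.www", "anchor:cnnsi.com"));
    }
}
```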

Operations On Regions: Delete()

• Marks table cells as deleted

• Multiple levels
  • Can mark an entire column family as deleted
  • Can mark all column families of a given row as deleted
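
Deleting by marking can be sketched in plain Java as a set of markers that reads consult before returning a value, while the stored data stays untouched. `DeleteSketch` and its three marker sets are hypothetical. As a deliberate simplification, markers here carry no timestamp, so a marked cell stays hidden even if it is written again.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical model of delete markers; not the HBase client API.
class DeleteSketch {
    // row key -> "family:qualifier" -> value (single version, for brevity)
    final Map<String, Map<String, String>> rows = new TreeMap<>();
    final Set<String> deletedColumns = new HashSet<>();  // "row|family:qualifier"
    final Set<String> deletedFamilies = new HashSet<>(); // "row|family"
    final Set<String> deletedRows = new HashSet<>();     // "row"

    void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    void deleteColumn(String rowKey, String column) { deletedColumns.add(rowKey + "|" + column); }
    void deleteFamily(String rowKey, String family) { deletedFamilies.add(rowKey + "|" + family); }
    void deleteRow(String rowKey) { deletedRows.add(rowKey); }

    // Returns the value unless a marker at cell, family or row level hides it.
    String get(String rowKey, String column) {
        int colon = column.indexOf(':');
        String family = (colon < 0) ? column : column.substring(0, colon);
        if (deletedRows.contains(rowKey)
                || deletedFamilies.contains(rowKey + "|" + family)
                || deletedColumns.contains(rowKey + "|" + column)) {
            return null;
        }
        Map<String, String> columns = rows.get(rowKey);
        return (columns == null) ? null : columns.get(column);
    }

    public static void main(String[] args) {
        DeleteSketch table = new DeleteSketch();
        table.put("com.cnn.www", "anchor:cnnsi.com", "CNN");
        table.put("com.cnn.www", "contents:html", "page"); // placeholder value
        table.deleteFamily("com.cnn.www", "anchor");
        System.out.println(table.get("com.cnn.www", "anchor:cnnsi.com")); // hidden by the marker
        System.out.println(table.get("com.cnn.www", "contents:html"));    // still visible
    }
}
```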

HBase: Joins
• HBase does not support joins

• Can be done in the application layer
  • Using scan() and get() operations
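
An application-layer join along these lines might scan one table and issue a get on the other for each row found. The sketch below uses plain Java maps to stand in for two tables. The `orders` and `customers` tables, their keys and the class `AppJoinSketch` are hypothetical examples that do not come from the slides.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical application-side join over two in-memory "tables".
class AppJoinSketch {
    // orders: order id -> customer id (row-sorted); customers: customer id -> name
    static List<String> join(NavigableMap<String, String> orders, Map<String, String> customers) {
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, String> order : orders.entrySet()) {    // plays the role of scan()
            String customerName = customers.get(order.getValue());     // plays the role of get()
            if (customerName != null) {                                 // inner join: skip unmatched rows
                joined.add(order.getKey() + " -> " + customerName);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> orders = new TreeMap<>();
        orders.put("o1", "c1");
        orders.put("o2", "c2");
        orders.put("o3", "c9"); // no such customer
        Map<String, String> customers = new HashMap<>();
        customers.put("c1", "Ann");
        customers.put("c2", "Bob");
        System.out.println(join(orders, customers));
    }
}
```

This issues one get per scanned row, so the cost grows with the size of the scanned table.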

Altering a Table

Logging Operations

HBase Deployment

Master node

Slave nodes

HBase vs. HDFS

HBase vs. RDBMS

When to use HBase
