Cs525: Special Topics in DBS: Large-Scale Data Management

This document provides an overview of HBase, an open source, distributed, column-oriented database built on top of HDFS. It describes HBase's data model using tables, rows, columns and versions, its logical and physical storage layout, architecture involving a master and region servers, basic operations like get, put and scan, and how it compares to HDFS and relational databases. The key aspects covered are its scalability for large datasets, real-time random read/write capabilities and suitability for applications with large amounts of structured or semi-structured data.

Uploaded by

Woya Ma

CS525: Special Topics in DBs

Large-Scale Data Management

HBase

HBase: Overview
• HBase is a distributed, column-oriented data store built on top of HDFS

• HBase is an Apache open source project whose goal is to provide storage for Hadoop distributed computing

• Data is logically organized into tables, rows, and columns

HBase vs. HDFS (Cont’d)

If the application needs neither random reads nor random writes → stick to HDFS

HBase Data Model

• HBase is based on Google’s Bigtable model
• Key-Value pairs

HBase Logical View

HBase: Keys and Column Families
Each record is divided into Column Families

Each row has a Key

Each column family consists of one or more Columns

Column family named “anchor”
Column family named “Contents”
Column named “apache.com”

• Key
• Byte array
• Serves as the primary key for the table
• Indexed for fast lookup

• Column Family
• Has a name (string)
• Contains one or more related columns

• Column
• Belongs to one column family
• Included inside the row
• familyName:columnName
Version number for each row

• Version Number
• Unique within each key
• By default → system’s timestamp
• Data type is Long

• Value (Cell)
• Byte array
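
The key, column family, column, version, and value described on these slides can be pictured as a nested sorted map. The following is a minimal sketch in plain Java, not HBase code: the class name `ToyTable` and its methods are hypothetical, and it assumes `String` keys and values for readability where HBase itself stores byte arrays.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy model of one table: row key -> "family:column" -> version -> value. */
final class ToyTable {
    // Rows stay sorted by key, mirroring the indexed primary key (assumption: String keys).
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    /** Stores a value under (row key, family:column, version). */
    void put(String rowKey, String family, String column, long version, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(family + ":" + column, k -> new TreeMap<>())
            .put(version, value);
    }

    /** Returns the value with the highest version, or null if the cell was never written. */
    String getLatest(String rowKey, String family, String column) {
        NavigableMap<String, NavigableMap<Long, String>> row = rows.get(rowKey);
        if (row == null) return null;
        NavigableMap<Long, String> versions = row.get(family + ":" + column);
        return versions == null ? null : versions.lastEntry().getValue();
    }

    public static void main(String[] args) {
        ToyTable t = new ToyTable();
        t.put("com.cnn.www", "anchor", "cnnsi.com", 9L, "CNN");
        t.put("com.cnn.www", "anchor", "my.look.ca", 8L, "CNN.com");
        System.out.println(t.getLatest("com.cnn.www", "anchor", "cnnsi.com")); // prints CNN
    }
}
```

Because columns are just map keys inside a row, two rows can hold different columns in the same family without any schema change.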

Notes on Data Model
• HBase schema consists of several Tables

• Each table consists of a set of Column Families


• Columns are not part of the schema

• HBase has Dynamic Columns


• Because column names are encoded inside the cells
• Different cells can have different columns

“Roles” column family has different columns in different cells

Notes on Data Model (Cont’d)
• The version number can be user-supplied
• It does not even have to be inserted in increasing order
• Version numbers are unique within each key

• Table can be very sparse
• Many cells are empty

Has two columns [cnnsi.com & my.look.ca]

• Keys are indexed as the primary key


HBase Physical Model

• Each column family is stored in separate files (called HFiles)

• Key & Version numbers are replicated with each column family

• Empty cells are not stored

HBase maintains a multi-level index on values: <key, column family, column name, timestamp>
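
To make the physical layout more concrete, here is a small sketch in plain Java (not HBase code) that groups cells into one sorted list per column family. The class `ToyStore`, the sort order (row key, then column, then newest timestamp first), and the `contents:html` column are assumptions made for illustration. Note how the row key and timestamp are repeated in every entry and how empty cells simply never appear.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ToyStore {
    /** One stored cell; row key and timestamp are repeated in every entry. */
    static final class Cell {
        final String rowKey;
        final String family;
        final String column;
        final long timestamp;
        final String value;

        Cell(String rowKey, String family, String column, long timestamp, String value) {
            this.rowKey = rowKey;
            this.family = family;
            this.column = column;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Assumed sort order: row key, then column name, then newest timestamp first.
    static final Comparator<Cell> ORDER = Comparator
            .comparing((Cell c) -> c.rowKey)
            .thenComparing((Cell c) -> c.column)
            .thenComparing(Comparator.comparingLong((Cell c) -> c.timestamp).reversed());

    /** Groups cells into one sorted list per column family (one "file" per family). */
    static Map<String, List<Cell>> layout(List<Cell> cells) {
        Map<String, List<Cell>> files = new TreeMap<>();
        for (Cell c : cells) {
            files.computeIfAbsent(c.family, f -> new ArrayList<>()).add(c);
        }
        for (List<Cell> file : files.values()) {
            file.sort(ORDER);
        }
        return files;
    }

    public static void main(String[] args) {
        List<Cell> cells = new ArrayList<>();
        cells.add(new Cell("com.cnn.www", "anchor", "my.look.ca", 8L, "CNN.com"));
        cells.add(new Cell("com.cnn.www", "anchor", "cnnsi.com", 9L, "CNN"));
        cells.add(new Cell("com.cnn.www", "contents", "html", 6L, "<html>...")); // hypothetical column
        for (Map.Entry<String, List<Cell>> file : layout(cells).entrySet()) {
            for (Cell c : file.getValue()) {
                System.out.println(file.getKey() + ": <" + c.rowKey + ", " + c.family + ", "
                        + c.column + ", " + c.timestamp + "> = " + c.value);
            }
        }
    }
}
```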

Example

Column Families

HBase Regions
• Each table is partitioned horizontally, by row-key range, into regions
• Regions are the counterpart of HDFS blocks

Each will be one region
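
Horizontal range partitioning can be sketched as a sorted map from each region's start key to the region itself. The example below is plain Java, not the HBase client: `ToyRegionMap`, the region names, and the split point `"com.cnn"` are all hypothetical. It only illustrates how a row key is routed to the region whose key range contains it.

```java
import java.util.Map;
import java.util.TreeMap;

final class ToyRegionMap {
    // Maps each region's start key to a region name; "" is the start key of the first region.
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    void addRegion(String startKey, String regionName) {
        regionsByStartKey.put(startKey, regionName);
    }

    /** Finds the region covering rowKey: the one with the greatest start key <= rowKey. */
    String locate(String rowKey) {
        Map.Entry<String, String> e = regionsByStartKey.floorEntry(rowKey);
        if (e == null) {
            throw new IllegalStateException("no region covers " + rowKey);
        }
        return e.getValue();
    }

    public static void main(String[] args) {
        ToyRegionMap m = new ToyRegionMap();
        m.addRegion("", "region-1");        // rows before "com.cnn"
        m.addRegion("com.cnn", "region-2"); // rows from "com.cnn" onward
        System.out.println(m.locate("com.apache.www")); // prints region-1
        System.out.println(m.locate("com.cnn.www"));    // prints region-2
    }
}
```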

HBase Architecture

Three Major Components

• The HBaseMaster
• One master

• The HRegionServer
• Many region servers

• The HBase client

HBase Components
• Region
• A subset of a table’s rows, like horizontal range partitioning
• Automatically done

• RegionServer (many slaves)


• Manages data regions
• Serves data for reads and writes (using a log)

• Master
• Responsible for coordinating the slaves
• Assigns regions, detects failures
• Admin functions

Big Picture

ZooKeeper
• HBase depends on ZooKeeper

• By default HBase manages the ZooKeeper instance
• E.g., starts and stops ZooKeeper

• HMaster and HRegionServers register themselves with ZooKeeper

Creating a Table
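import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

// Older (pre-2.0) client API; config is read from hbase-site.xml on the classpath
Configuration config = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(config);
// One descriptor per column family; later releases reject ':' in a family name
HColumnDescriptor[] column = new HColumnDescriptor[2];
column[0] = new HColumnDescriptor("columnFamily1");
column[1] = new HColumnDescriptor("columnFamily2");
HTableDescriptor desc = new HTableDescriptor(Bytes.toBytes("MyTable"));

desc.addFamily(column[0]);
desc.addFamily(column[1]);
admin.createTable(desc);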

Operations On Regions: Get()
• Given a key → return corresponding record

• For each value return the highest version

• Can control the number of versions you want
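
The version handling of a get can be illustrated without a cluster. The sketch below is plain Java rather than the HBase client API; `ToyGet` and its `get` method are hypothetical names, and a single cell is modelled as a map from version number to value. It returns the newest value by default and more versions on request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

final class ToyGet {
    /** Returns up to maxVersions values of one cell, highest version first. */
    static List<String> get(NavigableMap<Long, String> versions, int maxVersions) {
        List<String> out = new ArrayList<>();
        for (String value : versions.descendingMap().values()) {
            if (out.size() == maxVersions) {
                break;
            }
            out.add(value);
        }
        return out;
    }

    public static void main(String[] args) {
        NavigableMap<Long, String> cell = new TreeMap<>();
        cell.put(5L, "v5");
        cell.put(3L, "v3"); // versions need not arrive in increasing order
        cell.put(6L, "v6");
        System.out.println(get(cell, 1)); // prints [v6]
        System.out.println(get(cell, 2)); // prints [v6, v5]
    }
}
```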

Operations On Regions: Scan()

Get()
Select value from table where key=‘com.apache.www’ AND label=‘anchor:apache.com’

Row key            Time stamp   Column “anchor:”
“com.apache.www”   t12
                   t11
                   t10          “anchor:apache.com”   “APACHE”
“com.cnn.www”      t9           “anchor:cnnsi.com”    “CNN”
                   t8           “anchor:my.look.ca”   “CNN.com”
                   t6
                   t5
                   t3

Scan()
Select value from table where anchor=‘cnnsi.com’

Row key            Time stamp   Column “anchor:”
“com.apache.www”   t12
                   t11
                   t10          “anchor:apache.com”   “APACHE”
“com.cnn.www”      t9           “anchor:cnnsi.com”    “CNN”
                   t8           “anchor:my.look.ca”   “CNN.com”
                   t6
                   t5
                   t3
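
The difference between the two queries can be sketched in plain Java over the same example rows. This is not the HBase client API: `ToyScan`, `exampleRows`, and `scan` are hypothetical names, and timestamps are left out for brevity. A get goes straight to one row by its key, while a scan walks the rows in key order and keeps those that have the requested column.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ToyScan {
    /** Builds the two example rows, keyed by row key, then by "family:column". */
    static TreeMap<String, Map<String, String>> exampleRows() {
        TreeMap<String, Map<String, String>> rows = new TreeMap<>();
        Map<String, String> apache = new TreeMap<>();
        apache.put("anchor:apache.com", "APACHE");
        rows.put("com.apache.www", apache);
        Map<String, String> cnn = new TreeMap<>();
        cnn.put("anchor:cnnsi.com", "CNN");
        cnn.put("anchor:my.look.ca", "CNN.com");
        rows.put("com.cnn.www", cnn);
        return rows;
    }

    /** Walks all rows in key order and collects the values stored under the given column. */
    static List<String> scan(TreeMap<String, Map<String, String>> rows, String column) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> row : rows.entrySet()) {
            String value = row.getValue().get(column);
            if (value != null) {
                hits.add(row.getKey() + " -> " + value);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeMap<String, Map<String, String>> rows = exampleRows();
        // Get: direct lookup by row key, then by column.
        System.out.println(rows.get("com.apache.www").get("anchor:apache.com")); // prints APACHE
        // Scan: visit every row and filter on the column.
        System.out.println(scan(rows, "anchor:cnnsi.com")); // prints [com.cnn.www -> CNN]
    }
}
```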
Operations On Regions: Put()
• Insert a new record (with a new key), or

• Insert a record for an existing key

Implicit version number (timestamp)

Explicit version number

Operations On Regions: Delete()
• Marking table cells as deleted

• Multiple levels
• Can mark an entire column family as deleted
• Can mark all column families of a given row as deleted

• All operations are logged by the RegionServers


• The log is flushed periodically
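
Marking data as deleted at several levels can be sketched with a set of delete markers that reads consult before returning a value. The code below is a simplified plain-Java illustration, not HBase's implementation: `ToyDelete` and its methods are hypothetical, and the sketch ignores versions, so a marker hides the data regardless of timestamps.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ToyDelete {
    // Cells are keyed by [row, family, column]; nothing is removed on delete.
    private final Map<List<String>, String> cells = new HashMap<>();
    // Markers are [row], [row, family], or [row, family, column].
    private final Set<List<String>> markers = new HashSet<>();

    void put(String row, String family, String column, String value) {
        cells.put(Arrays.asList(row, family, column), value);
    }

    void deleteCell(String row, String family, String column) {
        markers.add(Arrays.asList(row, family, column));
    }

    void deleteFamily(String row, String family) {
        markers.add(Arrays.asList(row, family));
    }

    void deleteRow(String row) {
        markers.add(Arrays.asList(row));
    }

    /** Returns the value unless a marker at the cell, family, or row level hides it. */
    String get(String row, String family, String column) {
        boolean deleted = markers.contains(Arrays.asList(row))
                || markers.contains(Arrays.asList(row, family))
                || markers.contains(Arrays.asList(row, family, column));
        return deleted ? null : cells.get(Arrays.asList(row, family, column));
    }

    public static void main(String[] args) {
        ToyDelete t = new ToyDelete();
        t.put("com.cnn.www", "anchor", "cnnsi.com", "CNN");
        t.put("com.cnn.www", "anchor", "my.look.ca", "CNN.com");
        t.deleteCell("com.cnn.www", "anchor", "cnnsi.com");
        System.out.println(t.get("com.cnn.www", "anchor", "cnnsi.com"));  // prints null
        System.out.println(t.get("com.cnn.www", "anchor", "my.look.ca")); // prints CNN.com
        t.deleteFamily("com.cnn.www", "anchor");
        System.out.println(t.get("com.cnn.www", "anchor", "my.look.ca")); // prints null
    }
}
```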

HBase: Joins
• HBase does not support joins

• Can be done in the application layer


• Using scan() and get() operations
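
An application-layer join can be as simple as scanning one table and issuing a get against the other for each row. The sketch below uses plain Java maps in place of real tables; `ToyJoin`, the `orders` and `customers` tables, and their contents are hypothetical and stand in for whatever two tables an application needs to combine.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ToyJoin {
    /** Scans orders (order id -> customer id) and looks up each customer name by key. */
    static List<String> join(TreeMap<String, String> orders, Map<String, String> customers) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> order : orders.entrySet()) { // plays the role of scan()
            String name = customers.get(order.getValue());          // plays the role of get()
            if (name != null) {                                     // inner join: skip misses
                out.add(order.getKey() + "," + name);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<String, String> orders = new TreeMap<>();
        orders.put("order-1", "cust-7");
        orders.put("order-2", "cust-9");
        Map<String, String> customers = new HashMap<>();
        customers.put("cust-7", "Alice");
        System.out.println(join(orders, customers)); // prints [order-1,Alice]
    }
}
```

The cost of this approach is one lookup per scanned row, which is why the join logic lives in the application rather than in the store.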

Altering a Table

Disable the table before changing the schema

Logging Operations

HBase Deployment

Master node

Slave nodes

HBase vs. HDFS

HBase vs. RDBMS

When to use HBase
