
Big Data Analytics & Technologies: HBase

The document provides an overview of HBase including its conceptual data model, physical data storage, data operations, and architecture. It describes HBase's key concepts, learning outcomes, and key terms. It also covers HBase's data storage mechanism involving tables, rows, column families, and columns.

Uploaded by

Wong pi wen

Big Data Analytics & Technologies
CT047-3-M

HBase
Topic & Structure of The Lesson

• HBase
– Conceptual data model
– Physical data storage
– Data operations
– Architecture

CT047-3-M-BDAT - Big Data Analytics & Technologies Hbase Slide <2> of 9


Learning Outcomes

• At the end of this topic, you should be able to:
– Demonstrate the key concepts of HBase for Big Data Analytics
– Evaluate the HBase architecture



Key Terms You Must Be Able To Use

• If you have mastered this topic, you should be able to use the
following terms correctly in your assignments and exams:
– NoSQL
– HBase architecture
– ZooKeeper
– Column family



HBase: Overview

• HBase is a distributed, column-oriented data store built on top of HDFS
• HBase is an Apache open source project whose goal is to provide
storage for the Hadoop distributed computing environment
• Data is logically organized into tables, rows and columns



HBase: Overview

“HBase™ is the Hadoop database, a distributed, scalable, big data store. It is a
part of the Hadoop ecosystem that provides random real-time read/write access to data
in the Hadoop File System.”
(https://hbase.apache.org)



HBase: Part of Hadoop’s Ecosystem

• HBase is built on top of HDFS
• HBase files are internally stored in HDFS



HBase vs. HDFS

• Both are distributed systems that scale to hundreds or thousands of nodes

• HDFS is good for batch processing (scans over big files)
– Not good for record lookup
– Not good for incremental addition of small batches
– Not good for updates



HBase vs. HDFS (Cont’d)

• HBase is designed to efficiently address the above points
– Fast record lookup
– Support for record-level insertion
– Support for updates

• HBase updates are done by creating new versions of values
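
The idea that an update creates a new version rather than overwriting the old value can be illustrated with a small Python sketch. This is a toy in-memory model, not the HBase client API: the class `VersionedCell`, its method names and the counter used in place of a system timestamp are all assumptions made for illustration.

```python
import itertools


class VersionedCell:
    """Toy model of one cell: an update adds a version, it never overwrites."""

    def __init__(self):
        self._versions = []                # list of (version, value), oldest first
        self._clock = itertools.count(1)   # stand-in for the system timestamp

    def put(self, value):
        version = next(self._clock)
        self._versions.append((version, value))
        return version

    def get(self):
        """Return the value of the newest version (None if never written)."""
        return self._versions[-1][1] if self._versions else None

    def history(self):
        return list(self._versions)


cell = VersionedCell()
cell.put("draft")
cell.put("final")
print(cell.get())      # newest value
print(cell.history())  # both versions are still stored
```

A real HBase table limits how many versions it retains per column family; this sketch simply keeps them all.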



HBase

• Tables have one primary index, the row key.
• No join operators.
• Scans and queries can select a subset of available columns.
• There are three types of lookups:
– Fast lookup using row key and optional timestamp.
– Full table scan.
– Range scan from region start to end.
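
The three lookup types can be sketched over a plain Python dictionary with sorted row keys. The table contents, the function names `get`, `full_scan` and `range_scan`, and the half-open range (start included, stop excluded) are assumptions for this sketch; it does not call any HBase API.

```python
import bisect

# Hypothetical table: row key -> {column: {timestamp: value}}
rows = {
    "row-a": {"info:name": {1: "Ann", 2: "Anna"}},
    "row-b": {"info:name": {1: "Bob"}},
    "row-c": {"info:name": {1: "Cy"}, "info:city": {1: "Oslo"}},
    "row-d": {"info:name": {1: "Dee"}},
}
sorted_keys = sorted(rows)  # the row key is the only index


def get(row_key, column, timestamp=None):
    """Point lookup by row key, optionally at an exact timestamp."""
    versions = rows.get(row_key, {}).get(column, {})
    if not versions:
        return None
    if timestamp is None:
        timestamp = max(versions)  # default: newest version
    return versions.get(timestamp)


def range_scan(start, stop, columns=None):
    """Visit rows with start <= key < stop (None means unbounded)."""
    lo = 0 if start is None else bisect.bisect_left(sorted_keys, start)
    hi = len(sorted_keys) if stop is None else bisect.bisect_left(sorted_keys, stop)
    result = []
    for key in sorted_keys[lo:hi]:
        cols = rows[key]
        if columns is not None:
            # Keep only the requested subset of columns.
            cols = {c: v for c, v in cols.items() if c in columns}
        result.append((key, cols))
    return result


def full_scan(columns=None):
    """Visit every row in key order: a range scan without bounds."""
    return range_scan(None, None, columns)
```

Because there are no secondary indexes and no joins, any query that is not keyed on the row key ends up as one of the two scans.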



HBase is ...

• A distributed data store that can scale horizontally to 1,000s of commodity servers and
petabytes of indexed storage.

• Designed to operate on top of the Hadoop Distributed File System (HDFS) for scalability,
fault tolerance, and high availability.



Features of HBase

• HBase is linearly scalable.
• It provides consistent reads and writes.
• It integrates with Hadoop, both as a source and a destination.
• It has an easy-to-use Java API for clients.
• It provides data replication across clusters.



Storage Mechanism in HBase

• Table is a collection of rows.
• Row is a collection of column families.
• Column family is a collection of columns.
• Column is a collection of key-value pairs.

Rowid | Column Family      | Column Family      | Column Family
      | col1 | col2 | col3 | col1 | col2 | col3 | col1 | col2 | col3
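
One way to picture this nesting is as maps inside maps. The sketch below is a rough Python illustration only: the family names `personal` and `professional`, the sample values and the helper `read` are made up, and the innermost map from timestamp to value is an assumption about what "key-value pairs" means here.

```python
# table -> row key -> column family -> column -> {timestamp: value}
table = {
    "row1": {
        "personal": {"name": {1: "Ann"}, "city": {1: "Oslo"}},
        "professional": {"role": {1: "Engineer"}},
    },
    "row2": {
        "personal": {"name": {1: "Bob"}},
        # no "professional" data for this row: nothing is stored
    },
}


def read(row_key, family, column):
    """Return the newest value of a cell, or None if the cell is absent."""
    versions = table.get(row_key, {}).get(family, {}).get(column)
    if not versions:
        return None
    return versions[max(versions)]


print(read("row1", "personal", "city"))
print(read("row2", "professional", "role"))
```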



HBase: Keys and Column Families

• Each record is divided into Column Families
• Each row has a Key
• Each column family consists of one or more Columns



• Key
– Byte array
– Serves as the primary key for the table
– Indexed for fast lookup
• Column Family
– Has a name (string)
– Contains one or more related columns
• Column
– Belongs to one column family
– Included inside the row
– familyName:columnName

Example table (figure callouts: column family named “contents”, column family named “anchor”, column named “apache.com”):

Row key            Time Stamp   Column “contents:”   Column “anchor:”
“com.apache.www”   t12          “<html>…”
                   t11          “<html>…”
                   t10                               “anchor:apache.com” = “APACHE”
“com.cnn.www”      t15                               “anchor:cnnsi.com” = “CNN”
                   t13                               “anchor:my.look.ca” = “CNN.com”
                   t6           “<html>…”
                   t5           “<html>…”
                   t3           “<html>…”
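
Since row keys are byte arrays and columns are written as familyName:columnName, both can be handled with ordinary byte-string operations. The short Python sketch below uses the row keys and one column name from the example; the helper `split_column` is a hypothetical name, and sorting the keys byte by byte is shown here as an assumption about how a byte-array key is ordered.

```python
def split_column(name: bytes):
    """Split b'family:qualifier' at the first colon."""
    family, _, qualifier = name.partition(b":")
    return family, qualifier


# Row keys are plain byte arrays, so they compare byte by byte.
row_keys = [b"com.cnn.www", b"com.apache.www"]
ordered = sorted(row_keys)

family, qualifier = split_column(b"anchor:apache.com")
print(ordered)
print(family, qualifier)
```

Writing the host name in reversed form, as the example does, keeps pages of the same domain next to each other in key order.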



Version number for each row

• Version Number
– Unique within each key
– By default, the system’s timestamp
– Data type is Long

• Value (Cell)
– Byte array

(Figure: the example table from the previous slide, repeated with a “value” callout on a cell.)



Notes on Data Model

• HBase schema consists of several Tables

• Each table consists of a set of Column Families
– Columns are not part of the schema

• HBase has Dynamic Columns
– Because column names are encoded inside the cells
– Different cells can have different columns
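
A small sketch may make "columns are not part of the schema" more concrete. In the toy Python class below (the names `Table`, `put` and `columns` are hypothetical, not HBase API), only the column families are fixed when the table is created; any column name is accepted at write time and stored next to its value.

```python
class Table:
    """Toy table whose schema lists column families only."""

    def __init__(self, families):
        self.families = set(families)   # fixed when the table is created
        self.rows = {}                  # row key -> {(family, column): value}

    def put(self, row_key, family, column, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Any column name is accepted; it is stored alongside the value.
        self.rows.setdefault(row_key, {})[(family, column)] = value

    def columns(self, row_key):
        return sorted(self.rows.get(row_key, {}))


t = Table(families=["anchor", "contents"])
t.put("com.cnn.www", "anchor", "cnnsi.com", "CNN")
t.put("com.cnn.www", "anchor", "my.look.ca", "CNN.com")
t.put("com.apache.www", "anchor", "apache.com", "APACHE")
print(t.columns("com.cnn.www"))
print(t.columns("com.apache.www"))
```

The two rows end up with different columns inside the same family, while a write to an undeclared family is rejected.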
Notes on Data Model (Cont’d)

• The version number can be user-supplied
– It does not even have to be inserted in increasing order
– Version numbers are unique within each key

• Table can be very sparse
– Many cells are empty

• Keys are indexed as the primary key
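
User-supplied version numbers can be sketched as follows. This is a simplified model for a single (row key, column) pair; the function names are hypothetical, and treating a second write with the same version number as a replacement is an assumption that follows from "unique within each key".

```python
cell = {}  # version number -> value, for one (row key, column)


def put(version, value):
    """Store a value under an explicit version; same version replaces."""
    cell[version] = value


def get_latest():
    """The newest version is the largest number, not the last one written."""
    return cell[max(cell)] if cell else None


put(30, "third")
put(10, "first")    # written later, but an older version
put(20, "second")
put(10, "first-b")  # version 10 already exists for this key, so it is replaced

print(get_latest())
print(sorted(cell))
```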



Storage Mechanism in HBase

• HBase is a column-oriented database and the tables in it are sorted by row. The table
schema defines only column families, which contain the key-value pairs.
• Each cell value of the table has a timestamp. In short, in HBase:
– Table is a collection of rows.
– Row is a collection of column families.
– Column family is a collection of columns.
– Column is a collection of key-value pairs.
HBase Physical Model

• Each column family is stored in a separate file (called HTables on these slides; HBase itself calls its on-disk files HFiles)
• Key & version numbers are replicated with each column family
• Empty cells are not stored

HBase maintains a multi-level index on values:
<key, column family, column name, timestamp>
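
The physical layout can be approximated in a few lines of Python. The sketch below groups a handful of made-up cells into one sorted list per column family; the variable names and the sample data are assumptions, and ordering timestamps newest first is a choice made for this illustration.

```python
from collections import defaultdict

# Logical cells: (row key, family, column, timestamp, value) from a made-up table.
cells = [
    ("row2", "info", "name", 1, "Bob"),
    ("row1", "info", "name", 2, "Anna"),
    ("row1", "info", "name", 1, "Ann"),
    ("row1", "stats", "visits", 1, "7"),
    # row2 has no "stats" cells, so nothing is written for them
]

store_files = defaultdict(list)  # one sorted list per column family
for row, family, column, ts, value in cells:
    # The row key and timestamp are repeated in every stored entry.
    store_files[family].append(((row, column, -ts), value))

for entries in store_files.values():
    entries.sort()  # row key, then column, then newest timestamp first

for family, entries in sorted(store_files.items()):
    print(family, entries)
```

Each stored entry carries its full coordinates, which is why sparse tables cost nothing for their empty cells but repeat the key in every family.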



HBase Regions

• Each HTable (column family) is partitioned horizontally into regions
– Regions are the counterpart to HDFS blocks

(Figure callout: each horizontal slice of rows will be one region.)
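
Horizontal partitioning by row key can be sketched as a lookup over sorted split points. In the Python example below the split points, the region names and the function `region_for` are all hypothetical; the sketch only shows how a row key maps to exactly one key range.

```python
import bisect

# Hypothetical split points: region i holds keys in [start_keys[i], start_keys[i+1]).
start_keys = ["", "g", "n", "t"]          # "" means "from the very first key"
region_names = ["region-0", "region-1", "region-2", "region-3"]


def region_for(row_key):
    """Find the region whose key range contains row_key."""
    index = bisect.bisect_right(start_keys, row_key) - 1
    return region_names[index]


for key in ["apple", "g", "mango", "zebra"]:
    print(key, "->", region_for(key))
```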



Three Major Components

• The HBaseMaster
– One master

• The HRegionServer
– Many region servers

• The HBase client



HBase Components

• Region
– A subset of a table’s rows, like horizontal range
partitioning
– Automatically done
• RegionServer (many slaves)
– Manages data regions
– Serves data for reads and writes (using a log)
• Master
– Responsible for coordinating the slaves
– Assigns regions, detects failures
– Admin functions
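
The division of work between the master and the region servers can be sketched as a small assignment table. The Python class below is a toy model: the class name `Master`, the server and region names, and the round-robin placement are assumptions for illustration, not a description of how HBase actually balances regions.

```python
from itertools import cycle


class Master:
    """Toy master: keeps the region -> server assignment table."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.assignment = {}

    def assign(self, regions):
        """Spread regions over the live servers in round-robin order."""
        targets = cycle(self.servers)
        for region in regions:
            self.assignment[region] = next(targets)

    def handle_failure(self, dead_server):
        """Move every region of a failed server to the remaining servers."""
        self.servers.remove(dead_server)
        orphaned = [r for r, s in self.assignment.items() if s == dead_server]
        self.assign(orphaned)
        return orphaned


master = Master(["rs1", "rs2", "rs3"])
master.assign(["r0", "r1", "r2", "r3"])
moved = master.handle_failure("rs2")
print(moved)
print(master.assignment)
```

The master never serves reads or writes itself in this model; it only decides which region server is responsible for which region.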

Architecture



ZooKeeper

• ZooKeeper acts as a coordinator inside the HBase distributed
environment. It helps maintain server state inside the cluster
by communicating through sessions.
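
How session-based tracking of server state might work can be shown with a toy coordinator. The Python sketch below is not the ZooKeeper API: the class `Coordinator`, the heartbeat mechanism and the timeout value are assumptions, and time is passed in explicitly so the example stays deterministic.

```python
class Coordinator:
    """Toy coordinator: a server counts as alive while its session is fresh."""

    def __init__(self, session_timeout):
        self.session_timeout = session_timeout
        self.last_heartbeat = {}   # server name -> time of last heartbeat

    def heartbeat(self, server, now):
        self.last_heartbeat[server] = now

    def expire_sessions(self, now):
        """Drop servers whose session timed out and return their names."""
        expired = sorted(
            s for s, t in self.last_heartbeat.items()
            if now - t > self.session_timeout
        )
        for server in expired:
            del self.last_heartbeat[server]
        return expired

    def live_servers(self):
        return sorted(self.last_heartbeat)


zk = Coordinator(session_timeout=3)
zk.heartbeat("rs1", now=0)
zk.heartbeat("rs2", now=0)
zk.heartbeat("rs1", now=4)        # rs1 keeps its session alive
expired = zk.expire_sessions(now=5)
print(expired, zk.live_servers())
```

A server that stops sending heartbeats loses its session, which is the signal the rest of the cluster can use to treat it as failed.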



Where to Use HBase

• Apache HBase is used when you need random, real-time read/write access to Big Data.
• It hosts very large tables on top of clusters of commodity hardware.
• Apache HBase is a non-relational database modeled after Google's Bigtable.
Bigtable runs on top of the Google File System; likewise, Apache HBase works on top of
Hadoop and HDFS.
Quick Review Question

• What are the key components of HBase?
• What is the role of ZooKeeper in HBase?



Summary of Main Teaching Points

• Overview and the history of HBase
• Introduction to the HBase architecture
• What is not HBase



Question and Answer Session



What we will cover next

• Spark Architecture
– Spark Core
– Resilient Distributed Dataset (RDD)
– Programming Languages Supported by Spark
