
Chapter 3 NoSQL


Introduction to NoSQL


NoSQL business drivers


NoSQL data architecture patterns


NoSQL to manage Big Data


HBase overview
NoSQL
(Not Only SQL database)
What is NoSQL?

• It is a database management system that provides mechanisms for the storage and retrieval of massive amounts of unstructured data in a distributed environment on virtual servers, with a focus on high scalability, performance, availability and agility.

• NoSQL was developed in response to the large volume of data stored about users, objects and products that needs to be frequently accessed and processed.


✔ It is a next-generation database which is completely different from traditional databases
✔ NoSQL stands for "not only SQL"; that is, SQL as well as other query languages can be used with NoSQL databases
✔ It is a non-relational database that does not require a fixed schema, avoids joins, and is easy to scale
✔ Uses a distributed architecture and works on multiple processors to give high performance
✔ Many open source NoSQL databases are available
✔ Data files can be easily replicated
✔ It uses simple APIs
✔ It can manage huge amounts of data
✔ It can be implemented on commodity hardware which has separate RAM and disk
✔ Supports online analytical processing (OLAP)
Why NoSQL?

• The system response time becomes slow when an RDBMS is used for massive volumes of data.

• To resolve this problem, we could "scale up" our systems by upgrading the existing hardware, but this process is expensive.

• The alternative is to distribute the database load across multiple hosts as the load increases. This method is known as "scaling out."
Features of NoSQL
Non-relational
✔ NoSQL databases never follow the relational model
✔ Never provide tables with flat fixed-column records
✔ Work with self-contained aggregates
✔ Do not require object-relational mapping or data normalization
✔ No complex features like query languages, query planners, referential integrity, joins, ACID (Atomicity, Consistency, Isolation, Durability)

Schema-free
✔ NoSQL databases are either schema-free or have relaxed schemas
✔ Do not require any sort of definition of the schema of the data
✔ Offer heterogeneous structures of data in the same domain

Simple API
✔ Offer easy-to-use interfaces for storing and querying data
✔ APIs allow low-level data manipulation and selection methods
✔ Text-based protocols are mostly used, with HTTP REST (Representational State Transfer) and JSON (JavaScript Object Notation)
✔ Mostly no standards-based query language is used
✔ Web-enabled databases run as internet-facing services

Distributed
✔ Multiple NoSQL databases can be executed in a distributed fashion
✔ Offer auto-scaling and fail-over capabilities
✔ The ACID concept is often sacrificed for scalability and throughput
✔ Mostly no synchronous replication between distributed nodes; instead asynchronous multi-master replication, peer-to-peer, or HDFS replication
✔ Often provide only eventual consistency
✔ Shared-nothing architecture, which enables less coordination and higher distribution


What is the CAP Theorem?
✔ The CAP theorem is also called Brewer's theorem.
✔ It states that it is impossible for a distributed data store to offer more than two out of three guarantees:

– Consistency: The data should remain consistent even after the execution of an operation. It guarantees that all storage nodes and their replicas have the same data at the same time.

– Availability: The database should always be available and responsive. That is, every request is guaranteed to receive a success or failure response.

– Partition Tolerance: The system should continue to function even if the communication among the servers is not stable, in spite of arbitrary partitioning due to network failures.
BASE Properties of NoSQL Database

Question: Explain the BASE properties of a NoSQL database.

BASE: Basically Available, Soft state, Eventual consistency

✔ Basically Available means the DB is available all the time, as per the CAP theorem; that is, every request is guaranteed to receive a success or failure response.

✔ Soft state means that even without an input, the system state may change.

✔ Eventual consistency means that the system will become consistent over time. Copies of data are kept on multiple machines to get high availability and scalability; thus, changes made to any data item on one machine have to be propagated to the other replicas.
Types of NoSQL Databases

Q. List and explain the types of NoSQL databases, with an example of each.


• There are mainly four categories of NoSQL databases.
1. Key-value pair based (key-value stores)

2. Column-oriented (column family stores)

3. Graph based (graph stores)

4. Document-oriented (document stores)


1. Key Value Pair Based
– Data is stored in key/value pairs. It is designed to handle lots of data and heavy load.

– It stores data as a hash table where each key is unique, and the value can be JSON, a BLOB (Binary Large Object), a string, etc.

– For example, a key-value pair may contain a key like "Website" associated with a value like "amazon".

– Eg. Redis, Dynamo and Riak are some examples of key-value store databases. They are all based on Amazon's Dynamo paper.

- It uses a hash table with a unique key and a pointer to the particular item of data.

- A bucket is a logical group (not physical) of keys, so different buckets can have identical keys.

- The real key is a hash of (Bucket + Key).

Client-side operations to read/write values using keys (see the sketch below):

1. To fetch the value associated with a key, use Get(key)

2. To associate a value with a key, use Put(key, value)

3. To fetch the list of values associated with a list of keys, use Multi-get(key1, key2, ..., keyN)

4. To remove the entry for a key from the data store, use Delete(key)
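
A minimal sketch of these four operations in Java, assuming a hypothetical in-memory key-value client (the class and method names are illustrative, not a specific product's API):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical key-value store client: keys are unique strings,
// values are opaque (plain strings here for brevity).
class KeyValueStore {
    private final Map<String, String> table = new HashMap<>();

    // Get(key): fetch the value associated with the key
    String get(String key) { return table.get(key); }

    // Put(key, value): associate the value with the key (insert or overwrite)
    void put(String key, String value) { table.put(key, value); }

    // Multi-get(key1..keyN): fetch the values for a list of keys
    List<String> multiGet(List<String> keys) {
        List<String> values = new ArrayList<>();
        for (String k : keys) values.add(table.get(k));
        return values;
    }

    // Delete(key): remove the entry for the key from the data store
    void delete(String key) { table.remove(key); }
}

public class KeyValueDemo {
    public static void main(String[] args) {
        KeyValueStore store = new KeyValueStore();
        store.put("Website", "amazon");                                     // Put
        System.out.println(store.get("Website"));                           // Get -> amazon
        System.out.println(store.multiGet(List.of("Website", "missing"))); // [amazon, null]
        store.delete("Website");                                            // Delete
    }
}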
⮚ Rules for accessing data using Key-Value:

- Distinct keys: all keys in a key-value store are unique.

- No queries on values: no queries can be performed on the value of the table.

⮚ Weaknesses of key-value stores:

- Due to the lack of consistency, they can't be used for updating part of a value or for querying the database; i.e., they can't provide traditional database capabilities.

- It becomes difficult to maintain unique values as keys if the volume of data increases.
⮚ Column-based
– It works on columns and is based on the BigTable paper by Google.

– Every column is treated separately. Values of a single column are stored contiguously.

– A cell is identified by a row number and a column name identifier.

– They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN, etc.

– Widely used to manage data warehouses, business intelligence, CRM, and library card catalogs.

– HBase, Cassandra and Hypertable are examples of column-based databases.


For eg. in the Cassandra data model:

EmployeeIndia: {
  address: {
    city: Mumbai,
    pincode: 400058
  },
  projectDetails: {
    durationDays: 100,
    cost: 50000
  }
}

Here,
- The outermost key EmployeeIndia is analogous to a row
- "address" and "projectDetails" are called column families
- The column family "address" has columns "city" and "pincode"
- The column family "projectDetails" has columns "durationDays" and "cost"
- A column can be referenced using its column family
3. Document-Oriented

- It stores and retrieves data as key-value pairs, but the value part is stored as a document. The document is stored in JSON (JavaScript Object Notation) or XML format.

- It pairs each key with a complex data structure known as a document. Documents can contain many different key-value pairs, key-array pairs, or even nested documents.

- For eg, MongoDB, Couchbase, CouchDB. MongoDB is the most popular of these databases.

- Searching: The column and key-value types lack a formal structure and hence cannot be indexed, so searching is not possible there. This is resolved by the document store: using a single ID, a query can return any item out of the document store. This is possible because everything inside a document is automatically indexed.

- The difference between a key-value store and a document store is that a key-value store loads the entire document in the value portion into memory, whereas the document store extracts subsections of all documents.


• It uses a tree structure

• A "document path" is used like a key to access the leaf values of a document

• For eg. if the root is employee, the path can be:

• Employee[id='2003']/address/street/buildingname/text()
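
As an illustration, a document of the following shape would be addressed by that path (the element names come from the path itself; the building name value is invented for the example):

<Employee id="2003">
  <address>
    <street>
      <buildingname>Sunrise Towers</buildingname>
    </street>
  </address>
</Employee>

Here the path's text() step would return the leaf value "Sunrise Towers".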
4. Graph Based
- A graph database stores entities as well as the relations amongst those entities.

- The entity is stored as a node, with the relationship as an edge. An edge gives a relationship between nodes.

- Every node and edge has a unique identifier.

- Graph databases are mostly used for social networks, logistics, and spatial data.

- E.g. Neo4j, Infinite Graph, OrientDB, FlockDB

• It is based on graph theory.

• These databases are designed for data whose relations are well represented as a graph, with elements that are interconnected and an undetermined number of relations between them.

• They are used when a business problem involves complex relationships among its objects, especially in social networks and rule-based engines.


Type: Key-value store, a simple data storage system that uses a key to access a value
Typical usage: image stores; key-based file systems; object cache; systems designed to scale
Examples: Memcache, Redis, Riak, DynamoDB

Type: Column family store, a sparse matrix system that uses a row and column as keys
Typical usage: web crawler results; Big Data problems that can relax consistency rules
Examples: HBase, Cassandra, Hypertable

Type: Graph store, for relationship-intensive problems
Typical usage: social networks; fraud detection; relationship-heavy data
Examples: Neo4J, AllegroGraph, Bigdata (RDF store), InfiniteGraph (Objectivity)

Type: Document store, storing hierarchical data structures directly in the database
Typical usage: high-variability data; document search; integration hubs; web content management; publishing
Examples: MongoDB (10Gen), CouchDB, Couchbase, MarkLogic, eXist-db
When to use which type of NoSQL
• Key-value: when processing a constant stream of small reads and writes
• Document: for natural data modeling; programmer-friendly, rapid development, web-friendly
• Column: massive write loads, high availability, multiple data centers, MapReduce
• Graph: for graph algorithms and relations
NoSQL business drivers

The figure shows how the business drivers volume, velocity, variability, and agility apply pressure to the single-CPU system, resulting in cracks.
• Volume and velocity refer to the ability to handle large datasets that arrive quickly.

• Variability refers to how diverse data types don't fit into structured tables.

• Agility refers to how quickly an organization responds to business change.


Volume
• To improve performance, two things can be considered:

– If the key factor is only speed: use a faster processor.

– If the processing involves complex computation: a Graphics Processing Unit (GPU) can be used along with the CPU.

• This forced system designers to shift their focus from increasing speed on a single chip to using more processors working together. The need to scale out (also known as horizontal scaling), rather than scale up (faster processors), moved organizations from serial to parallel processing, where data problems are split into separate paths and sent to separate processors to divide and conquer the work.


Velocity

• Velocity comes into the picture when real-time inserts (reads and writes) into the database are made by social networking and e-commerce websites.

• For eg. a discount scheme in online shopping causes a burst of web traffic that slows down the response for every user, and tuning these systems can be costly when both high read and write throughput is desired, particularly when single-processor RDBMSs are used as the back end to a web storefront.
Variability
• Companies that want to capture and report on exception data struggle when attempting to use the rigid database schema structures imposed by RDBMSs.

• For example, if a business unit wants to capture a few custom fields for a particular customer, all customer rows within the database need to store this information even though it doesn't apply to them.

• Adding new columns to an RDBMS requires the system to be shut down and ALTER TABLE commands to be run. When a database is large, this process can impact system availability, costing time and money.


Agility

• The most complex part of building applications using RDBMSs is the process of putting data into and getting data out of the database.

• If your data has nested and repeated subgroups of data structures, you need to include an object-relational mapping layer.

• The responsibility of this layer is to generate the correct combination of INSERT, UPDATE, DELETE and SELECT SQL statements to move object data to and from the RDBMS persistence layer.

• This process is not simple and is the largest barrier to rapid change when developing new or modifying existing applications.

• Even with experienced staff, small change requests can cause slowdowns in development and testing schedules.


Three ways to share resources

• Shared RAM architecture: many CPUs access a single shared RAM over a high-speed bus. This system is ideal for large graph traversal.

• Shared disk system: processors have independent RAM but share disk using a storage area network (SAN).

• Shared-nothing architecture: the architecture used in big data solutions; cache-friendly, using low-cost commodity hardware.


Master-slave versus peer-to-peer
• master-slave configuration :
– where all incoming database requests (reads or writes) are
sent to a single master node and redistributed from there.
– The master node is called the NameNode in Hadoop.
– This node keeps a database of all the other nodes in the
cluster and the rules for distributing requests to each node.
• peer-to-peer model :
– stores all the information about the cluster on each node in
the cluster.
– If any node crashes, the other nodes can take over and
processing can continue
HBASE
HBase
• HBase is an open source, non-relational, distributed database modeled after Google's BigTable.

• It is written in Java.

• It is developed by the Apache Software Foundation and is a part of the Apache Hadoop project.

• HBase runs on top of HDFS (Hadoop Distributed File System), providing BigTable-like capabilities for Hadoop.

• HBase is a key/value store.

• HBase is specifically a sparse, distributed, multi-dimensional, sorted and consistent map.
• HBase can be used in the following scenarios:
– Huge Data

– Fast Random Access

– Structured Data

– Variable Schema

– Need of Compression

• HBase is a column-oriented non-relational database management system.
• Basics of HBase:
– Rowkey

– Column Family

– Column

– Timestamp

⮚ An HBase table contains column families, which are the logical and physical grouping of columns.

⮚ Column families contain columns with time-stamped versions.

⮚ Columns only exist when they are inserted.

⮚ All column members of the same column family have the same column family prefix. Each column value is identified by a key.

⮚ The row key is the implicit primary key. The rows are sorted by the row key.
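
⮚ For example (illustrative, reusing the earlier Cassandra-style data): the coordinate (row key 'EmployeeIndia', column 'address:city', timestamp t1) identifies one time-stamped version of the value 'Mumbai'. The address order is always row key, then column family:qualifier, then timestamp.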
• An HBase system is designed to scale linearly.
• It comprises a set of standard tables with rows and columns,
much like a traditional database.
• Each table must have an element defined as a primary key,
and all access attempts to HBase tables must use this primary
key.
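
As a hedged sketch of this row-key-only access using the standard HBase Java client API (the table name 'education' and the column family/qualifier are assumptions carried over from the shell examples later in this chapter):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("education"))) {
            // The row key is the only access path: every Get is built from it.
            Get get = new Get(Bytes.toBytes("r1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("studentplacement"),
                                           Bytes.toBytes("c1"));
            System.out.println(value == null ? "no cell" : Bytes.toString(value));
        }
    }
}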
HBase Architecture: HBase Data Model
Q. Write a note on the HBase data model

• HBase is a column-oriented NoSQL database.

Difference between column-oriented and row-oriented databases:

– Row-oriented databases store table records in a sequence of rows, whereas column-oriented databases store table records in a sequence of columns, i.e. the entries in a column are stored in contiguous locations on disk.


For eg. consider the table below:

Customer ID | Name        | Address | Product ID | Product Name
1           | Paul Walker | US      | 231        | Gallardo
2           | Vin Diesel  | Brazil  | 520        | Mustang

If this table is stored in a row-oriented database, it will store the records as shown below:

1, Paul Walker, US, 231, Gallardo
2, Vin Diesel, Brazil, 520, Mustang

• In row-oriented databases, data is stored on the basis of rows or tuples, as you can see above.

While a column-oriented database stores this data as:

1, 2, Paul Walker, Vin Diesel, US, Brazil, 231, 520, Gallardo, Mustang

• In a column-oriented database, all the column values are stored together: the first column's values are stored together, then the second column's values are stored together, and the data in the other columns is stored in a similar manner.
• An HBase table has the following components, shown in the image below:
• Tables: Data is stored in a table format in HBase, but here tables are in column-oriented format.
• Row Key: Row keys are used to search records; they make searches fast.
• Column Families: Various columns are combined in a column family. These column families are stored together, which makes the searching process faster because data belonging to the same column family can be accessed together in a single seek.
• Column Qualifiers: Each column's name is known as its column qualifier.
• Cell: Data is stored in cells. The data is dumped into cells, which are specifically identified by rowkey and column qualifier.
• Timestamp: A timestamp is a combination of date and time. Whenever data is stored, it is stored with its timestamp. This makes it easy to search for a particular version of the data.
• HBase consists of:
– Set of tables
– Each table with column families and rows
– Row key acts as a Primary key in HBase.
– Any access to HBase tables uses this Primary Key
– Each column qualifier present in HBase denotes an attribute corresponding to the object which resides in the cell.
HBase Architecture: Components of HBase Architecture

• The HBase architecture has 3 important components: HMaster, Region Server and ZooKeeper.

• All these servers (HMaster, Region Server, ZooKeeper) are placed to coordinate and manage regions and perform various operations inside the regions.


Region Server
• A region server is a process which handles read, write, update and delete requests from clients.
• It runs on every node in a Hadoop cluster, that is, on each HDFS DataNode.
• HBase is a column-oriented database management system that runs on top of HDFS.
• It suits sparse data sets, which are common in Big Data use cases.
• HBase supports writing applications in Apache Avro, REST and Thrift.
• Apache HBase has low-latency storage; enterprises use it for real-time analysis.
• The design of HBase is such that it can contain many tables. Each of these tables must have a primary key.
• MemStore: It is HBase's implementation of an in-memory data cache. It helps to increase performance by serving as much data as possible directly from memory.
• WAL: The write-ahead log records all changes to the data, which is useful for recovering everything if a server crashes. If writing the record to the WAL fails, the whole operation must be considered a failure.
• HFile: It is a specialized HDFS file format for HBase. The HFile implementation in a region server is responsible for reading and writing HFiles to and from HDFS.
• ZooKeeper: A distributed HBase instance depends on a running ZooKeeper cluster.

• All participating nodes and clients must be able to access the running ZooKeeper instances.

• By default, HBase manages a ZooKeeper cluster, starting and stopping the ZooKeeper processes as part of the HBase start and stop process.
1) HMaster: The HBase HMaster is a lightweight process that assigns regions to region servers in the Hadoop cluster for load balancing.

It handles a collection of Region Servers which reside on DataNodes.

Responsibilities of HMaster –

• Manages and monitors the Hadoop cluster.

• Performs DDL operations (create and delete tables) and assigns regions to the Region Servers.

• It coordinates and manages the Region Servers (similar to how the NameNode manages DataNodes in HDFS).

• Whenever a client wants to change the schema or any of the metadata, HMaster is responsible for these operations.

• It assigns regions to the Region Servers on startup and re-assigns regions to Region Servers during recovery and load balancing.

• Controlling the failover: it monitors all the Region Server instances in the cluster (with the help of ZooKeeper) and performs recovery activities whenever any Region Server is down.

• It provides an interface for creating, deleting and updating tables.

HBase Master

The HBase Master performs the following functions:

– Maintains and monitors the Hadoop cluster.

– Performs administration of the database.

– Controls the failover.

– HMaster handles DDL (Data Definition Language) operations such as create and delete.
2) Region Server:

• Region Servers are the worker nodes which handle read, write, update, and delete requests from clients.

• A Region Server maintains various regions running on top of HDFS.

• The process runs on every node in the Hadoop cluster, on the HDFS DataNode.

A Region Server consists of the following components –

– Block Cache

– MemStore

– Write Ahead Log (WAL)

– HFile
• Region
– It contains all the rows between the start key and the end key assigned to that region.
– HBase tables can be divided into a number of regions in such a way that all the columns of a column family are stored in one region.
– Each region contains the rows in sorted order.
– Many regions are assigned to a Region Server, which is responsible for handling, managing, and executing read and write operations on that set of regions.
• Region
– So, concluding in a simpler way:
• A table can be divided into a number of regions.
• A region is a sorted range of rows storing data between a start key and an end key.
• It has a default size of 256 MB, which can be configured according to need.
• A group of regions is served to the clients by a Region Server.
• A Region Server can serve approximately 1000 regions to the clients.
Region Server components –

• Block Cache –

– It resides at the top of the Region Server.

– This is the read cache.

– The most frequently read data is stored in the read cache, and whenever the block cache is full, the least recently used data is evicted.

• MemStore –

– It is the write cache.

– It stores all the incoming data before committing it to the disk or permanent memory.

– There is one MemStore for each column family in a region; thus there are multiple MemStores for a region, because each region can contain multiple column families.
Region Server components –

• Write Ahead Log (WAL):

– It is a file attached to every Region Server inside the distributed environment.

– The WAL stores the new data that hasn't yet been persisted or committed to permanent storage.

– It is used to recover the data sets in case of failure.

• HFile:

– It is the actual storage file that stores the rows as sorted key-values on disk.

– HFiles are stored on HDFS; thus they store the actual cells on disk.

– The MemStore commits the data to an HFile when the size of the MemStore exceeds its threshold.
3) ZooKeeper:

– It acts as a coordinator inside the HBase distributed environment.

– It helps in maintaining server state inside the cluster by communicating through sessions.

– It is a centralized monitoring server that maintains configuration information and provides distributed synchronization.

– Whenever a client wants to communicate with regions, it has to approach ZooKeeper first.

– HMaster and the Region Servers are registered with the ZooKeeper service; a client needs to access the ZooKeeper quorum in order to connect with the Region Servers and HMaster.

– The ZooKeeper service keeps track of all the region servers in an HBase cluster, tracking information about how many region servers are alive and available.
• Various services that ZooKeeper provides include –

– Establishing client communication with region servers.

– Tracking server failures and network partitions.

– Maintaining configuration information.

– Providing ephemeral nodes, which represent different region servers.
• Every Region Server, along with the HMaster server, sends a continuous heartbeat at regular intervals to ZooKeeper, which checks which servers are alive and available, as mentioned in the above image. It also provides server failure notifications so that recovery measures can be executed.

• There is an inactive server which acts as a backup for the active server. If the active server fails, it comes to the rescue.

• If a Region Server fails to send a heartbeat, its session expires and all listeners are notified about it. Then HMaster performs suitable recovery actions, which we will discuss later.

• ZooKeeper also maintains the .META server's path, which helps any client in searching for any region. The client first has to check with the .META server in which Region Server a region resides, and it gets the path of that Region Server.
• The META table is a special HBase catalog table.

• It maintains a list of all the Region Servers in the HBase storage system.

• The .META file maintains the table in the form of keys and values:

✔ The key represents the start key of the region and its id

✔ The value contains the path of the Region Server


HBase Architecture:
How Search Initializes in HBase

• ZooKeeper stores the META table location. Whenever a client approaches with a read or write request to HBase, the following operations occur:

– The client retrieves the location of the META table from ZooKeeper.

– The client then requests the location of the Region Server for the corresponding row key from the META table. The client caches this information along with the location of the META table.

– Then it gets the row location by requesting it from the corresponding Region Server.
• For future references, the client uses its cache to retrieve the location of the META table and the previously read row key's Region Server. The client will not refer to the META table again, until and unless there is a miss because the region has shifted or moved; then it will again request the META server and update the cache.

• As clients do not waste time retrieving the location of the Region Server from the META server every time, this saves time and makes the search process faster.
HBase Architecture: HBase Write Mechanism

Step 1: Whenever the client has a write request, the client writes the data to the WAL (Write Ahead Log). The edits are appended at the end of the WAL file. This WAL file is maintained in every Region Server, and the Region Server uses it to recover data which has not been committed to the disk.

Step 2: Once data is written to the WAL, it is copied to the MemStore.

Step 3: Once the data is placed in the MemStore, the client receives the acknowledgment.

Step 4: When the MemStore reaches its threshold, it dumps or commits the data into an HFile.
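
From the client's point of view, this whole sequence sits behind a single put call. Below is a hedged sketch using the standard HBase Java client API (the table and column names are assumptions reused from the shell examples later in this chapter); the WAL append, MemStore update, and acknowledgment of Steps 1-3 all happen server-side before put() returns:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WritePath {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("education"))) {
            Put put = new Put(Bytes.toBytes("r1"));              // row key
            put.addColumn(Bytes.toBytes("studentplacement"),     // column family
                          Bytes.toBytes("c1"),                   // column qualifier
                          Bytes.toBytes("some value"));          // cell value
            // Server side: append to WAL (Step 1), update MemStore (Step 2),
            // then acknowledge (Step 3). The MemStore flush to an HFile
            // (Step 4) happens later, asynchronously.
            table.put(put);
        }
    }
}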
HBase Architecture: HBase Write Mechanism - MemStore

• The MemStore always holds the data stored in it in lexicographical order (sequentially, in a dictionary manner) as sorted KeyValues.

• There is one MemStore for each column family, and thus the updates are stored in a sorted manner per column family.

• When the MemStore reaches its threshold, it dumps all the data into a new HFile in a sorted manner. This HFile is stored in HDFS. HBase contains multiple HFiles for each column family.

• Over time, the number of HFiles grows as the MemStore dumps its data.

• The MemStore also saves the last written sequence number, so the Master Server and MemStore both know what has been committed so far and where to start from. When a region starts up, the last sequence number is read, and new edits start from that number.


HBase Architecture: HBase Write Mechanism - HFile
• The HFile is the main persistent storage in the HBase architecture. Eventually, all the data is committed to HFiles, the permanent storage of HBase.
• The writes are placed sequentially on the disk, so the movement of the disk's read-write head is very small. This makes the write and search mechanisms very fast.
• The HFile indexes are loaded into memory whenever an HFile is opened. This helps in finding a record in a single seek.
• The trailer is a pointer which points to the HFile's meta block. It is written at the end of the committed file. It contains information about the timestamps and bloom filters.
• A Bloom Filter helps in searching key-value pairs: it skips files which do not contain the required rowkey. The timestamp also helps in searching for a particular version of the data.
HBase Architecture: Read Mechanism

• First, the client retrieves the location of the Region Server from the .META server if the client does not have it in its cache. Then the read goes through the following sequential steps:

• For reading the data, the scanner first looks for the row cell in the block cache, where all the recently read key-value pairs are stored.

• If the scanner fails to find the required result, it moves to the MemStore, which, as we know, is the write cache. There, it searches for the most recently written data, which has not yet been dumped into an HFile.

• At last, it uses the bloom filters and the block cache to load the data from the HFiles.
HBase Architecture: Compaction

• HBase combines HFiles to reduce storage and to reduce the number of disk seeks needed for a read. This process is called compaction.

• Compaction chooses some HFiles from a region and combines them.

• But during this process, disk input-output and network traffic might get congested. This is known as write amplification, so compaction is generally scheduled during low peak load timings.

• There are two types of compaction:

– Minor Compaction

– Major Compaction
HBase Architecture: Compaction

• Minor Compaction: HBase automatically picks smaller HFiles and recommits them to bigger HFiles, as shown in the above image. It performs a merge sort to commit the smaller HFiles into bigger HFiles. This helps in storage space optimization.

• Major Compaction: As illustrated in the above image, in major compaction HBase merges and recommits the smaller HFiles of a region into a new HFile. In this process, the same column families are placed together in the new HFile. It drops deleted and expired cells, and it increases read performance.


HBase Architecture: Region Split

• Whenever a region becomes large, it is divided into two child regions, as shown in the above figure.
• Each child region represents exactly half of the parent region.
• The split is then reported to the HMaster. The child regions are handled by the same Region Server until the HMaster allocates them to a new Region Server for load balancing.
HBase Architecture: HBase Crash and Data Recovery

• Whenever a Region Server fails, ZooKeeper notifies the HMaster about the failure.

• Then the HMaster distributes and allocates the regions of the crashed Region Server to the active Region Servers. To recover the data in the MemStore of the failed Region Server, the HMaster distributes the WAL to all the Region Servers.

• Each Region Server re-executes the WAL to rebuild the MemStore for the failed region's column families.

• The data is written in chronological order (in a timely order) in the WAL. Therefore, re-executing the WAL means making all the changes that were made and stored in the MemStore.

• So, after all the Region Servers execute the WAL, the MemStore data for all column families is recovered.


Comparison of HBase vs RDBMS

1. HBase is a column-oriented database. An RDBMS is a row-oriented database.
2. The schema of HBase is less restrictive; adding columns on the fly is possible. The schema of an RDBMS is more restrictive.
3. HBase is good with sparse tables. An RDBMS is not optimized for sparse tables.
4. HBase supports scale-out: when we need more memory, processing power or disk, we add new servers to the cluster rather than upgrading the present one. An RDBMS supports scale-up: we upgrade the same server to a more powerful one rather than adding new servers.
5. In HBase, the amount of data does not depend on a particular machine but on the number of machines. In an RDBMS, the amount of data depends on the configuration of the server.
6. HBase follows the CAP (Consistency, Availability, Partition-tolerance) theorem. An RDBMS has the ACID (Atomicity, Consistency, Isolation, Durability) properties.
7. HBase supports both structured and unstructured data. An RDBMS is suited for structured data.
8. HBase gives no transaction guarantee. An RDBMS mostly guarantees transaction integrity.
9. HBase does not support JOINs. An RDBMS supports JOINs.
10. HBase has no built-in support for referential integrity. An RDBMS supports referential integrity.
11. HBase has no query language. An RDBMS uses SQL.
12. In HBase you de-normalize your data. In an RDBMS you normalize as much as you can.
13. HBase offers horizontal scalability: just add hardware. An RDBMS is hard to shard and scale.
14. HBase gives faster retrieval of data. An RDBMS gives slower retrieval of data.
15. HBase is dynamic in nature. An RDBMS is static in nature.
HBase Shell Commands (self study)
• General commands
• Data Definition Language (table management commands)
• Data manipulation commands
• Cluster replication commands

HBase Shell Commands
General commands
i. status: This command provides the status of HBase, like the number of servers, the active server count, and the average load value. The parameter can be 'summary', 'simple', or 'detailed'; the default is 'summary'.

hbase> status
hbase> status 'simple'
hbase> status 'summary'
hbase> status 'detailed'

ii. version: It shows the version of HBase being used.

hbase> version

iii. table_help: This command provides help for table-reference commands such as scan, put, get, disable, drop, etc.

Syntax: table_help

iv. whoami: It shows information about the current user and the groups present in HBase.
HBase Shell Commands

Data Definition Language (table management commands)

The commands which operate on the tables in HBase are Data Definition Language commands.

• create: This command creates a table.

Syntax: create <tablename>, <columnfamilyname>

For eg. create 'education', 'studentplacement'

• list: It lists all the tables that are present or created in HBase.

Syntax: list
HBase Shell Commands

Data Definition Language (table management commands)

• describe: It gives information about the table name with its column families, associated filters, versions and some more details.

Syntax: describe <tablename>

For eg. describe 'education'

• disable: This command starts disabling the named table. If a table needs to be deleted or dropped, it has to be disabled first.

Syntax: disable <tablename>


HBase Shell Commands

Data Definition Language (table management commands)

• is_disabled: It verifies whether a table is disabled.

For eg. is_disabled 'education'

• enable: This command enables a table. If a table was disabled in the first instance and not deleted or dropped, and we want to re-use it, then we have to enable it by using this command.

Syntax: enable <tablename>

For eg. enable 'education'

• is_enabled: It verifies whether a table is enabled or not.


HBase Shell Commands

Data Definition Language (table management commands)

• disable_all: It disables all the tables matching the given regex. Once a table is disabled, the user is able to delete it from HBase. Before deleting or dropping a table, it must be disabled first.

Syntax: disable_all <"matching regex">

• show_filters: This command displays all the filters present in HBase, like ColumnPrefixFilter, TimestampsFilter, PageFilter, FamilyFilter, etc.

Syntax: show_filters
HBase Shell Commands

Data Definition Language (table management commands)

• drop: To drop or delete a table, the table should first be disabled using the disable command.

Syntax: drop <tablename>

For eg. drop 'education'

Before executing this command, it is necessary that you disable the table 'education'.

• drop_all: It drops all the tables matching the given regex. The tables have to be disabled first, using disable_all, before executing this command. Tables with names matching the regex will be dropped from HBase.

Syntax: drop_all <"regex">
HBase Shell Commands

Data Definition Language (table management commands)

• alter: It alters the column family schema.

Syntax: alter <tablename>, NAME => <columnfamilyname>, VERSIONS => 5

– Altering single or multiple column family names

For eg. to change or add the 'placementrecords' column family in table 'education', keeping a maximum of 5 cell VERSIONS:

hbase> alter 'education', NAME => 'placementrecords', VERSIONS => 5

'education' is the table name created previously with the column family 'placement'. Here, with the help of the alter command, we change the column family schema from 'placement' to 'placementrecords'.
HBase Shell Commands

Data Definition Language (table management commands)

– Deleting column family names from a table using the alter command

To delete the column family 'placementrecords' that we created in the previous step:

For eg. hbase> alter 'education', 'delete' => 'placementrecords'

To delete the 'f1' column family from table 'education':

For eg. hbase> alter 'education', NAME => 'f1', METHOD => 'delete'
HBase Shell Commands

Data Definition Language (table management commands)

– Several other operations using scope attributes with a table via the alter command

• To change a table-scope attribute:

Syntax: alter <'tablename'>, MAX_FILESIZE => '132545224'

For eg. alter 'education', MAX_FILESIZE => '132545224'

You can change table-scope attributes like MAX_FILESIZE (in terms of memory, in bytes), READONLY, MEMSTORE_FLUSHSIZE, DEFERRED_LOG_FLUSH, etc. These can be put at the end of the command.

• To remove a table-scope attribute, use the table_att_unset method:

For eg. alter 'education', METHOD => 'table_att_unset', NAME => 'MAX_FILESIZE'

This simply unsets the MAX_FILESIZE attribute from the 'education' table.


HBase Shell Commands
Data Definition Language (table management commands)
• alter_status: It gives the status of the alter command, indicating the number of regions of the table that have received the updated schema of the passed table name.
Syntax: alter_status 'tablename'
For eg. alter_status 'education'
HBase Shell Commands

Data manipulation commands

• count: Counts and returns the number of rows in a table.

Syntax: count <'tablename'>

For eg. hbase> count 'education'

• put: Puts a cell value at a specified column in a specified row (see the example below).

Syntax: put <'tablename'>, <'rowname'>, <'columnname'>, <'value'>
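
For eg. (assuming the 'education' table created earlier with column family 'studentplacement'; the qualifier and value are illustrative):

hbase> put 'education', 'r1', 'studentplacement:c1', 'some value'

This stores 'some value' in row 'r1' under column family 'studentplacement', qualifier 'c1'; it can be read back with the get command shown next.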
HBase Shell Commands

Data manipulation commands

• get: Fetches the contents of a row or a cell. You can also add additional parameters to it, like TIMESTAMP, TIMERANGE, VERSIONS, FILTERS, etc., to get a particular row or cell's content.

Syntax: get <'tablename'>, <'rowname'>, {<additional parameters>}

For eg. hbase> get 'education', 'r1', {COLUMN => 'c1'}

Row r1 and column c1 values will be displayed using this command.

For eg. hbase> get 'education', 'r1'

Row r1 values will be displayed from table education.

For eg. hbase> get 'education', 'r1', {TIMERANGE => [ts1, ts2]}

Row r1 values in the time range ts1 to ts2 will be displayed from education.

For eg. hbase> get 'education', 'r1', {COLUMN => ['c1', 'c2', 'c3']}
HBase Shell Commands

Data manipulation commands

• truncate: After truncating an HBase table, the schema will remain but not the records.

Syntax: truncate <tablename>

This command performs 3 functions, listed below:
• Disables the table if it is present
• Drops the table if it is present
• Recreates the mentioned table
HBase Shell Commands

Data manipulation commands

• delete: Deletes a cell value at a defined row and column of a table.

Syntax: delete <'tablename'>, <'rowname'>, <'columnname'>

For eg. delete 'education', 'r1', 'c1'

This will delete the cell at row r1, column c1 in table 'education'.

• deleteall: Deletes all cells in a given row.

Syntax: deleteall <'tablename'>, <'rowname'>

For eg. hbase> deleteall 'guru99', 'r1'

This will delete all the cells present in row r1 of table 'guru99'.
HBase Shell Commands
Data manipulation commands
• scan: Scans the entire table and displays the table contents. It may include one or more attributes such as TIMERANGE, FILTER, TIMESTAMP, LIMIT, MAXLENGTH, COLUMNS, CACHE, STARTROW and STOPROW.

Syntax: scan <'tablename'>, {optional parameters}

For eg. scan 'education'

The different usages of the scan command are:

Command: scan '.META.', {COLUMNS => 'info:regioninfo'}
Usage: Displays all the metadata information related to columns that are present in the tables in HBase.

Command: scan 'education', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
Usage: Displays the contents of table education with column families c1 and c2, limiting the output to 10 values, starting from row 'xyz'.

Command: scan 'education', {COLUMNS => 'c1', TIMERANGE => [804, 904]}
Usage: Displays the contents of education for column c1, with the values present within the mentioned time range attribute value.

Command: scan 'education', {RAW => true, VERSIONS => 10}
Usage: In this command, RAW => true provides an advanced feature: it displays all the cell values (up to 10 versions) present in the table education.
Cluster Replication Commands

Command: add_peer
Functionality: Adds peers to the cluster to replicate.
hbase> add_peer '3', zk1,zk2,zk3:2182:/hbase-prod

Command: remove_peer
Functionality: Stops the defined replication stream and deletes all the metadata information about the peer.
hbase> remove_peer '1'

Command: start_replication
Functionality: Restarts all the replication features.
hbase> start_replication

Command: stop_replication
Functionality: Stops all the replication features.
hbase> stop_replication
