Unit 3 NoSQL
✔ Introduction to NoSQL
✔ NoSQL business drivers
✔ NoSQL data architecture patterns
✔ NoSQL to manage Big Data
✔ HBase overview
NoSQL
(Not Only SQL database)
What is NoSQL?
• The system response time becomes slow when an RDBMS is used for massive
volumes of data.
• The alternative to upgrading a single server ("scaling up") is to distribute the database load on multiple
hosts whenever the load increases. This method is known as "scaling out."
Features of NoSQL
Non-relational
✔ NoSQL databases never follow the relational model
CAP Theorem
– Consistency: The data should remain consistent even after the execution of an
operation. It guarantees that all storage nodes and their replicas have the same data at
the same time.
– Availability: The database should always be available and responsive; every
request is guaranteed to receive a success or failure response.
– Partition Tolerance: The system should continue to function even if the
communication among the servers is not stable, in spite of arbitrary partitioning due
to network failures.
BASE Properties of NoSQL Database
✔ Basically Available means the DB is available all the time as per the CAP theorem; that is, every
request is guaranteed to receive a response (success or failure).
✔ Soft state means that even without an input, the system state may change over time.
✔ Eventual consistency means that the system will become consistent over time. To
achieve high availability and scalability, copies of data are kept on multiple machines; thus,
changes made to any data item on one machine have to be propagated to the other replicas.
Types of NoSQL Databases
1. Key-Value Pair Based
– Data is stored in key/value pairs; it is designed to handle lots of data and
heavy load.
– It stores data as a hash table where each key is unique, and the value can be a JSON,
a BLOB, a string, etc.
– For example, a key-value pair may contain a key like "Website" associated with a value
like "amazon".
– Key-value stores are based on Amazon's
Dynamo paper.
- It uses a hash table with a unique key and a pointer to the particular item of data.
- A bucket is a logical group (not physical) of keys, so different buckets can have identical keys.
⮚ Rules to access data using Key-Value (see the sketch after this list):
1. To fetch the value associated with a key use – Get(key)
2. To store a value along with a key use – Put(key, value)
3. To fetch the list of values associated with a list of keys use – Multi-get(key1, key2, ..., keyN)
4. To remove the entry for a key from the data store use – Delete(key)
⮚ Weaknesses:
- Due to lack of consistency, they can't be used for updating part of a value.
- Maintaining unique keys becomes harder as the volume of data
increases.
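To make the four access rules concrete, here is a minimal in-memory key-value store sketched in Java. The class and method names are illustrative only and do not belong to any particular product's API.

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  // Minimal in-memory key-value store illustrating the four access rules.
  public class KeyValueStore {
      private final Map<String, String> table = new HashMap<>();

      // 1. Get(key): fetch the value associated with a key
      public String get(String key) { return table.get(key); }

      // 2. Put(key, value): store or overwrite the value for a key
      public void put(String key, String value) { table.put(key, value); }

      // 3. Multi-get(key1, ..., keyN): fetch the values for a list of keys
      public List<String> multiGet(String... keys) {
          List<String> values = new ArrayList<>();
          for (String k : keys) values.add(table.get(k));
          return values;
      }

      // 4. Delete(key): remove the entry for a key
      public void delete(String key) { table.remove(key); }

      public static void main(String[] args) {
          KeyValueStore store = new KeyValueStore();
          store.put("Website", "amazon");            // key "Website" maps to "amazon"
          System.out.println(store.get("Website"));  // prints: amazon
      }
  }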
2. Column-Based
– It works on columns and is based on the BigTable paper by Google.
– Every column is treated separately. Values of a single column are stored contiguously.
– They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN, etc.
– It is widely used to manage data warehouses, business intelligence, CRM, and library card catalogs.
Here (in the sketch below),
- The outermost key "employeeIndia" is analogous to a row.
- "address" and "projectDetails" are called column families.
- The column family "address" has columns "city" and "pincode".
- The column family "projectDetails" has columns "durationDays" and "cost".
- A column is referenced using its column family.
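Because the original figure is not reproduced here, the following Java sketch models the same record as nested maps (row key, then column family, then column, then value); the city, pincode, duration, and cost values are invented for illustration.

  import java.util.HashMap;
  import java.util.Map;

  public class ColumnFamilyExample {
      public static void main(String[] args) {
          // column family "address" with columns "city" and "pincode"
          Map<String, String> address = new HashMap<>();
          address.put("city", "Pune");        // value assumed for illustration
          address.put("pincode", "411001");   // value assumed for illustration

          // column family "projectDetails" with columns "durationDays" and "cost"
          Map<String, String> projectDetails = new HashMap<>();
          projectDetails.put("durationDays", "90");
          projectDetails.put("cost", "500000");

          // the outermost key "employeeIndia" is analogous to the row
          Map<String, Map<String, String>> employeeIndia = new HashMap<>();
          employeeIndia.put("address", address);
          employeeIndia.put("projectDetails", projectDetails);

          // a column is referenced through its column family: address -> city
          System.out.println(employeeIndia.get("address").get("city")); // Pune
      }
  }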
3. Document-Oriented
- It stores and retrieves data as a key value pair but the value part is stored as a document.
- It pairs each key with a complex data structure known as a document. Documents can
contain many different key-value pairs, or key-array pairs, or even nested documents.
- MongoDB and CouchDB are popular document-oriented
databases.
- Searching: The column and key-value types lack a formal structure and hence cannot be
indexed, so searching within values is not possible. This is resolved by the document store: using a
single ID, a query can result in getting any item out of the document store.
- The difference between key-value and document stores is that a key-value store keeps
the entire document opaque in the value portion, whereas the document store extracts
subsections of the documents so they can be queried.
• A "document path" is used like a key to access the leaf values of a document, for example:
• Employee[id=‘2003’]/address/street/buildingname/text()
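As an illustration, the document path above can be evaluated with Java's built-in XPath API. The XML content and the building name below are invented, and the predicate is written as @id because standard XPath requires @ for attributes.

  import java.io.ByteArrayInputStream;
  import javax.xml.parsers.DocumentBuilderFactory;
  import javax.xml.xpath.XPath;
  import javax.xml.xpath.XPathFactory;
  import org.w3c.dom.Document;

  public class DocumentPathExample {
      public static void main(String[] args) throws Exception {
          // A small XML document, assumed for illustration; its structure matches the path above.
          String xml =
              "<Employees>" +
              "  <Employee id='2003'>" +
              "    <address><street><buildingname>Sunrise Towers</buildingname></street></address>" +
              "  </Employee>" +
              "</Employees>";

          Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                  .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));

          // The document path acts like a key leading to the leaf value.
          XPath xpath = XPathFactory.newInstance().newXPath();
          String value = xpath.evaluate(
                  "//Employee[@id='2003']/address/street/buildingname/text()", doc);
          System.out.println(value); // prints: Sunrise Towers
      }
  }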
4. Graph-Based
- A graph database stores entities as well as the relations amongst those entities.
- Graph databases are mostly used for social networks, logistics, and spatial data.
• These databases are designed for data whose relations are well represented as a graph.
• These are used when a business problem has complex relationships among its data items, as in the sketch below.
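The following Java sketch shows the idea: entities are nodes, and named relations are edges kept in adjacency lists. The people and relation names are invented for illustration; real graph databases add indexing and traversal languages on top of this basic structure.

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  public class GraphStore {
      // An edge is a named relation pointing at another entity (node).
      static class Edge {
          final String relation, target;
          Edge(String relation, String target) { this.relation = relation; this.target = target; }
          public String toString() { return relation + " -> " + target; }
      }

      private final Map<String, List<Edge>> adjacency = new HashMap<>();

      // store a relation between two entities
      void addRelation(String from, String relation, String to) {
          adjacency.computeIfAbsent(from, k -> new ArrayList<>()).add(new Edge(relation, to));
      }

      // list all relations leaving an entity
      List<Edge> relationsOf(String entity) {
          return adjacency.getOrDefault(entity, new ArrayList<>());
      }

      public static void main(String[] args) {
          GraphStore g = new GraphStore();
          g.addRelation("Alice", "FRIEND_OF", "Bob");  // names invented for illustration
          g.addRelation("Bob", "WORKS_AT", "Acme");
          System.out.println(g.relationsOf("Alice"));  // prints: [FRIEND_OF -> Bob]
      }
  }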
NoSQL Business Drivers
• The figure shows how the business drivers volume, velocity, variability, and agility
apply pressure to the single-CPU system, resulting in cracks.
• Volume and velocity refer to the ability to handle large datasets that arrive
quickly.
• Variability refers to how diverse data types don't fit into structured tables.
Volume
• Volume forced systems designers to shift their focus from increasing speed
on a single chip to using more processors working together: the need is to scale
out (also known as horizontal scaling) rather than scale up (faster processors).
• In scale-out systems, queries are split into separate paths and sent to separate processors to divide and
conquer the work.
Velocity
• Velocity comes into the picture when real-time insertion (read and write) into
the database is needed at a high rate.
• For eg., a discount scheme in online shopping causes a burst of web traffic that will
slow down the response for every user, and tuning these systems can
be costly when both high read and write throughput is desired, when
single-processor RDBMSs are used as a back end to a web storefront.
Variability
• Companies that want to capture and report on exception data struggle
when using the rigid schema structures imposed by
RDBMSs.
• For example, if a business unit wants to capture a few custom fields for a
particular customer, all customer rows within the database need to store
this information even if it doesn't apply to them.
• Adding new columns to an RDBMS requires the system be shut down and
ALTER TABLE statements to be run.
Agility
• The most complex part of building applications using RDBMSs is the process of
putting data into and getting data out of the database.
• If your data has nested and repeated subgroups of data structures, you need to
include an object-relational mapping layer that generates the correct combination of INSERT,
UPDATE, DELETE and SELECT SQL statements to move object data to and from the
RDBMS persistence layer.
• This process is not simple and is associated with the largest barrier to rapid
change when developing new or modifying existing applications.
• Even with experienced staff, small change requests can cause slowdowns in
development and testing schedules.
• Shared RAM architecture: many CPUs access a single shared RAM over a
high-speed bus.
• Shared disk system: processors have independent RAM but share disk
storage through a storage area network (SAN).
HBase Overview
• HBase is an open-source, distributed, column-oriented database modeled after Google's Bigtable.
• It is written in Java.
• It runs on top of HDFS and is part of the Apache
Hadoop project.
• HBase operations are real-time, fast,
and consistent.
• HBase can be used in the following scenarios:
– Huge Data
– Structured Data
– Variable Schema
– Need of Compression
⮚ HBase data model terms:
– Column Family
– Column
– Timestamp
⮚ An HBase table contains column families, which are the logical and physical
grouping of columns.
⮚ All columns of the same column family share the same column family prefix.
⮚ The row key is the implicit primary key. The Rows are sorted by the row key.
• An HBase system is designed to scale linearly.
• It comprises a set of standard tables with rows and columns,
much like a traditional database.
• Each table must have an element defined as a primary key,
and all access attempts to HBase tables must use this primary
key.
HBase Architecture: HBase Data Model
Q. Write a note on the HBase data model.
⮚ Column-oriented
databases:
• In a column-oriented database, all the column values are stored together:
first-column values are stored together, then the second-column values
are stored together, and data in other columns is stored in a similar manner.
• An HBase table has the following components, described
below:
• Tables: Data is stored in a table format in HBase. But here tables are
in column-oriented format.
• Row Key: Row keys are used to search records, which makes searches
fast.
• Column Families: Various columns are combined in a column family.
These column families are stored together which makes the
searching process faster because data belonging to same column
family can be accessed together in a single seek.
• Column Qualifiers: Each column’s name is known as its column
qualifier.
• Cell: Data is stored in cells. The data is dumped into cells which are
specifically identified by rowkey and column qualifiers.
• Timestamp: Timestamp is a combination of date and time.
Whenever data is stored, it is stored with its timestamp. This makes it
easy to search for a particular version of data.
• HBase consists of:
– Set of tables
– Each table with column families and rows
– Row key acts as a Primary key in HBase.
– Any access to HBase tables uses this Primary Key
– Each column qualifier present in HBase denotes an attribute
corresponding to the object which resides in the cell.
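A minimal sketch of this model using the HBase Java client API (HBase 2.x); the table name 'employee', the column families, and the cell value are assumed for illustration.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Admin;
  import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
  import org.apache.hadoop.hbase.util.Bytes;

  public class HBaseDataModelExample {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
          try (Connection conn = ConnectionFactory.createConnection(conf);
               Admin admin = conn.getAdmin()) {

              // Create a table with two column families (names assumed for illustration).
              TableName name = TableName.valueOf("employee");
              admin.createTable(TableDescriptorBuilder.newBuilder(name)
                      .setColumnFamily(ColumnFamilyDescriptorBuilder.of("address"))
                      .setColumnFamily(ColumnFamilyDescriptorBuilder.of("projectDetails"))
                      .build());

              // Row key + column family + column qualifier identify a cell;
              // HBase attaches a timestamp automatically on write.
              try (Table table = conn.getTable(name)) {
                  Put put = new Put(Bytes.toBytes("employeeIndia"));   // row key
                  put.addColumn(Bytes.toBytes("address"),              // column family
                                Bytes.toBytes("city"),                 // column qualifier
                                Bytes.toBytes("Pune"));                // cell value
                  table.put(put);
              }
          }
      }
  }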
HBase Architecture: Components of HBase Architecture
• HBase architecture has three major components: HMaster, Region Server, and the ZooKeeper coordination
process.
1) HMaster: HBase HMaster is a lightweight process that assigns regions to region servers
in the Hadoop cluster for load balancing.
• performs DDL operations (create and delete tables) and assigns regions to the Region servers
• It coordinates and manages the Region Servers (similar to how the NameNode manages DataNodes in HDFS).
• Whenever a client wants to change the schema or any of the metadata operations, HMaster
takes responsibility for these operations.
• It assigns regions to the Region Servers on startup and re-assigns regions to Region Servers during
recovery and load balancing.
• Controlling the failover- It monitors all the Region Server’s instances in the cluster (with the help of
Zookeeper) and performs recovery activities whenever any Region Server is down.
– HMaster handles DDL (Data Definition Language) operations such as create and
delete.
2) Region Server: It serves regions and has the following components (each described below):
– Block Cache
– MemStore
– WAL (Write Ahead Log)
– HFile
• Region
– It contains all the rows between the start key and the end key
assigned to that region.
– HBase tables can be divided into a number of regions in
such a way that all the columns of a column family are stored in
one region.
– Each region contains the rows in a sorted order.
– Many regions are assigned to a Region Server, which is
responsible for handling, managing, executing reads and
writes operations on that set of regions.
• Region
– So, concluding in a simpler way:
• A table can be divided into a number of regions.
• It is a sorted range of rows storing data between a start key
and an end key.
• It has a default size of 256MB which can be configured
according to the need.
• A Group of regions is served to the clients by a Region
Server.
• A Region Server can serve approximately 1000 regions to
the client.
Region Server components –
• Block Cache –
– The most frequently read data is stored in this read cache; whenever the Block Cache is
full, the least recently used data is removed from the
BlockCache to make room for newly read data.
• MemStore-
– It stores all the incoming data before committing it to the disk or permanent memory.
– There are multiple MemStores for a region because each region contains multiple
column families.
Region Server components –
• WAL (Write Ahead Log):
– It is a file attached to every Region Server that stores new data that hasn't been persisted or committed to
permanent storage; it is used to recover such data if a server fails.
• HFile :
– It is the actual storage file that stores the rows as sorted key values on a disk.
– MemStore commits the data to an HFile when the size of the MemStore exceeds
a configured threshold.
3) ZooKeeper :
– It acts like a coordinator inside the HBase distributed environment.
– It helps in maintaining server state inside the cluster by communicating
through sessions.
– HMaster and Region Servers are registered with the ZooKeeper service; a client
needs to access the ZooKeeper quorum in order to connect with Region Servers
and HMaster.
– The ZooKeeper service keeps track of all the region servers that are there in an
HBase cluster- tracking information about how many region servers are
alive and available.
• Various services that Zookeeper provides include –
– establishing client communication with Region
Servers.
• Every Region Server, along with the HMaster, sends a continuous heartbeat
to ZooKeeper at regular intervals, and ZooKeeper checks which servers are alive and available.
• There is an inactive HMaster, which acts as a backup for the active server. If the
active server fails, the inactive one comes to the rescue.
• If a Region Server fails to send a heartbeat, its session is expired and
all listeners are notified about it. Then HMaster performs suitable recovery
actions.
• Zookeeper also maintains the .META Server's path, which helps any client
in searching for any region. The client first has to check with the .META Server in
which Region Server a region belongs, and it gets the path of that Region
Server.
•The META table is a special HBase catalog table.
• It maintains a list of all the Region Servers in the HBase storage system.
⮚ Whenever a client issues a read or write request, the following
operation occurs:
– The client retrieves the location of the META table from the ZooKeeper.
– The client then requests the location of the Region Server for the
corresponding row key from the META table to access it. The client
caches this information along with the location of the META table.
– Then it will get the row location by requesting from the corresponding
Region Server.
• For future references, the client uses its cache to retrieve the location of META
table and previously read row key’s Region Server. Then the client will not refer
to the META table, until and unless there is a miss because the region is shifted
or moved. Then it will again request to the META server and update the cache.
• Since the client does not waste time retrieving the location of the Region
Server from the META Server every time, this saves time and makes the search process
faster.
HBase Architecture: HBase Write Mechanism
Step 1: Whenever the client has a write request, the client writes the data to
the WAL (Write Ahead Log). The edits are then appended at the end of the
WAL file. This WAL file is maintained in every Region Server and Region
Server uses it to recover data which is not committed to the disk.
Step 2: Once data is written to the WAL, then it is copied to the MemStore.
Step 3: Once the data is placed in MemStore, then the client receives the
acknowledgment.
Step 4: When the MemStore reaches the threshold, it dumps or commits the
data into an HFile.
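From the client's point of view, the whole pipeline above is triggered by a single put. Below is a minimal Java-client sketch (table, row, and values assumed for illustration); the Durability setting makes the WAL of Step 1 explicit.

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Durability;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class WriteExample {
      public static void main(String[] args) throws Exception {
          try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
               Table table = conn.getTable(TableName.valueOf("employee"))) {
              Put put = new Put(Bytes.toBytes("employeeIndia"));
              put.addColumn(Bytes.toBytes("address"), Bytes.toBytes("city"), Bytes.toBytes("Pune"));
              // Step 1: the edit is written to the WAL before it reaches the MemStore.
              // SYNC_WAL is typically the effective default; SKIP_WAL would bypass
              // Step 1 at the cost of durability.
              put.setDurability(Durability.SYNC_WAL);
              table.put(put); // returns after Steps 1-3; the HFile flush (Step 4) happens later
          }
      }
  }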
HBase Write Mechanism: MemStore
• The MemStore always updates the data stored in it in lexicographical order,
as sorted KeyValues.
• There is one MemStore for each column family, and thus the updates are
stored in a sorted manner per column family.
• When the MemStore reaches the threshold, it dumps all the data into a new
HFile in a sorted manner. This HFile is stored in HDFS. HBase contains multiple
HFiles for each column family.
• Over time, the number of HFiles grows as the MemStore dumps the data.
• The MemStore also saves the last written sequence number, so the Master Server and
MemStore both know what has been committed so far and where to start from.
When a region starts up, the last sequence number is read, and from that
number, new edits start.
HBase Architecture: HBase Read Mechanism
• First, the client retrieves the location of the Region Server from the .META
Server if the client does not have it in its cache memory. Then it goes
through the following sequential steps.
• For reading the data, the scanner first looks for the row cell in Block
Cache, where all the recently read key-value pairs are stored.
• If it is not found there, the scanner moves to the MemStore,
as we know this is the write cache memory. There, it searches for the
most recently written data, which has not been dumped yet into an HFile.
• At last, it will use bloom filters and block cache to load the data from
the HFile.
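The Block Cache, MemStore, and HFile lookup is transparent to the client; a read is just a Get in the Java client API. A minimal sketch, with table and column names assumed for illustration:

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ReadExample {
      public static void main(String[] args) throws Exception {
          try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
               Table table = conn.getTable(TableName.valueOf("employee"))) {
              Get get = new Get(Bytes.toBytes("employeeIndia"));           // row key
              get.addColumn(Bytes.toBytes("address"), Bytes.toBytes("city"));
              // the server checks Block Cache, then MemStore, then HFiles
              Result result = table.get(get);
              System.out.println(Bytes.toString(
                      result.getValue(Bytes.toBytes("address"), Bytes.toBytes("city"))));
          }
      }
  }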
HBase Architecture: Compaction
• HBase combines HFiles to reduce the storage and reduce the number of
disk seeks needed for a read. This process is called compaction.
• But during this process, input-output disks and network traffic might get
congested. Compaction is of two types:
– Minor Compaction
– Major Compaction
• Minor Compaction: HBase automatically picks smaller HFiles and recommits them
to bigger HFiles. This is called Minor Compaction. It
performs merge sort for committing smaller HFiles to bigger HFiles. This helps in
optimizing disk space.
• Major Compaction: HBase
merges and recommits the smaller HFiles of a region to a new HFile. In this
process, the same column families are placed together in the new HFile. It drops
deleted and expired cells in this process.
• Whenever a Region Server fails, ZooKeeper notifies to the HMaster about the
failure.
• Then HMaster distributes and allocates the regions of crashed Region Server to
many active Region Servers. To recover the data of the MemStore of the failed
Region Server, the HMaster distributes the WAL to all the Region Servers.
• Each Region Server re-executes the WAL to build the MemStore for that failed
Region Server's column families.
• The data is written in chronological order (in a timely order) in the WAL. Therefore,
re-executing the WAL means making all the changes that were made and stored
in the MemStore file.
• So, after all the Region Servers execute the WAL, the MemStore data for all
column families is recovered.
HBase Shell Commands
General commands:
i. status : It shows the cluster status, e.g., the number of servers, dead
server count, and average load value. The parameters can be 'summary', 'simple', or
'detailed'; the default is 'summary'.
hbase> status
ii. version : It displays the version of HBase being used.
hbase> version
iii. table_help : This command provides help for table-reference commands such as scan, put, get,
disable, drop, etc.
Syntax : table_help
iv. whoami : It shows the information about the current user and the groups present in HBase.
HBase Shell Commands
The commands which operate on the tables in HBase are Data Definition
Language (DDL) commands:
• List : It lists all the tables that are present or created in HBase.
Syntax: list
HBase Shell Commands
• Disable : This command will start disabling the named table. If a table
needs to be deleted or dropped, it has to be disabled first.
• Enable : If a table is disabled in the first instance and not deleted or dropped, and if
we want to re-use the disabled table, then we have to enable it by using this
command.
• Alter : This command alters a table's schema or attributes.
For eg: alter 'education', METHOD => 'table_att_unset', NAME => 'MAX_FILESIZE'
• Put : This command puts a cell value at a defined column in a
specified row.
Syntax: put
<'tablename'>,<'rowname'>,<'columnname'>,<'value'>
HBase Shell Commands
• Get : to fetch the contents of a row or a cell. You can also add additional
parameters such as TIMERANGE, TIMESTAMP, VERSIONS and FILTERS.
For eg. hbase> get 'education', 'r1', {TIMERANGE => [ts1, ts2]}
Row r1 values in the time range ts1 to ts2 will be displayed from the education
table, as mirrored in the Java sketch below.
For eg. hbase> get 'education', 'r1', {COLUMN => ['c1', 'c2', 'c3']}
Row r1 values for columns c1, c2 and c3 will be displayed.
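For comparison, the first shell get above maps to the Java client API roughly as follows (a sketch; the column family 'cf' and the time-range values are placeholders):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class GetWithTimeRange {
      public static void main(String[] args) throws Exception {
          long ts1 = 0L, ts2 = System.currentTimeMillis(); // placeholder time range
          try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
               Table table = conn.getTable(TableName.valueOf("education"))) {
              Get get = new Get(Bytes.toBytes("r1"));
              get.setTimeRange(ts1, ts2);                              // like {TIMERANGE => [ts1, ts2]}
              get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("c1")); // like {COLUMN => 'cf:c1'}
              Result result = table.get(get);
              System.out.println(result);
          }
      }
  }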
• Truncate : This command disables, drops and recreates a table; it
will delete all the rows and columns present in the table.
HBase Shell Commands
Data manipulation commands
• Scan : It scans the entire table and displays the table contents. It may include
one or more attributes such as TIMERANGE, FILTER, TIMESTAMP, LIMIT,
MAXLENGTH, COLUMNS, CACHE, STARTROW and STOPROW.
Examples:
scan '.META.', {COLUMNS => 'info:regioninfo'}
  It displays all the metadata information related to columns that
  are present in the tables in HBase.
scan 'education', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
  It displays the contents of table education with the column families
  c1 and c2, limiting the output to 10 rows starting from row 'xyz'.
scan 'education', {COLUMNS => 'c1', TIMERANGE => [804, 904]}
  It displays the contents of education for column c1 with the values
  present in between the mentioned time range attribute values.
scan 'education', {RAW => true, VERSIONS => 10}
  In this command RAW => true provides an advanced feature to
  display all the cell versions present in the table education.
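For comparison, the second shell scan above (COLUMNS, LIMIT, STARTROW) maps to the HBase 2.x Java client roughly as follows; 'c1' and 'c2' are assumed to be column family names, as in the shell example:

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ScanExample {
      public static void main(String[] args) throws Exception {
          try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
               Table table = conn.getTable(TableName.valueOf("education"))) {
              Scan scan = new Scan()
                      .addFamily(Bytes.toBytes("c1"))     // like {COLUMNS => ['c1', 'c2']}
                      .addFamily(Bytes.toBytes("c2"))
                      .withStartRow(Bytes.toBytes("xyz")) // like {STARTROW => 'xyz'}
                      .setLimit(10);                      // like {LIMIT => 10}
              try (ResultScanner scanner = table.getScanner(scan)) {
                  for (Result row : scanner) {
                      System.out.println(row); // one Result per matching row
                  }
              }
          }
      }
  }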
Cluster Replication Commands
Command : Functionality
add_peer : Adds peers to the cluster to replicate to.
hbase> add_peer '3', zk1,zk2,zk3:2182:/hbase-prod
remove_peer : Stops the defined replication stream and
deletes all the metadata information about the peer.
hbase> remove_peer '1'