Unit V - HBase

Apache HBase

 Open-source, distributed Hadoop database.
 HBase’s data model is similar to Google’s Bigtable.
 Provides quick random access to huge amounts of structured data.

HBase Features
 Apache HBase is linearly scalable.
 It provides automatic failure support.
 It offers consistent reads and writes.
 We can integrate it with Hadoop, both as a source and as a destination.
 It has an easy Java API for the client (a short sketch follows this list).
 HBase also offers data replication across clusters.
 Other Features:
 Consistency - offers consistent reads and writes.
 Atomic Read and Write - during one read or write process, all other processes are prevented
from performing any read or write operation on the same row.
 Sharding - HBase splits regions into smaller subregions, automatically or manually, as soon as
a region reaches a threshold size.
 High availability - it supports failover and recovery across LAN and WAN. At the core there is
a master server which monitors the region servers and maintains all metadata for the cluster.
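As an illustration of the Java client API mentioned above, the following is a minimal sketch of creating a table with two column families using the HBase 2.x-style Admin API. The table name "employee" and the column family names "personal" and "professional" are placeholder values chosen for this example, and the cluster settings are assumed to come from an hbase-site.xml on the classpath.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class CreateTableExample {
    public static void main(String[] args) throws IOException {
        // Reads the ZooKeeper quorum and other settings from hbase-site.xml on the classpath
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // "employee", "personal" and "professional" are example names for this sketch
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("employee"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("personal"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("professional"))
                    .build());
        }
    }
}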

Use cases of Apache HBase are:

 When we want random, real-time read/write access to Big Data, we use Apache HBase.
 Apache HBase makes it possible to host very large tables on top of clusters of commodity
hardware.
 HBase is a non-relational database modeled after Google’s Bigtable. Just as Bigtable works on
top of the Google File System, HBase works on top of Hadoop and HDFS.

HBase Architecture

HBase has three major components:

 The client library / ZooKeeper
 A master server (HBase Master)
 Region servers
A master server (HBase Master)
 Assigns regions to the region servers with the help of Apache ZooKeeper.
 Handles load balancing of the regions across region servers.
 It unloads the busy servers and shifts the regions to less occupied servers.
 Maintains the state of the cluster by negotiating the load balancing.
 Is responsible for schema changes and other metadata operations such as creation of tables
and column families.
ZooKeeper
 ZooKeeper is an open-source project that provides services like maintaining configuration
information, naming, providing distributed synchronization, etc.
 ZooKeeper provides distributed coordination services.
 ZooKeeper maintains which servers are alive and which are available.
 In addition to availability, the nodes are also used to track server failures or network
partitions.
 Clients use ZooKeeper to locate region servers before communicating with them.
 In pseudo-distributed and standalone modes, HBase itself will take care of ZooKeeper.

Region Server
 The region servers have regions that

◦ Communicate with the client and handle data-related operations.
◦ Handle read and write requests for all the regions under them.
◦ Decide the size of the region by following the region size thresholds.
 Regions are nothing but tables that are split up and spread across the region servers.
 Region servers are responsible for handling, managing, and executing read and write
operations on the set of regions they serve.
 The default size of a region is 256 MB, which we can configure as per requirement.
HBase META Table
 The META table is a special HBase catalog table.
 It holds the location of the regions in the HBase cluster.

HBase Data Model

 HBase is a column-oriented database and the tables in it are sorted by row key.
 The table schema defines only column families, which are the key-value pairs.
 A table can have multiple column families and each column family can have any number of
columns.
 Subsequent column values are stored contiguously on disk.
 Each cell value of the table has a timestamp.
Given below is an example schema of a table in HBase.

Rowid | Column Family    | Column Family    | Column Family    | Column Family
      | col1 col2 col3   | col1 col2 col3   | col1 col2 col3   | col1 col2 col3
  1   |                  |                  |                  |
  2   |                  |                  |                  |
  3   |                  |                  |                  |

 It is suitable for Online Analytical Processing (OLAP).
 Column-oriented databases are designed for huge tables.
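The following is a rough sketch of how the row key / column family / column / timestamp model above is used from the Java client API (HBase 2.x style). The table name "employee", the column family "personal" and the qualifier "name" are example values, and the table is assumed to already exist.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutGetExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("employee"))) {
            // Write one cell: row id "1", column family "personal", column "name"
            Put put = new Put(Bytes.toBytes("1"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Smitha"));
            table.put(put);

            // Random read of the same row; every returned cell carries a timestamp
            Result result = table.get(new Get(Bytes.toBytes("1")));
            byte[] value = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}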

HBase and RDBMS

HBase                                                  RDBMS
HBase is schema-less; it does not have the concept     An RDBMS is governed by its schema, which
of a fixed-column schema and defines only column       describes the whole structure of its tables.
families.
HBase is built for wide tables and is horizontally     An RDBMS is thin and built for small tables.
scalable.                                              It is hard to scale.
There are no transactions in HBase.                    An RDBMS is transactional.
HBase has de-normalized data.                          An RDBMS will have normalized data.
HBase is good for semi-structured as well as           An RDBMS is good for structured data.
structured data.
Apache ZooKeeper
ZooKeeper is a distributed coordination service which also helps to manage a large set of
hosts.
The ZooKeeper framework was originally built at “Yahoo!” for accessing their applications in an
easy and robust manner. Later, Apache ZooKeeper became a standard for organized services used by
Hadoop, HBase, and other distributed frameworks.
 Apache HBase uses ZooKeeper to track the status of distributed data.
Benefits of ZooKeeper
Here are the benefits of using ZooKeeper −

 Simple distributed coordination process.

 Synchronization − mutual exclusion and co-operation between server processes. This helps
Apache HBase with configuration management.

 Ordered messages − ZooKeeper stamps each update with a number denoting its order, so it
keeps track of the order of messages.

 Serialization − ZooKeeper encodes the data according to specific rules, which ensures that
the application runs consistently. This approach can be used in MapReduce to coordinate
queues for executing running threads.

 Reliability − when one or more nodes fail, the system keeps performing.

 Atomicity − data transfer either succeeds or fails completely; no transaction is partial.

 Naming service − ZooKeeper attaches a unique identification to every node, quite similar to
the DNA that helps to identify it.

 Automatic failure recovery − ZooKeeper locks the data while it is being modified, so if a
failure occurs, the cluster can recover automatically.
Characteristics of ZooKeeper

• ZooKeeper is simple
◦ ZooKeeper is, at its core, a simple filesystem that exposes a few simple operations and
some extra abstractions such as ordering and notifications.
• ZooKeeper is expressive
◦ The ZooKeeper primitives are a rich set of building blocks that can be used to build a
large class of coordination data structures and protocols.
◦ Examples include distributed queues, distributed locks, and leader election among a
group of peers.
• ZooKeeper is highly available
◦ ZooKeeper runs on a collection of machines and is designed to be highly available, so
applications can depend on it.
• ZooKeeper facilitates loosely coupled interactions
◦ ZooKeeper interactions support participants that do not need to know about one another.
◦ For example, ZooKeeper can be used as a meeting mechanism so that processes that
otherwise don’t know of each other’s existence (or network details) can discover each
other and interact.
◦ Coordinating parties may not even be contemporaneous, since one process may leave a
message in ZooKeeper that is read by another after the first has shut down.
• ZooKeeper is a library
◦ ZooKeeper provides an open source, shared repository of implementations and recipes
of common coordination patterns.
◦ Individual programmers do not have the burden of writing common protocols
themselves.
◦ Over time the community can add to, and improve, the libraries, which is to everyone’s
benefit.
Architecture of ZooKeeper

 One ZooKeeper client is connected to one ZooKeeper server at any given time.
 It has a simple client-server model in which both clients and servers are nodes (i.e.
machines).
 As a function, ZooKeeper clients make use of the services and ZooKeeper servers provide the
services.
 Applications make calls to ZooKeeper through a client library.
 The client library handles the interaction with the ZooKeeper servers.
 The ZooKeeper architecture must be able to tolerate failures.
 Also, it must be able to recover from correlated recoverable failures (such as power
outages).
 Most importantly, it must be correct or easy to implement correctly.
 Additionally, it must be fast, with high throughput and low latency.
Part        Description

Client      Clients, the nodes in our distributed application cluster, access information from the
            server. At a particular time interval, every client sends a message to the server to
            let the server know that the client is alive. Similarly, the server sends an
            acknowledgement when a client connects. If there is no response from the connected
            server, the client automatically redirects the message to another server.

Server      A server, one of the nodes in our ZooKeeper ensemble, provides all the services to
            clients and gives an acknowledgement to the client to inform it that the server is
            alive.

Ensemble    Group of ZooKeeper servers. The minimum number of nodes required to form an
            ensemble is 3.

Leader      Server node which performs automatic recovery if any of the connected nodes fails.
            Leaders are elected on service startup.

Follower    Server node which follows the leader's instructions.
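A minimal sketch of a client joining a three-server ensemble through the standard ZooKeeper Java client follows. The host names and the 5-second session timeout are placeholder values for this example; the Watcher here is used only to wait until the session is connected.

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ZkConnectExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // The connect string lists the servers of the ensemble; the client connects to one of them
        ZooKeeper zk = new ZooKeeper("host1:2181,host2:2181,host3:2181", 5000, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();   // session established with one server
                }
            }
        });

        connected.await();
        System.out.println("Connected, session id: " + zk.getSessionId());
        zk.close();
    }
}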

Hierarchical Namespace
The following diagram depicts the tree structure of the ZooKeeper file system used for in-memory
representation. A ZooKeeper node is referred to as a znode. Every znode is identified by a name and
separated by a sequence of path separators (/).

 In the diagram, first you have a root znode separated by “/”. Under root, you have two
logical namespaces, config and workers.

 The config namespace is used for centralized configuration management and the workers
namespace is used for naming.

 Under the config namespace, each znode can store up to 1 MB of data. This is similar to the
UNIX file system, except that the parent znode can store data as well. The main purpose of
this structure is to store synchronized data and describe the metadata of the znode. This
structure is called the ZooKeeper Data Model.
Every znode in the ZooKeeper data model maintains a stat structure. A stat simply provides the
metadata of a znode. It consists of a version number, an Access Control List (ACL), a timestamp,
and the data length.

 Version number − Every znode has a version number, which means that every time the data
associated with the znode changes, its corresponding version number also increases. The
version number is important when multiple ZooKeeper clients are trying to perform
operations on the same znode.

 Access Control List (ACL) − An ACL is basically an authentication mechanism for accessing
the znode. It governs all the znode read and write operations.

 Timestamp − The timestamp represents the time elapsed since znode creation and
modification. It is usually represented in milliseconds. ZooKeeper identifies every change to
a znode by a transaction ID (zxid). The zxid is unique and records the time of each
transaction, so that you can easily identify the time elapsed from one request to another.

 Data length − The total amount of data stored in a znode is the data length. You can store a
maximum of 1 MB of data.
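These stat fields can be read through the Java client; for instance, exists() returns the znode's Stat. The sketch below assumes zk is an already connected ZooKeeper handle and that a znode at the example path /config/app exists.

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZnodeStatExample {
    // zk is assumed to be an already connected ZooKeeper handle; "/config/app" is an example path
    static void printStat(ZooKeeper zk) throws KeeperException, InterruptedException {
        Stat stat = zk.exists("/config/app", false);   // false = do not set a watch
        if (stat != null) {
            System.out.println("version      : " + stat.getVersion());     // bumped on every setData
            System.out.println("created (ms) : " + stat.getCtime());       // creation timestamp
            System.out.println("modified (ms): " + stat.getMtime());       // last modification timestamp
            System.out.println("data length  : " + stat.getDataLength());  // bytes stored in the znode
            System.out.println("create zxid  : " + stat.getCzxid());       // transaction id of the create
        }
    }
}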

Types of Znodes
Znodes are categorized as persistent, sequential, and ephemeral.

 Persistent znode − A persistent znode stays alive even after the client which created that
particular znode is disconnected. By default, all znodes are persistent unless otherwise
specified.

 Ephemeral znode − Ephemeral znodes are active as long as the client is alive. When a client
gets disconnected from the ZooKeeper ensemble, its ephemeral znodes get deleted
automatically. For this reason, ephemeral znodes are not allowed to have children. If an
ephemeral znode is deleted, then the next suitable node will fill its position. Ephemeral
znodes play an important role in leader election.

 Sequential znode − Sequential znodes can be either persistent or ephemeral. When a new
znode is created as a sequential znode, ZooKeeper sets the path of the znode by attaching a
10-digit sequence number to the original name. For example, if a znode with path /myapp is
created as a sequential znode, ZooKeeper will change the path to /myapp0000000001 and set
the next sequence number as 0000000002. If two sequential znodes are created concurrently,
ZooKeeper never uses the same number for both znodes. Sequential znodes play an important
role in locking and synchronization.
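A short sketch of creating the three kinds of znodes with the Java client follows. Here zk is assumed to be an already connected ZooKeeper handle, the paths and data are example values, and OPEN_ACL_UNSAFE grants access to everyone.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeTypesExample {
    static void createZnodes(ZooKeeper zk) throws KeeperException, InterruptedException {
        // Persistent znode: survives even after the creating client disconnects
        zk.create("/myapp", "config".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Ephemeral znode: deleted automatically when this client's session ends
        zk.create("/myapp/worker", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Sequential znode: ZooKeeper appends a 10-digit counter, e.g. /myapp/task0000000001
        String path = zk.create("/myapp/task", new byte[0],
                                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        System.out.println("Created " + path);
    }
}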

Basic Operations in ZooKeeper

Operation            Description
create               Creates a znode (the parent znode must already exist)
delete               Deletes a znode (the znode must not have any children)
exists               Tests whether a znode exists and retrieves its metadata
getACL, setACL       Gets/sets the ACL for a znode
getChildren          Gets a list of the children of a znode
getData, setData     Gets/sets the data associated with a znode
sync                 Synchronizes a client’s view of a znode with ZooKeeper

• Update operations in ZooKeeper are conditional.
• A delete or setData operation has to specify the version number of the znode that is being
updated. If the version number does not match, the update will fail (see the sketch below).
• Writes in ZooKeeper are atomic. Successful write operations are saved permanently to the
ZooKeeper servers.
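A brief sketch of a conditional setData and delete using version numbers follows; zk is assumed to be an already connected ZooKeeper handle and /config/app is an example znode.

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConditionalUpdateExample {
    static void update(ZooKeeper zk) throws KeeperException, InterruptedException {
        // Read the current data; getData also fills in the znode's stat (including its version)
        Stat stat = new Stat();
        byte[] current = zk.getData("/config/app", false, stat);
        System.out.println("old value: " + new String(current));

        // Conditional update: succeeds only if the version on the server still equals stat.getVersion();
        // otherwise ZooKeeper rejects the operation with a BadVersion error
        zk.setData("/config/app", "new-value".getBytes(), stat.getVersion());

        // Passing -1 as the version skips the check (unconditional delete)
        zk.delete("/config/app", -1);
    }
}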
