Unit V-HBase
HBase Features
Apache HBase is linearly scalable.
It provides automatic failover support.
It also offers consistent reads and writes.
We can integrate it with Hadoop, both as a source as well as the destination.
It provides an easy-to-use Java API for clients, as sketched below.
HBase also offers data replication across clusters.
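A minimal sketch of that Java API, writing and reading one cell. The table name demo, column family cf, and qualifier q1 are placeholders, and the snippet assumes an HBase 2.x client with connection settings taken from hbase-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("demo"))) {
            // Write one cell: row key, column family, qualifier, value.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q1"), Bytes.toBytes("hello"));
            table.put(put);
            // Read the cell back.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q1"))));
        }
    }
}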
Other Features:
Consistency − HBase offers consistent reads and writes.
Atomic Read and Write − During one read or write process, all other processes are prevented
from performing any read or write operations.
Sharding − HBase offers automatic and manual splitting of regions into smaller subregions as
soon as a region reaches a threshold size (a manual split is sketched below).
High availability − HBase supports failover and recovery across both LAN and WAN
deployments. At the core there is a master server which monitors the region servers and
maintains all metadata for the cluster.
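To illustrate the manual side of sharding, the sketch below asks the master to split a table's regions through the Admin API. The table name demo and the split key row500 are hypothetical; automatic splitting is normally triggered by the configured region size threshold instead.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class ManualSplitSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Ask the master to split the table's regions at their midpoints ...
            admin.split(TableName.valueOf("demo"));
            // ... or at an explicit row-key boundary.
            admin.split(TableName.valueOf("demo"), Bytes.toBytes("row500"));
        }
    }
}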
HBase Architecture
HMaster
The master server (HMaster) is responsible for schema changes and other metadata operations
such as the creation of tables and column families.
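A minimal sketch of such a metadata operation through the client API, assuming an HBase 2.x client; the table name demo and column family cf are placeholders.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class CreateTableSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            TableName name = TableName.valueOf("demo");
            // The master carries out this metadata operation.
            TableDescriptor desc = TableDescriptorBuilder.newBuilder(name)
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
                    .build();
            if (!admin.tableExists(name)) {
                admin.createTable(desc);
            }
        }
    }
}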
ZooKeeper
Zookeeper is an open-source project that provides services like maintaining configuration
information, naming, providing distributed synchronization, etc.
ZooKeeper provides distributed coordination services.
ZooKeeper maintains which servers are alive and which are available.
In addition to availability, the nodes are also used to track server failures or network
partitions.
In pseudo-distributed and standalone modes, HBase itself will take care of ZooKeeper.
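A minimal sketch of a ZooKeeper client session using the standard org.apache.zookeeper API. The server address localhost:2181 and the znode path /app-config are placeholders.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkSessionSketch {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // 3-second session timeout; the watcher fires when the session is up.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();
        // Store a small piece of configuration in a persistent znode.
        zk.create("/app-config", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        System.out.println(new String(zk.getData("/app-config", false, null)));
        zk.close();
    }
}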
Region Server
The region servers have regions that communicate with clients and handle data-related
operations, handle read and write requests for all the regions under them, and decide the
size of a region by following the region size thresholds.
Benefits of ZooKeeper
Ordered Messages − ZooKeeper stamps each update with a number denoting its order, so
messages are kept ordered.
Atomicity − Data transfers either succeed or fail completely; no transaction is partial.
Automatic Failure Recovery − ZooKeeper locks the data while it is being modified, so if a
failure occurs, the cluster can recover automatically.
Characteristics of ZooKeeper
• ZooKeeper is simple
◦ ZooKeeper is, at its core, a stripped-down filesystem that exposes a few simple
operations and some extra abstractions such as ordering and notifications.
• ZooKeeper is expressive
◦ ZooKeeper's primitives are a rich set of building blocks that can be used to
build a large class of data structures and protocols.
◦ Examples include distributed queues, distributed locks, and leader election among a
group of peers (a minimal leader-election sketch follows this list).
• ZooKeeper is highly available
◦ ZooKeeper runs on a collection of machines and is designed to be highly available, so
applications can depend on it.
• ZooKeeper facilitates loosely coupled interactions
◦ ZooKeeper interactions support participants that do not need to know about one another.
◦ For example, ZooKeeper can be used as a meeting mechanism so that processes that
otherwise don’t know of each other’s existence (or network details) can discover each
other and interact.
◦ Coordinating parties may not even be contemporaneous, since one process may leave a
message in ZooKeeper that is read by another after the first has shut down.
• ZooKeeper is a library
◦ ZooKeeper provides an open source, shared repository of implementations and recipes
of common coordination patterns.
◦ Individual programmers do not have the burden of writing common protocols
themselves.
◦ Over time the community can add to, and improve, the libraries, which is to everyone’s
benefit.
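As promised above, here is a minimal leader-election sketch built from these primitives. It assumes an already-connected ZooKeeper handle and an existing persistent parent znode /election (both placeholder details), and it omits the watch-and-retry loop a production recipe would add.

import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LeaderElectionSketch {
    // Each peer calls this once; the peer whose znode carries the lowest
    // sequence number is the leader. If the leader's session dies, its
    // ephemeral znode disappears and the survivors can re-run the check.
    static boolean runForLeader(ZooKeeper zk) throws Exception {
        String myNode = zk.create("/election/candidate-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        List<String> candidates = zk.getChildren("/election", false);
        Collections.sort(candidates);
        return myNode.endsWith(candidates.get(0));
    }
}

Because the candidate znodes are ephemeral and sequential, a crashed leader's znode disappears when its session expires and the next-lowest candidate can take over.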
Architecture of ZooKeeper
At any given time, each ZooKeeper client is connected to exactly one ZooKeeper server.
ZooKeeper has a simple client-server model in which both clients and servers are nodes
(i.e. machines).
Functionally, ZooKeeper clients make use of the services and the servers provide them.
Applications make calls to ZooKeeper through a client library.
The client library handles the interaction with ZooKeeper servers here.
The ZooKeeper architecture must be able to tolerate failures.
It must also be able to recover from correlated recoverable failures (e.g. power outages).
Most importantly, it must be correct, or at least easy to implement correctly.
Additionally, it must be fast, with high throughput and low latency.
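One consequence of this design is that failover is handled by the client library, not the application: the connect string names the whole ensemble and the library picks a server, reconnecting elsewhere if it fails. The host names below are placeholders.

import org.apache.zookeeper.ZooKeeper;

public class EnsembleConnectSketch {
    public static void main(String[] args) throws Exception {
        // The client library picks one server from the list and fails over
        // to another member of the ensemble if the connection drops.
        ZooKeeper zk = new ZooKeeper(
                "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181",
                5000,                                    // session timeout in ms
                event -> { /* react to session events here */ });
        System.out.println(zk.getState());
        zk.close();
    }
}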
The parts of the ZooKeeper architecture are described below.
Client − Clients, nodes in our distributed application cluster, access information from the
server. At a regular time interval, every client sends a message to the server to let the
server know that it is alive. Similarly, the server sends an acknowledgement when a client
connects. If there is no response from the connected server, the client automatically
redirects its requests to another server.
Server − A server, one of the nodes in the ZooKeeper ensemble, provides all the services to
clients and acknowledges each client to confirm that the server is alive.
Ensemble − A group of ZooKeeper servers. The minimum number of nodes required to form an
ensemble is 3.
Leader − The server node which performs automatic recovery if any of the connected nodes
fails. Leaders are elected on service startup.
Follower − A server node which follows the leader's instructions.
Hierarchical Namespace
The following diagram depicts the tree structure of the ZooKeeper file system used for the
in-memory representation. A ZooKeeper node is referred to as a znode. Every znode is
identified by a name, and path components are separated by a slash (/).
In the diagram, first you have the root znode ("/"). Under the root, you have two
logical namespaces, config and workers.
The config namespace is used for centralized configuration management and the workers
namespace is used for naming.
Under the config namespace, each znode can store up to 1 MB of data. This is similar to the
UNIX file system except that a parent znode can store data as well. The main purpose of this
structure is to store synchronized data and describe the metadata of the znode. This
structure is called the ZooKeeper Data Model.
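A sketch of building that tree with the Java API, assuming an already-connected ZooKeeper handle; the paths and data values are placeholders mirroring the structure described above.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class NamespaceSketch {
    // Assumes `zk` is an already-connected ZooKeeper handle.
    static void buildTree(ZooKeeper zk) throws Exception {
        // Unlike a UNIX directory, a parent znode can hold data itself.
        zk.create("/config", "cluster-settings".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        zk.create("/config/block_size", "64MB".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        zk.create("/workers", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        // Workers register themselves with ephemeral znodes that vanish
        // when their sessions end.
        zk.create("/workers/worker-1", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }
}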
Every znode in the ZooKeeper data model maintains a stat structure. A stat simply provides
the metadata of a znode. It consists of a version number, an Access Control List (ACL),
timestamps, and the data length.
Version number − Every znode has a version number: every time the data associated with the
znode changes, its version number is incremented. The version number matters when multiple
ZooKeeper clients try to perform operations on the same znode.
Access Control List (ACL) − An ACL is basically an authentication mechanism for accessing
the znode. It governs all read and write operations on the znode.
Timestamp − Timestamps record when a znode was created and last modified, usually in
milliseconds. ZooKeeper identifies every change to a znode with a transaction ID (zxid).
The zxid is unique and carries the time of each transaction, so you can easily identify the
time elapsed between one request and another.
Data length − The total amount of data stored in a znode is its data length. You can store
a maximum of 1 MB of data.
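The version number in the stat structure enables optimistic, version-checked updates. A sketch, assuming a connected handle and an existing znode /app-config (a placeholder path):

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class StatSketch {
    // Assumes `zk` is connected and the znode /app-config already exists.
    static void versionedUpdate(ZooKeeper zk) throws Exception {
        Stat stat = new Stat();
        byte[] data = zk.getData("/app-config", false, stat); // fills in the stat
        System.out.println("data:        " + new String(data));
        System.out.println("version:     " + stat.getVersion());
        System.out.println("created at:  " + stat.getCtime());      // ms since epoch
        System.out.println("data length: " + stat.getDataLength()); // bytes stored
        // The write succeeds only if no other client has changed the znode
        // since we read it; otherwise ZooKeeper reports a bad version.
        zk.setData("/app-config", "v2".getBytes(), stat.getVersion());
    }
}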
Types of Znodes
Znodes are categorized as persistent, sequential, and ephemeral.
Persistent znode − A persistent znode stays alive even after the client that created it has
disconnected. By default, all znodes are persistent unless otherwise specified.
Ephemeral znode − Ephemeral znodes live only as long as the session of the client that
created them. When a client gets disconnected from the ZooKeeper ensemble, its ephemeral
znodes are deleted automatically. For this reason, ephemeral znodes are not allowed to have
children. If an ephemeral znode is deleted, the next suitable node fills its position.
Ephemeral znodes play an important role in leader election.
Sequential znode − Sequential znodes can be either persistent or ephemeral. When a new
znode is created as a sequential znode, ZooKeeper sets the path of the znode by attaching a
10-digit sequence number to the original name. For example, if a znode with path /myapp is
created as a sequential znode, ZooKeeper changes the path to /myapp0000000001 and sets the
next sequence number to 0000000002. If two sequential znodes are created concurrently,
ZooKeeper never assigns the same number to both znodes. Sequential znodes play an important
role in locking and synchronization.
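A small sketch showing the znode types in action, assuming an already-connected ZooKeeper handle; the paths follow the /myapp example above.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeTypesSketch {
    // Assumes `zk` is an already-connected ZooKeeper handle.
    static void demo(ZooKeeper zk) throws Exception {
        // ZooKeeper appends a 10-digit counter to each sequential znode's name.
        String first = zk.create("/myapp", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        String second = zk.create("/myapp", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        System.out.println(first);   // e.g. /myapp0000000001
        System.out.println(second);  // e.g. /myapp0000000002
        // An ephemeral znode is deleted automatically when the session ends.
        zk.create("/session-marker", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }
}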
delete − Deletes a znode (the znode must not have any children).