Unit 3
Introduction to HBase:
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an open-
source project and is horizontally scalable.
HBase has a data model similar to Google's Bigtable and is designed to provide quick random access to
huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop Distributed
File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the
Hadoop File System. One can store data in HDFS either directly or through HBase. Data consumers
read and access the data in HDFS randomly using HBase. HBase sits on top of the Hadoop File System
and provides read and write access.
HDFS vs HBase:
HDFS: A distributed file system suitable for storing large files.
HBase: A database built on top of HDFS.
HDFS: Does not support fast individual record lookups.
HBase: Provides fast lookups for larger tables.
HDFS: Provides high-latency batch processing.
HBase: Provides low-latency access to single rows from billions of records (random access); there is no concept of batch processing.
HDFS: Provides only sequential access to data.
HBase: Internally uses hash tables and provides random access; it stores the data in indexed HDFS files for faster lookups.
Storage Mechanism in HBase
HBase is a column-oriented database and the tables in it are sorted by row. The table schema defines
only column families, which are the key-value pairs. A table can have multiple column families, and each
column family can have any number of columns. Subsequent column values are stored contiguously
on the disk. Each cell value of the table has a timestamp. In short, in HBase:
A table is a collection of rows.
A row is a collection of column families.
A column family is a collection of columns.
A column is a collection of key-value pairs.
Row-Oriented vs Column-Oriented Databases:
Row-oriented databases are suitable for Online Transaction Processing (OLTP) and are designed for a small number of rows and columns.
Column-oriented databases, such as HBase, are suitable for Online Analytical Processing (OLAP) and are designed for huge tables.
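As a rough sketch of this storage model (the table name "customer", the column family "profile", and the row key below are illustrative assumptions, not part of these notes), the standard HBase Java client addresses a cell by row key, column family, and column qualifier:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseCellExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("customer"))) {

                // Write one cell: row key -> column family "profile", qualifier "city"
                Put put = new Put(Bytes.toBytes("row-001"));
                put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("city"), Bytes.toBytes("Hyderabad"));
                table.put(put);

                // Random read of the same row; each returned cell also carries a timestamp
                Get get = new Get(Bytes.toBytes("row-001"));
                Result result = table.get(get);
                byte[] value = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("city"));
                System.out.println("city = " + Bytes.toString(value));
            }
        }
    }

Because the row key is the primary index, lookups like the Get above are the fast random-access path described earlier.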
Difference between HBase and RDBMS:
HBase: It is schema-less; it does not have the concept of a fixed-column schema and defines only column families. It is built for wide tables and is horizontally scalable.
RDBMS: An RDBMS is governed by its schema, which describes the whole structure of its tables. It is thin and built for small tables, and it is hard to scale.
Features of HBase
Apache HBase is used to provide random, real-time read/write access to Big Data.
It hosts very large tables on top of clusters of commodity hardware.
Apache HBase is a non-relational database modeled after Google's Bigtable. Just as Bigtable acts upon
the Google File System, Apache HBase works on top of Hadoop and HDFS.
Applications of HBase:
History of HBase:
Year Event
Oct 2007 The first usable HBase along with Hadoop 0.15.0 was released.
HBase Architecture:
There are three types of servers in a master-slave HBase architecture: the HMaster, Region Servers,
and ZooKeeper.
Region Servers serve data for read and write purposes, which means clients can communicate directly
with HBase Region Servers while accessing data.
The HBase Master (HMaster) handles region assignment as well as DDL (create, delete tables) operations.
ZooKeeper maintains the live cluster state.
The data managed by the Region Servers is further stored in Hadoop DataNodes, and all HBase
data is stored in HDFS files.
Regions:
A region consists of all the rows between the start key and the end key that are assigned to that
region. The nodes in the HBase cluster to which regions are assigned are called “Region Servers”, and
the rows within each region are kept in sorted order.
A Region Server is responsible for handling, managing, and executing read and write operations on
the regions it hosts. The default size of a region is 256 MB, which we can configure as per requirement.
HMaster:
It is responsible for region assignment as well as DDL (create, delete tables) operations.
There are two main responsibilities of a master in HBase architecture:
Coordinating the region servers:
Basically, a master assigns regions on startup and re-assigns them for recovery or load balancing.
It also monitors all Region Server instances in the HBase cluster.
Admin functions:
Moreover, it acts as an interface for creating, deleting and updating tables in HBase.
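A rough sketch of such a DDL operation using the standard HBase Admin API is shown below; the table name "orders" and column family "details" are assumptions for illustration only.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.TableDescriptor;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

    public class HBaseDdlExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {

                TableName name = TableName.valueOf("orders");   // hypothetical table name

                // Create a table with a single column family "details"
                TableDescriptor desc = TableDescriptorBuilder.newBuilder(name)
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("details"))
                        .build();
                admin.createTable(desc);

                // A table must be disabled before it can be dropped
                admin.disableTable(name);
                admin.deleteTable(name);
            }
        }
    }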
Zookeeper:
To maintain server state in the HBase cluster, HBase uses ZooKeeper as a distributed coordination
service. ZooKeeper keeps track of which servers are alive and available, and it provides server failure
notification.
Pig:
Pig is a high-level programming language useful for analyzing large data sets. Pig was the result of a
development effort at Yahoo!. MapReduce is not a programming model that data analysts are familiar
with; Apache Pig enables them to focus more on analyzing bulk data sets and to spend less time writing
MapReduce programs. The Pig programming language is designed to work upon any kind of data.
Pig Architecture:
Pig allows the programmer to focus on data rather than the nature of execution.
Pig Latin is a relatively simple language that uses familiar keywords from data processing, e.g., Join,
Group and Filter.
Execution modes:
Local mode: In this mode, Pig runs in a single JVM and makes use of the local file system. This mode
is suitable only for the analysis of small datasets using Pig.
MapReduce mode: In this mode, queries written in Pig Latin are translated into MapReduce jobs
and run on a Hadoop cluster (the cluster may be pseudo-distributed or fully distributed). MapReduce mode
with a fully distributed cluster is useful for running Pig on large datasets; a short example script is sketched below.
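The short Pig Latin script below is a sketch only; the file name, field names, and filter value are assumptions, not taken from these notes. In local mode it could be run as pig -x local logs.pig, and in MapReduce mode as pig -x mapreduce logs.pig.

    -- logs.pig: load a comma-separated web log, filter it, and aggregate per user
    logs    = LOAD 'weblog.csv' USING PigStorage(',')
              AS (userid:chararray, region:chararray, bytes:int);
    south   = FILTER logs BY region == 'hyderabad';
    grouped = GROUP south BY userid;
    totals  = FOREACH grouped GENERATE group AS userid, SUM(south.bytes) AS total_bytes;
    DUMP totals;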
Sqoop:
The traditional application management system, that is, the interaction of applications with relational
databases using an RDBMS, is one of the sources that generate Big Data. Such Big Data, generated by
RDBMS, is stored in relational database servers in the relational database structure.
When Big Data storage and analysis tools of the Hadoop ecosystem such as MapReduce, Hive, HBase,
Cassandra, and Pig came into the picture, they required a tool to interact with the relational database
servers for importing and exporting the Big Data residing in them.
Sqoop occupies a place in the Hadoop ecosystem to provide feasible interaction between relational
database server and Hadoop’s HDFS.
Sqoop − “SQL to Hadoop and Hadoop to SQL”
Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to
import data from relational databases such as MySQL, Oracle to Hadoop HDFS, and export from
Hadoop file system to relational databases. It is provided by the Apache Software Foundation.
Working of Sqoop:
Sqoop Import:
The import tool imports individual tables from RDBMS to HDFS. Each row in a table is treated as a
record in HDFS. All records are stored as text data in text files or as binary data in Avro and Sequence
files.
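A minimal import invocation might look like the sketch below; the JDBC URL, credentials, table name, and target directory are placeholders, not values from these notes.

    sqoop import \
      --connect jdbc:mysql://dbserver/salesdb \
      --username sqoop_user -P \
      --table customers \
      --target-dir /user/hadoop/customers \
      -m 1

Here -m 1 runs the import with a single map task; with more mappers, Sqoop splits the table on its primary key.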
Sqoop Export:
The export tool exports a set of files from HDFS back to an RDBMS. The files given as input to Sqoop
contain records, which are called rows in the table. Those files are read and parsed into a set of records
using a user-specified delimiter.
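Conversely, a minimal export might look like the following sketch (again, every name and path is a placeholder); the target table must already exist in the database.

    sqoop export \
      --connect jdbc:mysql://dbserver/salesdb \
      --username sqoop_user -P \
      --table customer_summary \
      --export-dir /user/hadoop/customer_summary \
      --input-fields-terminated-by ','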
Zookeeper:
ZooKeeper is a distributed coordination service for managing a large set of hosts. Coordinating and
managing a service in a distributed environment is a complicated process. ZooKeeper solves this issue
with its simple architecture and API. ZooKeeper allows developers to focus on core application logic
without worrying about the distributed nature of the application.
The ZooKeeper framework was originally built at “Yahoo!” for accessing their applications in an easy
and robust manner. Later, Apache ZooKeeper became a standard for organized service used by Hadoop,
HBase, and other distributed frameworks.
For example, Apache HBase uses ZooKeeper to track the status of distributed data.
The common services provided by ZooKeeper are as follows −
Naming service − Identifying the nodes in a cluster by name. It is similar to DNS, but for nodes.
Configuration management − Latest and up-to-date configuration information of the system for a
joining node.
Cluster management − Joining / leaving of a node in a cluster and node status at real time.
Leader election − Electing a node as leader for coordination purpose.
Locking and synchronization service − Locking the data while modifying it. This mechanism helps
in automatic fail recovery when connecting to other distributed applications like Apache HBase.
Highly reliable data registry − Availability of data even when one or a few nodes are down.
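As a rough sketch of how services such as naming and configuration management map onto the ZooKeeper Java API (the connect string, znode path, and value below are illustrative assumptions), a client can publish and read a small piece of configuration as a znode:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkConfigExample {
        public static void main(String[] args) throws Exception {
            // Connect to the ensemble; the watcher lambda just logs session events
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000,
                    event -> System.out.println("event: " + event.getState()));

            // "Configuration management": publish a value under a named znode
            String path = "/batch-size";            // hypothetical top-level znode
            if (zk.exists(path, false) == null) {
                zk.create(path, "100".getBytes(),
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }

            // A joining node reads the latest value (a watch could be set to learn of changes)
            byte[] data = zk.getData(path, false, null);
            System.out.println("batch-size = " + new String(data));

            zk.close();
        }
    }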
Distributed applications offer a lot of benefits, but they throw a few complex and hard-to-crack
challenges as well. The ZooKeeper framework provides a complete mechanism to overcome all these
challenges. Race conditions and deadlocks are handled using a fail-safe synchronization approach.
Another main drawback is inconsistency of data, which ZooKeeper resolves with atomicity.
The benefits of using ZooKeeper −
Simple distributed coordination process
Synchronization − Mutual exclusion and co-operation between server processes. This process helps in
Apache HBase for configuration management.
Ordered Messages
Serialization − Encode the data according to specific rules. Ensure your application runs consistently.
This approach can be used in MapReduce to coordinate queue to execute running threads.
Reliability
Atomicity − Data transfer either succeeds or fails completely; no transaction is partial.
Flume:
Apache Flume is a tool/service/data ingestion mechanism for collecting, aggregating, and transporting large
amounts of streaming data such as log files and events from various sources to a centralized data store.
Flume is a highly reliable, distributed, and configurable tool. It is principally designed to copy streaming
data (log data) from various web servers to HDFS.
Applications of Flume:
Assume an e-commerce web application wants to analyze customer behavior from a particular region.
To do so, it would need to move the available log data into Hadoop for analysis. Here, Apache Flume
comes to our rescue.
Flume is used to move the log data generated by application servers into HDFS at a higher speed.
Advantages of Flume:
Features of Flume:
Flume ingests log data from multiple web servers into a centralized store (HDFS, HBase) efficiently.
Using Flume, we can get the data from multiple servers immediately into Hadoop.
Along with the log files, Flume is also used to import huge volumes of event data produced by social
networking sites like Facebook and Twitter, and e-commerce websites like Amazon and Flipkart.
Flume supports a large set of source and destination types.
Flume supports multi-hop flows, fan-in fan-out flows, contextual routing, etc.
Flume can be scaled horizontally.
The following illustration depicts the basic architecture of Flume. As shown in the illustration, data
generators (such as Facebook, Twitter) generate data which gets collected by individual
Flume agents running on them. Thereafter, a data collector (which is also an agent) collects the data
from the agents which is aggregated and pushed into a centralized store such as HDFS or HBase.
Flume Event
An event is the basic unit of the data transported inside Flume. It contains a byte-array payload that is to
be transported from the source to the destination, accompanied by optional headers. A typical Flume event
therefore consists of a set of optional header key-value pairs plus the byte-array body.
Flume Agent
An agent is an independent daemon process (JVM) in Flume. It receives the data (events) from clients or
other agents and forwards it to its next destination (sink or agent). Flume may have more than one agent.
Following diagram represents a Flume Agent
As shown in the diagram a Flume Agent contains three main components namely, source, channel,
and sink.
Source:
A source is the component of an Agent which receives data from the data generators and transfers it to
one or more channels in the form of Flume events.
Apache Flume supports several types of sources and each source receives events from a specified data
generator.
Example − Avro source, Thrift source, Twitter 1% source, etc.
Channel:
A channel is a transient store which receives the events from the source and buffers them till they are
consumed by sinks. It acts as a bridge between the sources and the sinks.
These channels are fully transactional and they can work with any number of sources and sinks.
Example − JDBC channel, File system channel, Memory channel, etc.
Sink:
A sink stores the data into centralized stores like HBase and HDFS. It consumes the data (events) from
the channels and delivers it to the destination. The destination of the sink might be another agent or the
central stores.
Example − HDFS sink
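Putting the three components together, a Flume agent is usually described in a Java-properties configuration file. The sketch below is an assumption-laden example (the agent name, directories, and host names are placeholders) that wires a spooling-directory source through a memory channel into an HDFS sink; it could be started with a command such as flume-ng agent --conf conf --conf-file weblog.conf --name agent1.

    # weblog.conf: name the components of agent1
    agent1.sources  = src1
    agent1.channels = ch1
    agent1.sinks    = sink1

    # Source: watch a spool directory for new log files
    agent1.sources.src1.type     = spooldir
    agent1.sources.src1.spoolDir = /var/log/webapp/spool
    agent1.sources.src1.channels = ch1

    # Channel: buffer events in memory between source and sink
    agent1.channels.ch1.type     = memory
    agent1.channels.ch1.capacity = 10000

    # Sink: write the events into HDFS
    agent1.sinks.sink1.type          = hdfs
    agent1.sinks.sink1.hdfs.path     = hdfs://namenode:8020/flume/weblogs
    agent1.sinks.sink1.hdfs.fileType = DataStream
    agent1.sinks.sink1.channel       = ch1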
Oozie:
Apache Oozie is a scheduler system to run and manage Hadoop jobs in a distributed environment. It
allows multiple complex jobs to be combined and run in sequential order to achieve a bigger task.
Within a sequence of tasks, two or more jobs can also be programmed to run parallel to each other.
Oozie is tightly integrated with the Hadoop stack, supporting various Hadoop jobs like Hive, Pig, and
Sqoop, as well as system-specific jobs like Java and Shell.
Oozie is an open-source Java web application available under the Apache License 2.0. It is responsible
for triggering the workflow actions, which in turn use the Hadoop execution engine to actually
execute the tasks. Hence, Oozie is able to leverage the existing Hadoop machinery for load balancing,
fail-over, etc.
When Oozie starts a task, it provides a unique callback HTTP URL to the task and notifies that URL
when the task is complete.
The following three types of jobs are common in Oozie −
Oozie Workflow Jobs − These are represented as Directed Acyclic Graphs (DAGs) to specify a
sequence of actions to be executed (a minimal example is sketched after this list).
Oozie Coordinator Jobs − These consist of workflow jobs triggered by time and data availability.
Oozie Bundle − These can be referred to as a package of multiple coordinator and workflow jobs.
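As a sketch of the DAG form of a workflow job mentioned above (the workflow name, action name, and Pig script are placeholders, not from these notes), a minimal workflow.xml with a single Pig action might look like this:

    <workflow-app name="daily-log-wf" xmlns="uri:oozie:workflow:0.5">
        <start to="clean-logs"/>
        <action name="clean-logs">
            <pig>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <script>clean_logs.pig</script>
            </pig>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Pig action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
        </kill>
        <end name="end"/>
    </workflow-app>

The start, action, ok/error transitions, kill, and end nodes together form the directed acyclic graph that Oozie executes; Coordinator and Bundle jobs then wrap such workflows with time and data-availability triggers.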