BDA Unit-5


Unit-5: Frameworks

Basics of Hive
 Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
 It resides on top of Hadoop to summarize Big Data and makes querying and analysing easy.
 It is a platform used to develop SQL-like scripts that run as MapReduce operations.
 Hive was initially developed by Facebook; later the Apache Software Foundation took it up
and developed it further as open source under the name Apache Hive.
 It is used by different companies. For example, Amazon uses it in Amazon Elastic
MapReduce.

Features of Hive
 Hive is not
 A relational database
 A design for OnLine Transaction Processing (OLTP)
 A language for real-time queries and row-level updates
 Features of Hive
 It stores the schema in a database and the processed data in HDFS.
 It is designed for OLAP.
 It provides an SQL-like query language called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.

Hive Architecture
 Hive Components
 Shell
 Meta Store
 Compiler
 Execution Engine
 Driver
Hive Components
 Hive is data warehouse infrastructure software that mediates the interaction between the user
and HDFS.
 The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive
HDInsight (on Windows Server).
 Hive uses a database server to store the schema or metadata of tables,
databases, columns in a table, their data types, and the HDFS mapping.
 HiveQL is similar to SQL and queries the schema information held in the metastore.
 It is one of the replacements for the traditional approach of writing MapReduce programs. Instead of
writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.
 The Hive execution engine connects the HiveQL process engine and MapReduce.
It processes the query and generates the same results as an equivalent MapReduce job.
 HDFS or HBase is the data storage technique used to store the data in the
file system.

Hive Integration and Work Flow


 Workflow: hourly log data can be stored directly in HDFS, and then
data cleansing is performed on the log file.
 Finally, Hive table(s) can be created to query the log file.

Hive Data Units


1. Databases:
 The namespace for tables.
2. Tables:
 Set of records that have similar schema.
3. Partitions:
 Logical separations of data based on classification of given information as per specific
attributes.
 Once Hive has partitioned the data based on a specified key, it assembles the
records into specific folders as and when the records are inserted.
4. Buckets (or Clusters):
 Similar to partitions, but a hash function is used to segregate the data and determine the
cluster or bucket into which each record should be placed.

Partitioning & Bucketing


 Partitioning tables changes how Hive structures the data storage. Hive will create
subdirectories reflecting the partitioning structure like
 .../customers/country=ABC
 Although partitioning helps in enhancing performance and is recommended, having
too many partitions may prove detrimental for some queries.
 Bucketing is another technique for managing large datasets.
 If we partition the dataset based on customer_id, we would end up with far too many
partitions.
 Instead, if we bucket the customer table and use customer_id as the bucketing
column, the value of this column will be hashed into a user-defined number of
buckets.
 Records with the same customer_id will always be placed in the same bucket.
 Assuming we have far more customer_ids than the number of buckets, each bucket
will house many customer_ids.
 While creating the table you can specify the number of buckets that you would like
your data to be distributed in using the syntax “CLUSTERED BY (customer_id) INTO
XX BUCKETS”; here XX is the number of buckets.
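
The directory layout and the bucket assignment described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and taking the integer key modulo the bucket count is an assumed stand-in for Hive's own hash function, which depends on the column type.

```python
def partition_dir(table_dir, column, value):
    """Build a partition subdirectory path in the column=value style."""
    return f"{table_dir}/{column}={value}"


def bucket_for(customer_id, num_buckets):
    """Pick a bucket for an integer key (assumed hash: the key itself)."""
    return customer_id % num_buckets


if __name__ == "__main__":
    print(partition_dir(".../customers", "country", "ABC"))
    # The same customer_id always lands in the same bucket.
    for cid in (101, 206, 101, 311, 400):
        print(cid, "-> bucket", bucket_for(cid, 4))
```

With far more customer_ids than buckets, many ids share each bucket, while one partition directory per customer_id would mean as many directories as customers.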

Hive Architecture
 Hive Command-Line Interface (Hive CLI): The most commonly used interface to interact
with Hive.
 Hive Web Interface: It is a simple Graphical User Interface to interact with Hive and to
execute queries.
 Hive Server: This is an optional server. It can be used to submit Hive jobs from a
remote client.
 JDBC/ODBC: Jobs can be submitted from a JDBC client. One can write Java code to
connect to Hive and submit jobs to it.
 Driver: Hive queries are sent to the driver for compilation, optimization and execution.
 Meta-store: Hive table definitions and mappings to the data are stored in a Meta-store.
 A Meta-store consists of the following:
 Meta-store Service: Offers interface to the Hive.
 Database: Stores data definitions, mappings to the data and others.

Metadata & Metastore


 The metadata stored in the metastore includes IDs of databases, IDs of tables, IDs of
indexes, etc.
 It also includes the time of creation of a table, the input format used for a table, the output
format used for a table, etc.
 The metastore is updated whenever a table is created in or deleted from Hive.
 There are three kinds of metastore.
1. Embedded Metastore
2. Local Metastore
3. Remote Metastore

 Embedded Metastore
 This metastore is mainly used for unit tests.
 Here, only one process is allowed to connect to the metastore at a time. This is the
default metastore for Hive.
 It uses an Apache Derby database. In this mode, both the database and the metastore
service run embedded in the main Hive Server process.
 Local Metastore
 Metadata can be stored in any RDBMS component like MySQL.
 The local metastore allows multiple connections at a time. In this mode, the Hive metastore
service runs in the main Hive Server process, but the metastore database runs
in a separate process, and can be on a separate host.
 Remote Metastore
 In this mode, the Hive driver and the metastore interface run on different JVMs (which can
run on different machines as well).
 This way the database can be firewalled from the Hive user, and the database
credentials are completely isolated from the users of Hive.
HiveQL:

 Hive's SQL language is known as HiveQL. It is a combination of SQL-92, Oracle's SQL dialect
and MySQL's.
 HiveQL provides some features from later SQL standards, such as
analytic functions from SQL:2003.
 Some of Hive's extensions, such as multitable inserts and the TRANSFORM, MAP and REDUCE clauses, are
inspired by MapReduce.

SQL vs. HQL

Feature | SQL | HiveQL
Updates | UPDATE, INSERT, DELETE | UPDATE, INSERT, DELETE
Transactions | Supported | Limited support
Indexes | Supported | Supported
Data types | Integral, floating-point, fixed-point, text and binary strings, temporal | Boolean, integral, floating-point, fixed-point, text and binary strings, temporal, array, map, struct
Functions | Hundreds of built-in functions | Hundreds of built-in functions
Multitable inserts | Not supported | Supported
CREATE TABLE ... AS SELECT | Not supported | Supported
SELECT | Supported | Supported with SORT BY clause for partial ordering and LIMIT to restrict number of rows returned
Joins | Supported | Inner joins, outer joins, semi joins, map joins, cross joins
Subqueries | Used in any clause | Used in FROM, WHERE, or HAVING clauses
Views | Updatable | Read-only

Tables
 A Hive table logically comprises the data being stored and the associated metadata
describing the layout of the data in the table. The data normally resides in HDFS,
although it may reside in any Hadoop filesystem, including the local filesystem
or S3.
 Hive metadata is stored in a relational database.

Managed Tables and External tables

 Two types of table can be created in Hive: managed tables and external tables.
 When a table is created in Hive, by default Hive manages the data, which means Hive
moves the data into its warehouse directory. Such a table is known as a managed table.
 You can also create an external table, which tells Hive that the data is stored at a location
outside the warehouse directory.
 The difference between these two types of table can be observed in the LOAD and DROP statements.
 First let's take the example of a managed table.
 When data is loaded into a managed table, it is moved to Hive's warehouse directory.
For example:
CREATE TABLE managed_tbl (dummy STRING);
LOAD DATA INPATH '/user/jerry/test.txt' INTO TABLE managed_tbl;

 The above query moves the file hdfs://user/jerry/test.txt into Hive's warehouse directory for
the managed_tbl table, that is hdfs://user/hive/warehouse/managed_tbl
 The directory name in Hive's warehouse is the same as the managed table name.
 To drop the table, the following query is used:
DROP TABLE managed_tbl;
 After the above query is executed, the table is deleted along with its data and metadata.
 The LOAD operation performs a move and the DROP performs a delete,
after which the data no longer exists. This is what it means for Hive to manage the data.
 An external table acts differently. Here you control the creation and deletion of the data. The
external data location is specified at the time of table creation, as follows:
CREATE EXTERNAL TABLE external_tbl (dummy STRING)
LOCATION '/user/jerry/external_tbl';
LOAD DATA INPATH '/user/jerry/test.txt' INTO TABLE external_tbl;
 The keyword EXTERNAL tells Hive that it is not managing the data; hence the data is
not moved to the warehouse directory. Hive also does not check whether the external location
exists at the time the table is defined.
 This means that the data can be created after the table is created.
 When the external table is dropped, Hive only deletes the metadata and leaves the data
untouched.
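
The LOAD and DROP behaviour of the two table types can be mimicked with a small toy model. This is a sketch for intuition only: the ToyHive class and its methods are hypothetical, and a plain dictionary stands in for the file system.

```python
class ToyHive:
    """Toy model: 'files' maps a path to its contents, 'tables' holds metadata."""

    WAREHOUSE = "/user/hive/warehouse"

    def __init__(self, files):
        self.files = dict(files)
        self.tables = {}

    def create_table(self, name, external_location=None):
        # A managed table lives under the warehouse directory.
        location = external_location or f"{self.WAREHOUSE}/{name}"
        self.tables[name] = {"location": location,
                             "external": external_location is not None}

    def load(self, path, name):
        # LOAD DATA INPATH moves the file into the table's directory.
        filename = path.rsplit("/", 1)[-1]
        target = f"{self.tables[name]['location']}/{filename}"
        self.files[target] = self.files.pop(path)

    def drop(self, name):
        meta = self.tables.pop(name)  # the metadata is always removed
        if not meta["external"]:
            # Managed table: the data is deleted as well.
            prefix = meta["location"] + "/"
            for path in [p for p in self.files if p.startswith(prefix)]:
                del self.files[path]


if __name__ == "__main__":
    hive = ToyHive({"/user/jerry/test.txt": "data"})
    hive.create_table("external_tbl", "/user/jerry/external_tbl")
    hive.load("/user/jerry/test.txt", "external_tbl")
    hive.drop("external_tbl")
    print(hive.files)  # the data file survives the drop
```

Running the same steps without the external location removes the file on drop, which is the managed table behaviour.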

Importing Data
 Data can be imported using Load Data, Inserts and Create Table ..... As Select.

1. LOAD DATA
 It imports the data by moving or copying files into the table's directory.
 LOAD DATA is used as follows to import data:
CREATE TABLE test (dummy STRING);
LOAD DATA INPATH '/user/jerry/demo.txt' INTO TABLE test;

2. Inserts
 Insert statements are used as follows to import data into a table:
INSERT OVERWRITE TABLE target_tbl
SELECT col1, col2
FROM source_tbl;
 In the above query the OVERWRITE keyword specifies that the content of the target table is replaced
by the results of the SELECT statement.
 To add records to an already populated table without replacing its content, INSERT INTO TABLE can be used.
 Hive also supports multitable inserts. A multitable insert starts with the FROM clause, as follows:
FROM source_table
INSERT OVERWRITE TABLE target_table
SELECT col1, col2;
 Further INSERT clauses can be added to the above query after the first one. For
this reason it is called a multitable insert.
 A multitable insert is more efficient than multiple INSERT statements because the
source table needs to be scanned only once to produce the multiple disjoint outputs.
 Following is an example of a multitable insert:
FROM records2_tbl
INSERT OVERWRITE TABLE stations_by_year
SELECT year, COUNT(DISTINCT station)
GROUP BY year
INSERT OVERWRITE TABLE records_by_year
SELECT year, COUNT(1)
GROUP BY year
INSERT OVERWRITE TABLE good_records_by_year
SELECT year, COUNT(1)
WHERE temperature != 9999 AND quality IN (0, 1, 4, 5, 9) GROUP BY year;

The above query shows that records2_tbl is the only source table, while three different tables
store the results of the three SELECT statements.
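
The single-scan idea behind the multitable insert can be sketched in Python. The function name and the record layout (year, station, temperature, quality) are assumptions made for this illustration; the point is that one pass over the source fills all three outputs.

```python
from collections import defaultdict

GOOD_QUALITY = {0, 1, 4, 5, 9}


def multitable_insert(records):
    """Scan the records once and fill three outputs, one per INSERT clause."""
    stations = defaultdict(set)     # year -> distinct stations
    counts = defaultdict(int)       # year -> number of records
    good_counts = defaultdict(int)  # year -> number of good records
    for year, station, temperature, quality in records:
        stations[year].add(station)
        counts[year] += 1
        if temperature != 9999 and quality in GOOD_QUALITY:
            good_counts[year] += 1
    stations_by_year = {year: len(s) for year, s in stations.items()}
    return stations_by_year, dict(counts), dict(good_counts)


if __name__ == "__main__":
    sample = [(1949, "A", 111, 1), (1949, "B", 9999, 1),
              (1950, "A", 22, 0), (1950, "A", 0, 2)]
    for output in multitable_insert(sample):
        print(output)
```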

3. CREATE TABLE.... AS SELECT


 It is often convenient to store the output of a Hive query in a new table, so as to perform
further processing on the output.
 The column definitions of the new table are derived from the SELECT clause.
 In the following query, target_table has two columns named col1 and col2 whose data
types are the same as in source_table:
CREATE TABLE target_table
AS
SELECT col1, col2
FROM source_table;
 A CTAS operation is atomic, so if the SELECT query fails for any reason, the table is not created.

Altering table
 In Hive the table definition can be changed after the table is created, since Hive uses a schema
on read approach.
 A table can be renamed by using the ALTER TABLE statement:
ALTER TABLE source_name RENAME TO target_name;
 Along with updating the table metadata, the ALTER TABLE statement also moves the table directory
so that it reflects the new name. For the above query the directory
/user/hive/warehouse/source_name is renamed to /user/hive/warehouse/target_name.
 Hive permits adding new columns, changing the definition of a column, or replacing the
existing columns with a new set of columns.
 A new column can be added to an existing table as follows:
ALTER TABLE target_name ADD COLUMNS (col3 STRING);
 The new column col3 is added after the existing columns. Since the data files are not updated, queries
on col3 return NULL for the existing records; Hive does not update the existing records.
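
Why the new column reads as NULL can be seen from a small schema on read sketch. The helper name is hypothetical, and the tab delimiter is an assumption chosen for readability rather than Hive's default field separator.

```python
def read_row(line, columns):
    """Split a stored line and pad missing trailing columns with None."""
    fields = line.rstrip("\n").split("\t")
    # Old data files have fewer fields than the new schema; fill the gap.
    fields += [None] * (len(columns) - len(fields))
    return dict(zip(columns, fields))


if __name__ == "__main__":
    # The file was written before col3 was added to the table definition.
    print(read_row("a\tb", ["col1", "col2", "col3"]))
```

The stored file is never touched; only the schema applied at read time has changed.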
Dropping Table
 To delete the data and metadata of a table, the DROP TABLE statement is used. If the table is
external, then only the metadata is deleted and the data is left as it is.
 The DROP TABLE query can be used as follows:
DROP TABLE test_table;

 In case you want to keep the table definition and delete only the data, the TRUNCATE TABLE statement is
used. The query is as follows:
TRUNCATE TABLE test_table;

Querying Data

Sorting and Aggregating


 Data in Hive can be sorted by using the ORDER BY clause. It performs a parallel total sort of the input.
When a globally sorted result is not necessary, the SORT BY clause can be used instead.
SORT BY produces a sorted file for each reducer.
 When you want to control which reducer a particular row goes to, Hive's DISTRIBUTE
BY clause is used.
 The following example sorts the weather dataset by year and temperature, in such a way that all
rows for a given year go to the same reducer:
hive> FROM records2_tbl
> SELECT year, temp
> DISTRIBUTE BY year
> SORT BY year ASC, temp DESC;
1949 111
1949 78
1950 22
1950 0
1950 -11
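
The effect of DISTRIBUTE BY followed by SORT BY can be imitated in Python. This is a sketch under stated assumptions: the function name is hypothetical, and routing by year modulo the reducer count stands in for Hive's real partitioning function.

```python
def distribute_and_sort(rows, num_reducers):
    """Route rows to reducers by year, then sort inside each reducer."""
    reducers = [[] for _ in range(num_reducers)]
    for year, temp in rows:
        # DISTRIBUTE BY year: all rows of one year reach the same reducer.
        reducers[year % num_reducers].append((year, temp))
    for part in reducers:
        # SORT BY year ASC, temp DESC: ordering holds per reducer only.
        part.sort(key=lambda row: (row[0], -row[1]))
    return reducers


if __name__ == "__main__":
    rows = [(1950, 0), (1949, 111), (1950, 22), (1949, 78), (1950, -11)]
    for index, part in enumerate(distribute_and_sort(rows, 2)):
        print("reducer", index, part)
```

Each reducer's output is sorted, but concatenating the outputs is not guaranteed to give a globally sorted result.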

MapReduce Scripts
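 To call an external program or script from Hive, the TRANSFORM, MAP and REDUCE clauses are
used, following the Hadoop Streaming approach.
 The following Python script filters out the poor quality temperature readings:
import re
import sys

# Read whitespace-separated (year, temp, quality) lines from standard input.
for line in sys.stdin:
    (y, t, q) = line.strip().split()
    # Keep readings that are not the missing-value marker and have a good quality code.
    if (t != "9999" and re.match("[01459]", q)):
        print("%s\t%s" % (y, t))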
 The above script can be used in Hive as follows:
hive> ADD FILE /user/jerry/hive-sample/src/main/python/good_quality.py;
hive> FROM records2
> SELECT TRANSFORM(year, temp, quality)
> USING 'good_quality.py'
> AS year, temp;
1950 0
1950 22
1950 -11
1949 111

 The above query streams the year, temp and quality fields as tab separated lines to
good_quality.py script, and tab separated output is parsed into year and temp field to form
the result of the query.
 The above example doesn't have a reducer. To specify both map and reduce functions, the nested
form of the query can be used.
 The following example uses one Python script as the map function, to filter out poor quality
temperature readings, and another Python script as the reduce function, to find the maximum
temperature. The MAP and REDUCE keywords are used in the Hive query.
 Map function for filtering out poor quality temperature readings in Python:
import re
import sys

for line in sys.stdin:
    (y, t, q) = line.strip().split()
    if (t != "9999" and re.match("[01459]", q)):
        print("%s\t%s" % (y, t))

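 Reduce function for maximum temperature in Python:

import sys

# Assumes input lines are grouped by key, so all values for one key are adjacent.
(last_key, max_val) = (None, -sys.maxsize)
for line in sys.stdin:
    (key, val) = line.strip().split("\t")
    if last_key and last_key != key:
        # Key changed: emit the maximum for the previous key.
        print("%s\t%s" % (last_key, max_val))
        (last_key, max_val) = (key, int(val))
    else:
        (last_key, max_val) = (key, max(max_val, int(val)))

# Emit the maximum for the final key.
if last_key:
    print("%s\t%s" % (last_key, max_val))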

 Above scripts can be used in Hive query as follows :


FROM (
FROM records2_tbl
MAP year, temp, quality
USING 'good_quality.py'
AS year, temp) map_output
REDUCE year, temp
USING 'max_temperature_reduce.py'
AS year, temp;
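
The nested MAP and REDUCE query can be tried out locally with a plain Python pipeline. This is a hedged sketch, not how Hive runs the scripts: the function names are hypothetical, and an explicit sort replaces the shuffle that would group the keys on a cluster.

```python
import re
from itertools import groupby


def map_good_quality(lines):
    """Map step: keep (year, temp) for readings with acceptable quality."""
    for line in lines:
        year, temp, quality = line.strip().split()
        if temp != "9999" and re.match("[01459]", quality):
            yield year, int(temp)


def reduce_max(pairs):
    """Reduce step: maximum temperature per year over key-sorted pairs."""
    for year, group in groupby(sorted(pairs), key=lambda pair: pair[0]):
        yield year, max(temp for _, temp in group)


def run(lines):
    """Chain the map and reduce steps, as the nested query does."""
    return list(reduce_max(map_good_quality(lines)))


if __name__ == "__main__":
    sample = ["1950\t0\t1", "1950\t22\t1", "1950\t-11\t1",
              "1949\t111\t1", "1949\t78\t1", "1949\t9999\t1"]
    for year, temp in run(sample):
        print("%s\t%s" % (year, temp))
```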
Joins
 Hive supports four types of join: inner join, outer join, semi join and map join.

1. Inner Join
 An inner join is the simplest type of join, where each match between the input tables results in a row in
the output.
 Consider two tables: things, which contains item IDs and their names, and sales, which
contains the names of people and the IDs of the items they bought.
 The following queries show the content of both tables:
hive> SELECT * FROM sales;
Jay 3
Paresh 5
Avinash 0
Shri 4
Harshal 3
hive> SELECT * FROM things;
3 Tie
5 Coat
4 Hat
1 Scarf

 Inner join is performed on the above tables as follows :


hive> SELECT sales.*, things.*
> FROM sales JOIN things ON (sales.id = things.id);
Jay 3 3 Tie
Paresh 5 5 Coat
Shri 4 4 Hat
Harshal 3 3 Tie

 The table in the FROM clause is joined with the table in the JOIN clause, using the predicate
in the ON clause.
 In Hive you can only use equality in the join predicate, which in the above example matches
on the id column of both tables.

2. Outer Join
 An outer join lets you find non-matches in the tables being joined.
 If you use LEFT OUTER JOIN, the query returns a row for every row in the left table,
even if there is no matching row in the table it is being joined to.
 In the following example the left table is sales and the right table is things. LEFT OUTER JOIN can be
performed as follows:
hive> SELECT sales.*, things.*
> FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);
Jay 3 3 Tie
Paresh 5 5 Coat
Avinash 0 NULL NULL
Shri 4 4 Hat
Harshal 3 3 Tie
 When we performed the inner join the row for Avinash was not returned, but in this
example the row for Avinash is returned, and the columns from the things table are
NULL as there is no match.
 RIGHT OUTER JOIN is also supported by Hive. In this join, compared to the left outer join, the roles
of the tables are reversed.
 In this join, all the items from the things table are included, even if they don't have a matching row
in the left table.
 In our example Scarf is also included in the output even though it was not purchased by anyone.
 The query is as follows:
hive> SELECT sales.*, things.*
> FROM sales RIGHT OUTER JOIN things ON (sales.id = things.id);
Jay 3 3 Tie
Harshal 3 3 Tie
Paresh 5 5 Coat
Shri 4 4 Hat
NULL NULL 1 Scarf

 Hive supports a full outer join, in which the result has rows from both tables even
when they have no match in the other table.
 The query is as follows:
hive> SELECT sales.*, things.*
> FROM sales FULL OUTER JOIN things ON (sales.id = things.id);
Avinash 0 NULL NULL
NULL NULL 1 Scarf
Harshal 3 3 Tie
Jay 3 3 Tie
Shri 4 4 Hat
Paresh 5 5 Coat

3. Semi Join
 To retrieve all items from the things table that also appear in the sales table, a semi join can be used.
 The query is as follows:
hive> SELECT *
> FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);
3 Tie
5 Coat
4 Hat
 Hive supports only the LEFT SEMI JOIN form of semi join.
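
The semantics of the left outer join and the left semi join can be checked against the sample tables with a short Python sketch. The function names are hypothetical and the nested loops are chosen for clarity, not for efficiency.

```python
SALES = [("Jay", 3), ("Paresh", 5), ("Avinash", 0), ("Shri", 4), ("Harshal", 3)]
THINGS = [(3, "Tie"), (5, "Coat"), (4, "Hat"), (1, "Scarf")]


def left_outer_join(sales, things):
    """Every sales row appears; unmatched rows get None for the things columns."""
    out = []
    for name, sid in sales:
        matches = [t for t in things if t[0] == sid]
        if matches:
            out.extend((name, sid, tid, item) for tid, item in matches)
        else:
            out.append((name, sid, None, None))
    return out


def left_semi_join(things, sales):
    """A things row appears at most once, and only if some sale refers to it."""
    sold_ids = {sid for _, sid in sales}
    return [t for t in things if t[0] in sold_ids]


if __name__ == "__main__":
    for row in left_outer_join(SALES, THINGS):
        print(row)
    print(left_semi_join(THINGS, SALES))
```

Note that Tie is bought twice but is listed only once by the semi join, which is the difference from an inner join.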

4. Map Join
Consider the following inner join query :
SELECT sales.*, things.*
FROM sales JOIN things ON (sales.id = things.id);
 If one table is small enough to fit in memory, as things is in our example, Hive can load that
table into memory and perform the join in the mapper. This is known as a map join.
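
The idea of a map join, holding the small table in memory and streaming the large one past it, can be sketched as a dictionary lookup. The function name is hypothetical and the sketch ignores how Hive ships the small table to the mappers.

```python
def map_join(big_rows, small_rows):
    """Inner join that keeps the small table in memory as a lookup dict."""
    lookup = {}
    for tid, item in small_rows:  # load the small table once
        lookup.setdefault(tid, []).append(item)
    for name, sid in big_rows:    # stream the large table row by row
        for item in lookup.get(sid, []):
            yield name, sid, sid, item


if __name__ == "__main__":
    sales = [("Jay", 3), ("Paresh", 5), ("Avinash", 0), ("Shri", 4), ("Harshal", 3)]
    things = [(3, "Tie"), (5, "Coat"), (4, "Hat"), (1, "Scarf")]
    for row in map_join(sales, things):
        print(row)
```

No sorting or regrouping of the large table is needed, which is why the join can finish in the map phase.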
Sub Queries
 If a SELECT statement is embedded inside another SQL statement, it is known as a
subquery.
 Hive has limited support for subqueries: it allows a subquery in the FROM clause of a
SELECT statement, or in the WHERE clause in some cases.
 Following query retrieves the mean maximum temperature for each year and weather
station:
SELECT station, year, AVG(max_temp)
FROM (
SELECT station, year, MAX(temp) AS max_temp
FROM records2
WHERE temp!= 9999 AND quality IN (0, 1, 4, 5, 9)
GROUP BY station, year
) mt
GROUP BY station, year;
 In the above query the FROM subquery finds the maximum temperature for every station and year
combination, and the outer query uses the AVG function to find the mean maximum temperature for
every year and station.
 The outer query uses the output of the subquery, so the subquery must be assigned an alias
(mt in the above query). The columns of the subquery must have unique names so that the outer query
can refer to them.
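
The two levels of the query can be mirrored in Python, with the inner aggregation feeding the outer one. The function name and the record layout (station, year, temp, quality) are assumptions for this sketch.

```python
from collections import defaultdict

GOOD_QUALITY = {0, 1, 4, 5, 9}


def mean_max_temperature(records):
    """Inner step: MAX(temp) per (station, year); outer step: AVG of those."""
    max_temp = {}
    for station, year, temp, quality in records:
        if temp != 9999 and quality in GOOD_QUALITY:
            key = (station, year)
            max_temp[key] = max(temp, max_temp.get(key, temp))
    # The outer query sees only the subquery output (the 'mt' alias).
    groups = defaultdict(list)
    for key, value in max_temp.items():
        groups[key].append(value)
    return {key: sum(values) / len(values) for key, values in groups.items()}


if __name__ == "__main__":
    sample = [("S1", 1949, 111, 1), ("S1", 1949, 78, 1),
              ("S1", 1950, 22, 0), ("S1", 1950, 9999, 0)]
    print(mean_max_temperature(sample))
```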

Basics of Pig
 Apache Pig is an abstraction over MapReduce.
 It is a tool/platform for analysing large data sets and representing them as data
flows.
 Pig is commonly used with Hadoop; we can use Apache Pig to do all data manipulation
operations in Hadoop.
 Pig provides a high-level language called Pig Latin for writing data analysis programs.
 The language offers a variety of operators that programmers can use to develop their own
functions for reading, writing, and processing data.
 To analyze data using Apache Pig, programmers must write scripts in Pig Latin.
 All of these scripts are converted internally into Map and Reduce tasks.
 Apache Pig has a component called Pig Engine that takes Pig Latin scripts as input and
converts these scripts into MapReduce jobs.
Need of Pig in Bigdata
 Programmers who are not very good at Java often encounter difficulties using Hadoop,
especially when performing MapReduce tasks.
 Apache Pig is a boon for all such programmers.
 Pig Latin allows programmers to perform MapReduce tasks easily without writing complex
code in Java.
 Apache Pig uses a multi-query approach to reduce code length.
 For example, an operation that requires 200 lines of code (LoC) in Java can be
performed with just 10 lines of code in Apache Pig. Apache Pig reduces development
time by almost 16 times.
 Pig Latin is an SQL-like language, so it is easy to learn Apache Pig if you are familiar
with SQL.
 It has many built-in operators to support data operations such as join, filter, sort, etc.
 It also provides nested data types such as tuples, bags, and maps, which MapReduce lacks.
Features & Application of Pig
 Rich set of operators
 It provides many operators to perform operations like join, sort, filter, etc.
 Ease of programming
 Pig Latin is similar to SQL and it is easy to write a Pig script if you are good at SQL.
 Optimization opportunities
 The tasks in Apache Pig optimize their execution automatically, so programmers
need to focus only on the semantics of the language.
 Extensibility
 Using the existing operators, users can develop their own functions to read, process,
and write data.
 UDF’s
 Pig provides the facility to create User-defined Functions in other programming
languages such as Java and invoke or embed them in Pig Scripts.
 Handles all kinds of data
 Apache Pig analyzes all kinds of data, both structured as well as unstructured. It
stores the results in HDFS.

Application of Pig
 Apache Pig is generally used by data scientists for performing tasks involving ad-hoc
processing and quick prototyping.
 Apache Pig is used:
 To process huge data sources such as web logs.
 To perform data processing for search platforms.
 To process time sensitive data loads.

Execution Types / Run Modes


 Apache Pig executes in two modes:
1. Local Mode
2. MapReduce Mode

Local Mode & MapReduce Mode

Local Mode
 It executes in a single JVM and is used for development, experimenting and
prototyping.
 Files are installed and run using localhost.
 The local mode works on a local file system.
 The input and output data are stored in the local file system.
 Command: $ pig -x local

MapReduce Mode
 The MapReduce mode is also known as Hadoop Mode.
 It is the default mode.
 In this mode Pig renders Pig Latin into MapReduce jobs and executes them on
the cluster.
 It can be executed against a semi-distributed or fully distributed Hadoop
installation.
 Here, the input and output data are present on HDFS.
 Command: $ pig or $ pig -x mapreduce

Ways to execute Pig Program


 The following are the ways of executing a Pig program in local and MapReduce mode:
 Interactive Mode -
 In this mode, Pig is executed in the Grunt shell.
 To invoke the Grunt shell, run the pig command.
 Once the Grunt shell starts, we can enter Pig Latin statements and commands
interactively at the command line.
 Batch Mode -
 In this mode, we can run a script file having a .pig extension.
 These files contain Pig Latin commands.
 Embedded Mode -
 In this mode, we can define our own functions.
 These functions can be called as UDF (User Defined Functions).
 Here, we use programming languages like Java and Python.
Pig Work Flow & Components

Pig Architecture
 Parser
 Initially Pig scripts are handled by the Parser. It checks the syntax of the script, does
type checking, and performs other miscellaneous checks.
 The output of the parser is a DAG (directed acyclic graph), which represents the Pig Latin
statements and logical operators.
 In the DAG, the logical operators of the script are represented as nodes and the data flows
are represented as edges.
 Optimizer
 The logical plan (DAG) is passed to the logical optimizer, which carries out logical
optimizations such as projection and pushdown.
 Compiler
 The compiler compiles the optimized logical plan into a series of MapReduce jobs.
 Execution engine
 Finally, the MapReduce jobs are submitted to Hadoop in sorted order. These
MapReduce jobs are executed on Hadoop and produce the desired results.
Pig Latin Data Model



 Atom
 Any single value in Pig Latin, irrespective of its data type, is known as an Atom.
 It is stored as a string and can be used as a string or a number.
 int, long, float, double, chararray, and bytearray are the atomic types of Pig.
 A piece of data or a simple atomic value is known as a field.
 Example − 'rahul' or '50'
 Tuple
 A record that is formed by an ordered set of fields is known as a tuple; the fields can
be of any type.
 A tuple is similar to a row in a table of RDBMS.
 Example − (rahul, 50)
 Bag
 A bag is an unordered set of tuples. A collection of tuples (non-unique) is known as
a bag.
 Each tuple can have any number of fields (flexible schema). A bag is represented by
‘{}’.
 It is similar to a table in RDBMS, but unlike a table in RDBMS, it is not necessary that
every tuple contain the same number of fields or that the fields in the same position
(column) have the same type.
 Example − {(Raja, 30), (Mohammad, 45)}
 A bag can be a field in a relation; in that context, it is known as inner bag.
 Example − {Raja, 30, {9848022338, [email protected],}}
 Map
 A map (or data map) is a set of key-value pairs.
 The key needs to be of type chararray and should be unique.
 The value might be of any type. It is represented by ‘[]’
 Example − [name#Raja, age#30]
 Relation
 A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no
guarantee that tuples are processed in any particular order).
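
As a rough analogy, the Pig Latin data model can be mapped onto Python's built-in types. This mapping is an assumption made for illustration and is not how Pig stores data internally; in particular a list keeps an order that a bag does not promise.

```python
# Atoms: single values of a simple type.
name, age = "Raja", 30

# Tuple: an ordered set of fields, like one row of a table.
record = (name, age)

# Bag: a collection of tuples; duplicates are allowed and the tuples
# need not have the same number of fields.
bag = [("Raja", 30), ("Mohammad", 45), ("Raja", 30), ("Rahul",)]

# Map: key-value pairs with chararray (string) keys and values of any type.
info = {"name": "Raja", "age": 30}

# Relation: the outermost bag of tuples that a Pig Latin statement works on.
relation = bag

if __name__ == "__main__":
    print(record, len(bag), info["name"], len(relation))
```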

Apache Pig Vs. MapReduce

Apache Pig Vs. SQL


Apache Pig Vs. Hive

Basics of HBase
 HBase is a data store modelled on Google's Bigtable, designed to provide quick random access to
massive amounts of structured data.
 HBase is a distributed column-oriented database built on top of the Hadoop file system.
 It is an open source project and is horizontally scalable.
 It leverages the fault tolerance provided by the Hadoop File System (HDFS).
 It is part of the Hadoop ecosystem and provides real-time random read/write access to data
in the Hadoop file system.
 You can store data in HDFS either directly or through HBase. Data consumers read and access the
data in HDFS randomly using HBase.
 HBase sits on top of the Hadoop file system and provides read and write access.

Storage Mechanism in HBase


 HBase is a column-oriented database and the tables in it are sorted by row.
 The table schema defines only the column families, which are the key-value pairs.
 A table has multiple column families and each column family can have any number of
columns.
 Subsequent column values are stored contiguously on disk. Each cell value in the table
has a timestamp.
 In summary, in HBase:
 A table is a collection of rows.
 A row is a collection of column families.
 A column family is a collection of columns.
 A column is a collection of key-value pairs.
Table layout: Rowid | Column Family (col1, col2, col3) | Column Family (col1, col2, col3) | Column Family (col1, col2, col3) | Column Family (col1, col2, col3)
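
The nesting of rows, column families, columns and timestamped cells can be sketched with nested dictionaries. The ToyTable class and its method names are hypothetical and say nothing about HBase's on-disk format or its client API.

```python
class ToyTable:
    """row key -> column family -> column -> {timestamp: value}."""

    def __init__(self, families):
        self.families = set(families)  # only families are fixed by the schema
        self.rows = {}

    def put(self, row, family, column, value, timestamp):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        cell = (self.rows.setdefault(row, {})
                         .setdefault(family, {})
                         .setdefault(column, {}))
        cell[timestamp] = value  # every cell value carries a timestamp

    def get(self, row, family, column):
        """Return the value with the newest timestamp, or None."""
        cell = self.rows.get(row, {}).get(family, {}).get(column)
        if not cell:
            return None
        return cell[max(cell)]

    def scan(self):
        """Return the row keys in sorted order, as tables are sorted by row."""
        return sorted(self.rows)


if __name__ == "__main__":
    table = ToyTable(["personal"])
    table.put("row2", "personal", "name", "Raja", timestamp=1)
    table.put("row1", "personal", "name", "Rahul", timestamp=1)
    table.put("row2", "personal", "name", "Raj", timestamp=2)
    print(table.scan(), table.get("row2", "personal", "name"))
```

Columns can be added freely inside a family at write time, while writing to an undeclared family is rejected.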
Features of HBase
 HBase is linearly scalable.
 It has automatic failover support.
 It provides consistent reads and writes.
 It integrates with Hadoop, both as a source and a destination.
 It has an easy-to-use Java API for clients.
 It provides data replication across clusters.

Use & Application of HBase


Use of HBase:
 Apache HBase is used to have random, real-time read/write access to Big Data.
 It hosts very large tables on top of clusters of commodity hardware.
 Apache HBase is a non-relational database modeled after Google's Bigtable.
 Bigtable runs on top of the Google File System; likewise Apache HBase works on top of Hadoop and
HDFS.
Application of HBase:
 It is used whenever there is a need for write-heavy applications.
 HBase is used whenever we need to provide fast random access to available data.
 Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.

HBase Architecture

HBase has four major components:


 Client
 HMaster
 Region Servers
 Zookeeper
HBase Components
 Master Server
 Assigns regions to the region servers and takes the help of Apache ZooKeeper for this
task.
 Handles load balancing of the regions across region servers. It unloads the busy
servers and shifts the regions to less occupied servers.
 Maintains the state of the cluster by negotiating the load balancing.
 It is responsible for schema changes and other metadata operations such as creation of
tables and column families.
 Regions
 Regions are nothing but tables that are split up and spread across the region servers.
 Region server
 The region servers have regions that communicate with the client and handle data-
related operations.
 Handle read and write requests for all the regions under it.
 Decide the size of the region by following the region size thresholds.

Region Server & Region

Table: HBase table present in the HBase cluster
Region: HRegions for the presented tables
Store: One store per ColumnFamily for each region of the table
Memstore:
 One Memstore for each store for each region of the table
 It sorts data before flushing into HFiles
 Write and read performance increases because of sorting
StoreFile: StoreFiles for each store for each region of the table
Block: Blocks present inside StoreFiles
HBase Write & Read



1. To write data, the client first communicates with the Region server and then with the
Regions.
2. The Region contacts the Memstore associated with the column family to store the data.
3. Data is first stored in the Memstore, where it is sorted, and after that it is flushed into an HFile.
The main reason for using the Memstore is to store data in the distributed file system sorted by
row key. The Memstore is held in the Region server's main memory, while HFiles are written
to HDFS.
4. To read data, the client contacts the Regions.
5. The client can in turn access the Memstore directly and request the data.
6. The client then approaches the HFiles to get the data. The data is fetched and retrieved by the client.
 Memstore holds in-memory modifications to the store. The hierarchy of objects in HBase
Regions is shown from top to bottom in the Region Server & Region table.
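
The write and read path described above can be sketched as a toy store. The ToyStore class, its threshold parameter and the in-memory list standing in for HFiles are all assumptions for illustration; the real mechanics of a Region server are far more involved.

```python
class ToyStore:
    """Toy store: writes go to a Memstore, which is flushed as sorted 'HFiles'."""

    def __init__(self, flush_threshold=3):
        self.memstore = {}   # in-memory modifications
        self.hfiles = []     # each entry is a sorted list of (row_key, value)
        self.flush_threshold = flush_threshold

    def put(self, row_key, value):
        self.memstore[row_key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Data is sorted by row key before it is written out.
        if self.memstore:
            self.hfiles.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, row_key):
        # Read the Memstore first, then the flushed files, newest first.
        if row_key in self.memstore:
            return self.memstore[row_key]
        for hfile in reversed(self.hfiles):
            for key, value in hfile:
                if key == row_key:
                    return value
        return None


if __name__ == "__main__":
    store = ToyStore(flush_threshold=2)
    store.put("r2", "b")
    store.put("r1", "a")   # reaches the threshold and triggers a flush
    store.put("r3", "c")   # stays in the Memstore
    print(store.hfiles, store.memstore, store.get("r1"))
```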

Zookeeper in HBase
 Zookeeper is an open-source project that provides services like maintaining configuration
information, naming, providing distributed synchronization, etc.
 Zookeeper has temporary nodes representing different region servers. Master servers use
these nodes to discover available servers.
 In addition to availability, the nodes are also used to track server failures or network
partitions.
 Clients communicate with region servers via Zookeeper.
 In pseudo-distributed and standalone modes, HBase itself takes care of Zookeeper.
Basics of Zookeeper
 Zookeeper is a distributed co-ordination service to manage large set of hosts.
 Coordinating and managing a service in a distributed environment is a complicated process.
 Zookeeper solves this issue with its simple architecture and API.
 Zookeeper allows developers to focus on core application logic without worrying about the
distributed nature of the application.
 The Zookeeper framework was originally built at Yahoo to access its applications in an easy
and robust manner.
 Subsequently, Apache Zookeeper has become a
standard for organized service used by Hadoop, HBase and other distributed
frameworks.
 For example, Apache HBase uses Zookeeper to track the status of distributed data.

Service of Zookeeper
 Apache Zookeeper is a service used by a cluster (group of nodes) to coordinate between
themselves and maintain shared data with robust synchronization techniques.
 Naming service
 Identifying the nodes in a cluster by name. It is similar to DNS, but for nodes.
 Configuration Management
 Latest and up-to-date configuration information of the system for a joining node.
 Cluster management
 Joining / leaving of a node in a cluster and node status at real time.
 Leader election
 Electing a node as leader for coordination purpose.
 Locking and synchronization service
 Locking the data while modifying it. This mechanism helps in automatic failure
recovery when connecting to other distributed applications like Apache HBase.
 Highly reliable data registry
 Availability of data even when one or a few nodes are down.

Benefits of Zookeeper
 Simple Distributed Coordination Process
 Synchronization
 Mutual exclusion and co-operation between server processes. This process helps in
Apache HBase for configuration management.
 Ordered Messages
 Serialization
 Encodes the data according to specific rules and ensures your application runs
consistently. This approach can be used in MapReduce to coordinate the queue that
executes running threads.
 Reliability
 Atomicity
 Data transfer either succeeds or fails completely; no transaction is partial.
Zookeeper Architecture
Client: Clients, the nodes in our distributed application cluster,
access information from the server. At a regular time interval,
every client sends a message to the server to let the server know
that the client is alive.
Similarly, the server sends an acknowledgement when a client
connects. If there is no response from the connected server, the
client automatically redirects the message to another server.

Server: A server, one of the nodes in our ZooKeeper ensemble, provides all
the services to clients. It gives an acknowledgement to the client to inform it
that the server is alive.

Ensemble: Group of ZooKeeper servers. The minimum number of nodes that is
required to form an ensemble is 3.

Leader: Server node which performs automatic recovery if any of the
connected nodes fails. Leaders are elected on service startup.

Follower: Server node which follows the leader's instructions.


Zookeeper Data Model


 The ZooKeeper file system uses a tree structure for its in-memory representation. A ZooKeeper
node is referred to as a znode.
 Every znode is identified by a name, and names are separated by the path separator (/).
 First there is a root znode, denoted by “/”. Under the root, there are two logical
namespaces, config and workers.
 The config namespace is used for centralized configuration management and
the workers namespace is used for naming.
 Under the config namespace, each znode can store up to 1 MB of data. This is similar to the
UNIX file system except that a parent znode can store data as well.
 The main purpose of this structure is to store synchronized data and describe the metadata
of the znode.
 A znode consists of a version number, an access control list (ACL), a timestamp, and the data length.
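
The znode tree can be sketched as a small in-memory structure. The ZNode class and the create and lookup helpers are hypothetical names for this illustration, ACLs and timestamps are left out, and the automatic creation of missing parents is a simplification of this toy rather than documented ZooKeeper behaviour.

```python
class ZNode:
    """Toy znode: a named node that holds data and may also have children."""

    MAX_DATA_BYTES = 1024 * 1024  # the 1 MB limit mentioned in the text

    def __init__(self, name, data=b""):
        self.name = name
        self.children = {}
        self.version = 0
        self.set_data(data, bump_version=False)

    def set_data(self, data, bump_version=True):
        if len(data) > self.MAX_DATA_BYTES:
            raise ValueError("znode data exceeds 1 MB")
        self.data = data
        if bump_version:
            self.version += 1  # each update raises the version number

    @property
    def data_length(self):
        return len(self.data)


def create(root, path, data=b""):
    """Create the znode at a path such as /config/db (parents are auto-created here)."""
    node = root
    for part in path.strip("/").split("/"):
        node = node.children.setdefault(part, ZNode(part))
    node.set_data(data, bump_version=False)
    return node


def lookup(root, path):
    """Walk the tree from the root along the /-separated path."""
    node = root
    for part in path.strip("/").split("/"):
        node = node.children[part]
    return node


if __name__ == "__main__":
    root = ZNode("/")
    create(root, "/config/db", b"host=a")
    create(root, "/workers/worker1")
    lookup(root, "/config").set_data(b"parent data")  # parents can hold data too
    print(sorted(root.children), lookup(root, "/config/db").data_length)
```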
