
Big Data (Unit 5)

Hadoop Eco System Frameworks

 Application of Big Data using Pig, Hive and HBase.

1. Pig :

Pig is a high-level platform or tool which is used to process large datasets.

It provides a high level of abstraction for processing over MapReduce.

It provides a high-level scripting language, known as Pig Latin, which is used to develop data analysis code.

Applications :

 Pig scripting is used for exploring large datasets.
 Provides support for ad-hoc queries across large datasets.
 Used for prototyping algorithms that process large datasets.
 Required to process time-sensitive data loads.
 For collecting large amounts of data in the form of search logs and web crawls.
 Used where analytical insights are needed through sampling.

2. Hive :

Hive is a data warehouse infrastructure tool to process structured data in Hadoop.

It resides on top of Hadoop to summarize Big Data and makes querying and analyzing easy.

It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.

Benefits :

 Ease of use
 Accelerated initial insertion of data
 Superior scalability, flexibility, and cost-efficiency
 Streamlined security
 Low overhead
 Exceptional working capacity

3. HBase :

HBase is a column-oriented, non-relational database management system that runs on top of the Hadoop
Distributed File System (HDFS). HBase provides a fault-tolerant way of storing sparse data sets, which
are common in many big data use cases.

HBase supports writing applications in Apache Avro, REST and Thrift.

Applications :

 Medical
 Sports
 Web
 Oil and petroleum
 e-commerce
Apache Pig
 Apache Pig
Pig represents Big Data as data flows. Pig is a high-level platform or tool which is used to process
large datasets. It provides a high level of abstraction for processing over MapReduce. It provides a
high-level scripting language, known as Pig Latin, which is used to develop the data analysis code. First,
to process the data which is stored in HDFS, the programmers write scripts using the Pig Latin
language. Internally, Pig Engine (a component of Apache Pig) converts all these scripts into specific
map and reduce tasks. These tasks are not visible to the programmers, in order to provide a high level of
abstraction. Pig Latin and Pig Engine are the two main components of the Apache Pig tool. The results of
Pig are always stored in HDFS.

Note: Pig Engine has two types of execution environments, i.e. a local execution environment in a single JVM
(used when the dataset is small in size) and a distributed execution environment in a Hadoop cluster.
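For illustration, a small Pig Latin script of the kind described above might look as follows (the input path, field names and filter condition are made up for the example). It can be run in the local execution environment with pig -x local script.pig, or on a Hadoop cluster with pig script.pig:

    -- load tab-separated records (from the local filesystem in local mode, or HDFS on a cluster)
    records = LOAD 'input/employees.txt' USING PigStorage('\t')
              AS (name:chararray, age:int, city:chararray);
    -- keep only the rows of interest
    adults  = FILTER records BY age >= 18;
    -- the result is written back to the filesystem (HDFS on a cluster)
    STORE adults INTO 'output/adults';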

 Difference between Pig and MapReduce

 Apache Pig is a scripting language; MapReduce is a compiled programming language.
 In Pig, abstraction is at a higher level; in MapReduce it is at a lower level.
 Pig needs fewer lines of code; the equivalent MapReduce program needs many more lines of code.
 Less development effort is needed for Apache Pig; more development effort is required for MapReduce.
 Code efficiency is lower in Pig; compared to Pig, the efficiency of hand-written MapReduce code is higher.
 Pig provides built-in functions for ordering, sorting and union; in MapReduce these data operations are hard to perform.
 Pig allows nested data types like map, tuple and bag; MapReduce does not allow nested data types.

 Comparison of Pig with Databases

There are several differences between the two languages, and between Pig and relational database
management systems (RDBMSs) in general. The most significant difference is that Pig Latin is a
data flow programming language, whereas SQL is a declarative programming language. In other
words, a Pig Latin program is a step-by-step set of operations on an input relation, in which each
step is a single transformation. By contrast, SQL statements are a set of constraints that, taken
together, define the output. In many ways, programming in Pig Latin is like working at the level
of an RDBMS query planner, which figures out how to turn a declarative statement into a system
of steps.

RDBMSs store data in tables, with tightly predefined schemas. Pig is more relaxed about the data
that it processes: you can define a schema at runtime, but it’s optional. Essentially, it will operate
on any source of tuples (although the source should support being read in parallel, by being in
multiple files, for example), where a UDF is used to read the tuples from their raw representation.
The most common representation is a text file with tab-separated fields, and Pig provides a built-
in load function for this format. Unlike with a traditional database, there is no data import process
to load the data into the RDBMS. The data is loaded from the filesystem (usually HDFS) as the
first step in the processing.



Pig’s support for complex, nested data structures further differentiates it from SQL, which operates
on flatter data structures. Also, Pig’s ability to use UDFs and streaming operators that are tightly
integrated with the language and Pig’s nested data structures makes Pig Latin more customizable
than most SQL dialects.

RDBMSs have several features to support online, low-latency queries, such as transactions and
indexes, that are absent in Pig. Pig does not support random reads or queries on the order of tens
of milliseconds. Nor does it support random writes to update small portions of data; all writes are
bulk streaming writes, just like with MapReduce.

 Features of Apache Pig

 Operator set – Many operations like join, filter and sort can be performed through built-in
operators.
 Programming ease – Pig Latin closely resembles SQL, so it is easy to write a Pig script if
you’re good at SQL.
 User-defined functions – Pig offers the facility to create user-defined functions in other
programming languages like Java, and to invoke or embed them in Pig scripts.
 Extensibility – Using the existing operators, developers can develop their own functions to read,
process and write data.
 Optimization opportunities – Pig tasks optimize their execution automatically, so the
programmers only need to focus on the semantics of the language.
 Handles all kinds of data – Pig analyses all kinds of data, whether structured, semi-structured or
unstructured, and stores the results in HDFS. Handling all kinds of data is one of the reasons Pig is
easy to program with. (The animal pig eats anything it gets its mouth on; Apache Pig is named after
it because it similarly processes all kinds of data.)

 Difference between Pig Latin and SQL

Pig SQL
Pig Latin is a procedural language. SQL is a declarative language.
In Apache Pig, schema is optional. We can store
data without designing a schema (values are Schema is mandatory in SQL.
stored as $01, $02 etc.)
The data model in Apache Pig is nested
The data model used in SQL is flat relational.
relational.
Apache Pig provides limited opportunity for There is more opportunity for query optimization
Query optimization. in SQL.

In addition to the above differences, Apache Pig Latin –

 Allows splits in the pipeline.
 Allows developers to store data anywhere in the pipeline.
 Declares execution plans.
 Provides operators to perform ETL (Extract, Transform, and Load) functions.

 Grunt
Grunt is the interactive shell for Apache Pig. It is started when you run the pig command without specifying
a script to execute, and it lets you enter and run Pig Latin statements interactively. This makes Grunt a
convenient way to learn Pig Latin and to try out statements on small datasets before putting them into a
full script.
Grunt has line-editing facilities like those found in GNU Readline (used in the bash shell and many other
command-line applications). For instance, the Ctrl-E key combination will move the cursor to the end of
the line. Grunt remembers command history, too, and you can recall lines in the history buffer using Ctrl-P
or Ctrl-N (for previous and next), or equivalently, the up or down cursor keys. Another handy feature is
Grunt’s completion mechanism, which will try to complete Pig Latin keywords and functions when you
press the Tab key.

 Pig Latin
Pig Latin is a data flow language used by Apache Pig to analyse data in Hadoop. It is a textual
language that abstracts the programming from the Java MapReduce idiom into a higher-level notation.
Pig Latin is a fairly simple language with SQL-like semantics, and it can be used in a productive
manner. It also contains a rich set of functions for data manipulation.
 Data Processing Operators in Pig
Apache Pig operators form a high-level procedural language for querying large data sets using Hadoop
and the MapReduce platform. A Pig Latin statement is an operator that takes a relation as input and
produces another relation as output. These operators are the main tools Pig Latin provides to operate
on the data.

 Loading and Storing Data

The LOAD operator is used to load data from external storage for processing in Pig. Storing the results is
straightforward, too. Here is an example of using PigStorage to store tuples as plain-text values separated
by a colon character.
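A minimal sketch of such a load/store pair is shown below; the input path and schema are made up for the example, and PigStorage(':') makes STORE write colon-separated plain text:

    records = LOAD 'input/sample.txt' USING PigStorage('\t')
              AS (year:chararray, temperature:int);
    -- each tuple is written as colon-separated text, e.g. 1950:22
    STORE records INTO 'output/colon-separated' USING PigStorage(':');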

 Filtering Data

Once you have some data loaded into a relation, often the next step is to filter it to remove the data that you
are not interested in. By filtering early in the processing pipeline, you minimize the amount of data flowing
through the system, which can improve efficiency. The FOREACH...GENERATE operator is used to act
on every row in a relation. It can be used to remove fields or to generate new ones.
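A short sketch of both operators, reusing the hypothetical records relation (year, temperature) from the previous example:

    -- FILTER removes rows we are not interested in
    good    = FILTER records BY temperature IS NOT NULL AND temperature != 9999;
    -- FOREACH...GENERATE acts on every row, here projecting just two fields
    trimmed = FOREACH good GENERATE year, temperature;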

 Grouping and Joining Data

Pig has very good built-in support for join operations, making it much more approachable than raw
MapReduce. Since the large datasets that are suitable for analysis by Pig (and MapReduce in general) are
not normalized, however, joins are used less frequently in Pig than they are in SQL.
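A brief sketch of GROUP and JOIN, again with made-up relations and field names:

    -- group records by year and count the rows in each group
    grouped = GROUP records BY year;
    counts  = FOREACH grouped GENERATE group AS year, COUNT(records) AS n;

    -- join two relations on a common key
    sales     = LOAD 'input/sales.txt'     AS (cust_id:int, amount:double);
    customers = LOAD 'input/customers.txt' AS (id:int, name:chararray);
    joined    = JOIN sales BY cust_id, customers BY id;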

 Sorting Data

Relations are unordered in Pig. There is no guarantee which order the rows will be processed in. If you
want to impose an order on the output, you can use the ORDER operator to sort a relation by one or more
fields.
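For example (continuing the hypothetical counts relation from above):

    -- sort by count, highest first, breaking ties on year
    sorted = ORDER counts BY n DESC, year ASC;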

 Combining and Splitting Data

Sometimes you have several relations that you would like to combine into one. For this, the UNION
statement is used. Pig attempts to merge the schemas from the relations that UNION is operating on. The
SPLIT operator is the opposite of UNION: it partitions a relation into two or more relations.
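A small sketch of both, with made-up relations that share the same schema:

    part1    = LOAD 'input/log1.txt' AS (id:int, msg:chararray);
    part2    = LOAD 'input/log2.txt' AS (id:int, msg:chararray);
    -- UNION merges the two relations into one
    combined = UNION part1, part2;
    -- SPLIT partitions a relation into two (or more) relations
    SPLIT combined INTO low IF id < 1000, high IF id >= 1000;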

Hive
Hive is a data warehouse and ETL tool which provides an SQL-like interface between the user and the
Hadoop Distributed File System (HDFS). It is built on top of Hadoop and provides data query and analysis.
It facilitates reading, writing and handling wide datasets that are stored in distributed storage and queried
using Structured Query Language (SQL)-like syntax. It is not built for Online Transactional Processing
(OLTP) workloads. It is frequently used for data warehousing tasks like data encapsulation, ad-hoc queries,
and analysis of huge datasets. It is designed to enhance scalability, extensibility, performance,
fault-tolerance and loose coupling with its input formats.

Hive was initially developed by Facebook and is now used by companies such as Amazon and Netflix. It
delivers standard SQL functionality for analytics. Without Hive, traditional SQL-style queries would have
to be implemented in the MapReduce Java API to execute SQL applications and queries over distributed
data. Hive provides portability, since most data warehousing applications work with SQL-based query
languages.

Apache Hive is a data warehouse system for Hadoop that runs SQL-like queries called HQL (Hive Query
Language), which get internally converted to MapReduce jobs. It supports Data Definition Language, Data
Manipulation Language and user-defined functions. Apache Hive is a data warehouse software project that
is built on top of the Hadoop ecosystem. It provides an SQL-like interface to query and analyse large
datasets stored in Hadoop’s distributed file system (HDFS) or other compatible storage systems.

Hive uses a language called HiveQL, which is similar to SQL, to allow users to express data queries,
transformations, and analyses in a familiar syntax. HiveQL statements are compiled into MapReduce jobs,
which are then executed on the Hadoop cluster to process the data.

 Apache Hive Architecture

The above figure shows the architecture of Apache Hive and its major components. The major
components of Apache Hive are:

 User Interface (UI) – As the name describes, the user interface provides an interface between the user
and Hive. It enables users to submit queries and other operations to the system. The Hive Web UI, the
Hive command line, and Hive HD Insight (on Windows Server) are supported by the user interface.

 Hive Server – It is referred to as Apache Thrift Server. It accepts the request from different
clients and provides it to Hive Driver.

 Driver – The driver receives the user's queries submitted through the interface. It implements the
concept of session handles and provides execute and fetch APIs modelled on JDBC/ODBC
interfaces.

 Compiler – The compiler parses the query and performs semantic analysis on the different query
blocks and query expressions. It eventually generates an execution plan with the help of the table
and partition metadata obtained from the metastore.

 Metastore – The metastore stores all the structural information of the different tables and partitions
in the warehouse, including column and column-type information, the serializers and de-serializers
necessary to read and write data, and the corresponding HDFS files where the data is stored. Hive
selects corresponding database servers to store the schema or metadata of databases, tables,
attributes in a table, data types of databases, and the HDFS mapping.

 Execution Engine – The execution engine executes the execution plan created by the compiler. The
plan is a DAG of stages. The execution engine manages the dependencies between the various
stages of the plan and executes these stages on the suitable system components.

 HDFS or HBASE – The Hadoop Distributed File System or HBase is the data storage technique used
to store data in the file system.

 Hive Shell

The shell is the primary way that we will interact with Hive, by issuing commands in HiveQL.
HiveQL is Hive’s query language, a dialect of SQL. It is heavily influenced by MySQL, so if you
are familiar with MySQL, you should feel at home using Hive. When starting Hive for the first
time, we can check that it is working by listing its tables; there should be none. HiveQL is generally
case insensitive (except for string comparisons).

 Hive shell is a primary way to interact with hive.


 It is a default service in the hive.
 It is also called the CLI (command line interface).
 Hive shell is similar to MySQL Shell.
 Hive users can run HQL queries in the hive shell.
 In hive shell up and down arrow keys are used to scroll previous commands.
 HiveQL is case-insensitive (except for string comparisons).
 The Tab key will autocomplete (i.e. provide suggestions as you type) Hive keywords and
functions.

Hive Shell can run in two modes:

Non-Interactive mode :

 In non-interactive mode, Hive runs a script of HiveQL statements instead of prompting for commands.
 Hive Shell can run in non-interactive mode with the -f option.
 Example: $hive -f script.q, where script.q is a file containing HiveQL statements.

Interactive mode :

Hive works in interactive mode when you directly type the command “hive” in the terminal.

Example: $hive

Hive> show databases;

Other useful Hive shell features include the ability to run commands on the host operating system
by using a ! prefix to the command, and the ability to access Hadoop filesystems using the dfs
command.
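For example (the command and path shown are just placeholders):

    hive> !date;
    hive> dfs -ls /user/hive/warehouse;

The first line runs date on the host operating system; the second lists the warehouse directory in HDFS without leaving the Hive shell.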



 Hive Services
The following are the services provided by Hive:-

 Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive
queries and commands.
 Hive Web User Interface - The Hive Web UI is just an alternative of Hive CLI. It provides a
web-based GUI for executing Hive queries and commands.
 Hive MetaStore - It is a central repository that stores all the structural information of the various
tables and partitions in the warehouse. It also includes metadata about columns and their types,
the serializers and de-serializers which are used to read and write data, and the corresponding
HDFS files where the data is stored.
 Hive Server - It is referred to as Apache Thrift Server. It accepts the request from different
clients and provides it to Hive Driver.
 Hive Driver - It receives queries from different sources like web UI, CLI, Thrift, and
JDBC/ODBC driver. It transfers the queries to the compiler.
 Hive Compiler - The purpose of the compiler is to parse the query and perform semantic
analysis on the different query blocks and expressions. It converts HiveQL statements into
MapReduce jobs.
 Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG of
MapReduce tasks and HDFS tasks. The execution engine then executes the incoming tasks in the
order of their dependencies.

 Comparison with Hive and Traditional database

 Schema on READ vs schema on WRITE – Hive does not verify the schema while the data is loaded;
a traditional database enforces the table schema at data load time, and data that does not conform
to the schema is rejected.
 Hive is very easily scalable at low cost; a traditional database is not very scalable, and scaling up
is costly.
 Hive is based on the Hadoop notion of write once, read many times; in a traditional database, data
can be read and written many times.
 Record-level updates are not possible in Hive; record-level updates, insertions, deletes, transactions
and indexes are possible in a traditional database.
 OLTP (On-line Transaction Processing) is not yet supported in Hive, although OLAP (On-line
Analytical Processing) is; both OLTP and OLAP are supported in an RDBMS.
 Hive can store both normalized and de-normalized data; a traditional database stores normalized data.
 Hive partitions data using its partitioning (and bucketing) mechanism; a traditional database partitions
data using sharding.
 Hive is used to maintain a data warehouse; a traditional database is used to maintain a database.

 HiveQL
Hive Query Language (HiveQL) is a query language in Apache Hive for processing and analysing
structured data. It separates users from the complexity of Map Reduce programming. It reuses common
concepts from relational databases, such as tables, rows, columns, and schema, to ease learning. Hive
provides a CLI for Hive query writing using Hive Query Language (HiveQL). Generally, HiveQL syntax
is similar to the SQL syntax that most data analysts are familiar with. Hive supports four file formats, which
are TEXTFILE, SEQUENCEFILE, ORC and RCFILE (Record Columnar File). Hive uses the Derby database
for single-user metadata storage, and for the multi-user or shared metadata case, Hive uses MySQL.

 Hive Tables
Hive is designed for querying and managing only structured data stored in tables. Fundamentally, Hive
knows two different types of tables: Internal table and the External table. The Internal table is also known
as the managed table.

1. Hive Internal Table

 Hive owns the data for the internal tables.


 It is the default table in Hive. When the user creates a table in Hive without specifying it as external,
then by default, an internal table gets created in a specific location in HDFS.
 By default, an internal table will be created in a folder path similar to /user/hive/warehouse
directory of HDFS. We can override the default location by the location property during table
creation.
 If we drop the managed table or partition, the table data and the metadata associated with that table
will be deleted from the HDFS.

2. Hive External Table

 Hive does not manage the data of the External table.


 We create an external table when we want to use the data outside Hive as well.
 External tables are stored outside the warehouse directory. They can access data stored in sources
such as remote HDFS locations or Azure Storage Volumes.
 Whenever we drop the external table, then only the metadata associated with the table will get
deleted, the table data remains untouched by Hive.
 We can create the external table by specifying the EXTERNAL keyword in the Hive create table
statement.
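As a sketch, the two kinds of table can be created as follows (the table names, columns and external location are made up for the example):

    -- Managed (internal) table: data lives under the warehouse directory and is
    -- deleted together with the metadata on DROP TABLE.
    CREATE TABLE managed_logs (id INT, msg STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    -- External table: Hive only records the location; DROP TABLE removes the
    -- metadata but leaves the files at /data/ext_logs untouched.
    CREATE EXTERNAL TABLE ext_logs (id INT, msg STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/ext_logs';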

 Hive Partitioning and Bucketing

Hive Partition is a way to split a large table into smaller tables based on the values of a column (one
partition for each distinct value), whereas a bucket is a technique to divide the data into a manageable form
(you can specify how many buckets you want). Both partitioning and bucketing in Hive are used to
improve performance by eliminating table scans when dealing with a large set of data on a Hadoop file
system (HDFS). The major difference between partitioning and bucketing lies in the way they split
the data.

Hive Partition is a way to organize large tables into smaller logical tables based on values of columns; one
logical table (partition) for each distinct value. In Hive, tables are created as a directory on HDFS. A table
can have one or more partitions that correspond to a sub-directory for each partition inside a table directory.

Hive Bucketing, a.k.a. clustering, is a technique to split the data into more manageable files (by
specifying the number of buckets to create). The values of the bucketing column are hashed into a
user-defined number of buckets.

Bucketing can be created on just one column; you can also create bucketing on a partitioned table to further
split the data which further improves the query performance of the partitioned table.

Each bucket is stored as a file within the table’s directory or the partitions directories. Note that partition
creates a directory and you can have a partition on one or more columns; these are some of the differences
between Hive partition and bucket.
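A minimal sketch of a table that is both partitioned and bucketed (the table, columns, values and the staging_sales source table are all hypothetical):

    -- one sub-directory per distinct sale_date value; within each partition the
    -- rows are hashed on cust_id into 8 bucket files
    CREATE TABLE sales (cust_id INT, amount DOUBLE)
    PARTITIONED BY (sale_date STRING)
    CLUSTERED BY (cust_id) INTO 8 BUCKETS;

    -- load data into one specific (static) partition
    INSERT INTO TABLE sales PARTITION (sale_date = '2024-01-01')
    SELECT cust_id, amount FROM staging_sales;

On older Hive versions, SET hive.enforce.bucketing = true; may be needed before inserting into a bucketed table.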



 Differences Between Hive Partitioning vs Bucketing
 Partitioning: a directory is created on HDFS for each partition. Bucketing: a file is created on HDFS
for each bucket.
 Partitioning: you can have one or more partition columns. Bucketing: you can have only one
bucketing column.
 Partitioning: you cannot manage the number of partitions to create. Bucketing: you can manage the
number of buckets to create by specifying the count.
 Partitioning: not applicable. Bucketing: bucketing can be created on a partitioned table.
 Partitioning uses PARTITIONED BY. Bucketing uses CLUSTERED BY.

 Storage Format in Hive

There are two dimensions that govern table storage in Hive: the row format and the file format. The row
format dictates how rows, and the fields in a particular row, are stored. The file format dictates the container
format for fields in a row. The simplest format is a plain-text file, but there are row-oriented and column-
oriented binary formats available, too. When you create a table with no ROW FORMAT or STORED AS
clauses, the default format is delimited text with one row per line. Hive has native support for the Parquet,
RCFile, and ORCFile column oriented binary formats.
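For example, a default delimited-text table and an ORC table can be declared as follows (the table and column names are made up):

    -- default format: delimited text, one row per line
    CREATE TABLE logs_text (id INT, msg STRING);

    -- the same columns stored in the column-oriented ORC binary format
    CREATE TABLE logs_orc (id INT, msg STRING)
    STORED AS ORC;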

 Querying Data and user Defined Functions


Built-in functions serve specific purposes, performing mathematical, arithmetic, logical and relational
operations on the operands of table column names. These functions are already available in Hive. First,
we check the application requirement, and then we can use these built-in functions in our applications,
calling them directly.

Types of Built-in Functions in HIVE

 Collection Functions
 Date Functions
 Mathematical Functions
 Conditional Functions
 String Functions
 Misc. Functions

In Hive, users can also define their own functions to meet certain client requirements. These are known as
UDFs (user-defined functions) in Hive, and they are written in Java for specific modules.

Some UDFs are specifically designed for the reusability of code in application frameworks. The
developer develops these functions in Java and integrates the UDFs with Hive. During query
execution, the developer can directly use the code, and the UDFs will return outputs according to the
user-defined tasks. This provides high performance in terms of coding and execution.

For example, for string stemming we don’t have any predefined function in Hive. For this, we can write
stem UDF in Java. Wherever we require Stem functionality, we can directly call this Stem UDF in Hive.

Depending on the use cases, the UDFs can be written. It will accept and produce different numbers of input
and output values. The general type of UDF will accept a single input value and produce a single output
value. If the UDF is used in the query, then UDF will be called once for each row in the result data set.

 There are three kinds of UDFs in Hive:

1. Regular UDF: a regular UDF works on a single row in a table and produces a single row as output. It is
a one-to-one relationship between the input and output of the function.

2. UDAF: user-defined aggregate functions work on more than one row and give a single row as output.

3. UDTF: user-defined table-generating functions work on one row as input and return multiple rows as
output, so here the relation is one-to-many.
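As a sketch, a Java UDF such as the stem example mentioned above would be registered and called from HiveQL roughly like this (the jar path, class name and table are hypothetical):

    -- make the jar containing the UDF available and give the function a name
    ADD JAR /tmp/hive-udfs.jar;
    CREATE TEMPORARY FUNCTION stem AS 'com.example.hive.udf.StemUDF';

    -- use it like any built-in function
    SELECT stem(word) FROM words;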

 Sorting and Aggregating in Hive

The SORT BY clause operates on the column names of Hive tables to sort the output. We can specify DESC
to sort in descending order and ASC for ascending order. SORT BY sorts the rows before feeding them to
the reducer. Sorting always depends on the column type: if the column type is numeric, the rows are sorted
in numeric order; if the column type is string, they are sorted in lexicographical order.

Hive Aggregate Functions are the most used built-in functions that take a set of values and return a single
value, when used with a group, it aggregates all values in each group and returns one value for each group.
Like in SQL, Aggregate Functions in Hive can be used with or without GROUP BY functions however
these aggregation functions are mostly used with GROUP BY hence, here I will cover examples of how to
use aggregation functions with and without applying groups.
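A short sketch of both, on a hypothetical employees table:

    -- per-reducer sorting with SORT BY (use ORDER BY for a single total order)
    SELECT name, salary FROM employees SORT BY salary DESC;

    -- aggregate functions without and with GROUP BY
    SELECT COUNT(*), AVG(salary) FROM employees;
    SELECT dept, COUNT(*) AS cnt, MAX(salary) AS top_salary
    FROM employees
    GROUP BY dept;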

 Map Reduce Scripts

 The map/reduce script is used to process large volumes of data. It works well in instances where the
data can be broken down into smaller parts. When you run this script, a structured framework creates
enough jobs to process all of these smaller parts. This technique does not require user management;
the jobs are managed automatically.
 This script also has the benefit of allowing these jobs to be processed in parallel. While deploying the
script, the user can select the level of parallelism.
 A map/reduce script, like a scheduled script, can be run manually or on a set schedule. Compared to
scheduled scripts, this script has a few advantages. One is that if a map/reduce task breaches certain
features of NetSuite governance, the map/reduce framework will automatically force the job to yield
and its work to be rescheduled for a later time without disrupting the script.



 Map/reduce should be used in any circumstance where you need to handle multiple records and your
logic may be broken down into little chunks. Map/reduce, on the other hand, is not suitable for instances
in which you need to execute sophisticated functions for each part of your data collection. The loading
and saving of various records are part of a lengthy process.

 Joins & Subqueries in Hive

 Joins

In Apache Hive, we use the JOIN clause (the HiveQL SELECT ... JOIN query) to combine specific fields
from two tables by using values common to each one. In other words, to combine records from two or
more tables in the database we use the JOIN clause. It is more or less similar to SQL JOIN, and we use it
to combine rows from multiple tables.

Moreover, there are some points we need to observe about Hive Join:

 In Hive joins, only equality join conditions are allowed.
 However, more than two tables can be joined in the same query.
 LEFT, RIGHT and FULL OUTER joins exist in order to offer more control over the ON clause for
rows that have no match.
 Also, note that Hive joins are not commutative.
 Joins in Hive are left-associative, whether they are LEFT or RIGHT joins.

 Type of Joins in Hive

Basically, there are 4 types of Hive Join. Such as:

1. Inner Join in Hive
2. Left Outer Join in Hive
3. Right Outer Join in Hive
4. Full Outer Join in Hive

a. Inner Join
Basically, to combine and retrieve the records from multiple tables we use the Hive JOIN clause; in SQL,
JOIN is the same as INNER JOIN. The JOIN condition is typically expressed using the primary keys and
foreign keys of the tables.
b. Left Outer Join
HiveQL Left Outer Join returns all the rows from the left table, even if there are no matches in the right
table. Even if the ON clause matches 0 (zero) records in the right table, this Hive JOIN still returns a row
in the result, although with NULL in each column from the right table. In other words, it returns all the
values from the left table, plus the matched values from the right table, or NULL in case of no matching
JOIN predicate.



c. Right Outer Join
Basically, even if there are no matches in the left table, HiveQL Right Outer Join returns all the rows from
the right table. To be more specific, even if the ON clause matches 0 (zero) records in the left table, this
Hive JOIN still returns a row in the result, although with NULL in each column from the left table. In
other words, it returns all the values from the right table, plus the matched values from the left table, or
NULL in case of no matching join predicate.

d. Full Outer Join


The major purpose of this HiveQL Full outer Join is it combines the records of both the left and the right
outer tables which fulfils the Hive JOIN condition. Moreover, this joined table contains either all the
records from both the tables or fills in NULL values for missing matches on either side.
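The four join types can be sketched on two hypothetical tables, customers(id, name) and orders(cust_id, amount):

    SELECT c.name, o.amount
    FROM customers c JOIN orders o ON (c.id = o.cust_id);             -- inner join

    SELECT c.name, o.amount
    FROM customers c LEFT OUTER JOIN orders o ON (c.id = o.cust_id);  -- all customers

    SELECT c.name, o.amount
    FROM customers c RIGHT OUTER JOIN orders o ON (c.id = o.cust_id); -- all orders

    SELECT c.name, o.amount
    FROM customers c FULL OUTER JOIN orders o ON (c.id = o.cust_id);  -- both sides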

 Subqueries

A Query present within a Query is known as a sub query. The main query will depend on the values returned
by the subqueries. Hive supports subqueries in FROM clauses and in WHERE clauses of SQL statements.
A subquery is a SQL expression that is evaluated and returns a result set. Then that result set is used to
evaluate the parent query. The parent query is the outer query that contains the child subquery. Subqueries
in WHERE clauses are supported in Hive 0.13 and later.

Subqueries can be classified into two types

 Subqueries in FROM clause


 Subqueries in WHERE clause

When to use:

 To get a particular value combined from two column values from different tables.
 When the values of one table depend on the values of other tables.
 For comparative checking of one column's values against other tables.
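Two small sketches, using the same hypothetical customers and orders tables as in the join examples:

    -- subquery in the FROM clause: aggregate first, then filter the derived table
    SELECT t.cust_id, t.total
    FROM (SELECT cust_id, SUM(amount) AS total FROM orders GROUP BY cust_id) t
    WHERE t.total > 10000;

    -- subquery in the WHERE clause (supported in Hive 0.13 and later)
    SELECT name FROM customers
    WHERE id IN (SELECT cust_id FROM orders);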

HBase

 HBase
HBase is an open-source, column-oriented distributed database system in a Hadoop environment. It is
modelled on Google's Bigtable and is primarily written in Java. Apache HBase is needed for real-time Big
Data applications. HBase can store massive amounts of data, from terabytes to petabytes. The tables
present in HBase can consist of billions of rows having millions of columns. HBase is built for low-latency
operations and has some specific features compared to traditional relational models. HBase provides a
fault-tolerant way of storing sparse data sets, which are common in many big data use cases. It is well
suited for real-time data processing or random read/write access to large volumes of data.

HBase relies on Zookeeper for high-performance coordination. ZooKeeper is built into HBase, but if you’re
running a production cluster, it’s suggested that you have a dedicated ZooKeeper cluster that’s integrated
with your HBase cluster. HBase works well with Hive, a query engine for batch processing of big data, to
enable fault-tolerant big data applications.

 Features of HBase



i. Consistency

We can use this HBase feature for high-speed requirements because it offers consistent reads and
writes.

ii. Atomic Read and Write

During one read or write process, all other processes are prevented from performing any read or
write operations this is what we call Atomic read and write. So, HBase offers atomic read and
write, on a row level.

iii. Sharding

In order to reduce I/O time and overhead, HBase offers automatic and manual splitting of
regions into smaller subregions, as soon as it reaches a threshold size.

iv. High Availability

Moreover, it offers failover and recovery across LAN and WAN. Basically, there is a master
server, at the core, which handles monitoring of the region servers as well as all metadata
for the cluster.

v. Client API

Through Java APIs, it also offers programmatic access.

vi. Scalability

In both linear and modular form, HBase supports scalability. In addition, we can say it is linearly
scalable.

vii. Hadoop/HDFS integration

HBase runs on top of HDFS, and it can also run on top of other file systems.

viii. Distributed storage

This feature of HBase supports distributed storage such as HDFS.

ix. Data Replication

HBase supports data replication across clusters.

x. Failover Support and Load Sharing

HDFS is internally distributed and automatically recovered, using multiple block allocation and
replication, and since HBase runs on top of HDFS, HBase is automatically recovered as well. This
failover is also facilitated using region server replication.

xi. API Support

Because of Java APIs support in HBase, clients can access it easily.



xii. MapReduce Support

For parallel processing of large volume of data, HBase supports MapReduce.

xiii. Backup Support

In HBase “Backup support” means it supports back-up of Hadoop MapReduce jobs in HBase
tables.

xiv. Sorted Row Keys

Since searching is done on a range of rows, and HBase stores row keys in lexicographical order,
we can build an optimized read request by using these sorted row keys and timestamps.

xv. Real-time Processing

In order to perform real-time query processing, HBase supports block cache and Bloom filters.

xvi. Faster Lookups

When it comes to faster lookups, HBase internally uses hash tables and offers random access, and
it stores the data in indexed HDFS files.

xvii. Type of Data

For both semi-structured as well as structured data, HBase supports well.

xviii. Schema-less

There is no concept of a fixed columns schema in HBase because it is schema-less. Hence, it
defines only column families.

xix. High Throughput

Due to high security and easy management characteristics of HBase, it offers unprecedented
high write throughput.

xx. Easy to use Java API for Client Access

When it comes to programmatic access, HBase offers an easy-to-use Java API.

xxi. Thrift gateway and a RESTful Web services

For non-Java front-ends, HBase supports Thrift and REST API.

 HBase Data Model

The HBase data model is designed to handle semi-structured data that may vary in field size, data
type and number of columns. The data model's layout partitions the data into simpler components
and spreads them across the cluster. HBase's data model consists of various logical components,
such as tables, rows, columns, column families, column qualifiers, cells, and versions.



Table: An HBase table is made up of several rows and columns. Tables in HBase are defined
upfront, at the time of schema specification.

Row: An HBase row consists of a row key and one or more columns with values associated with
them. Row keys are uninterpreted bytes. Rows are ordered lexicographically by row key, with the
row that has the lowest key appearing first in a table. The layout of the row key is therefore very
critical for this purpose.

Column: A column in HBase consists of a column family and a column qualifier, which are
separated by a : (colon) character.

Column Family: Apache HBase columns are grouped into column families. A column family
physically co-locates a set of columns and their values, which improves performance. Every row
in a table has the same column families, but a given row may store nothing in some of its column
families.

Column Qualifier: A column qualifier is added to a column family to identify a particular piece of
data. For example, for a column family content, the column qualifiers could be content:html and
content:pdf. Although column families are fixed at table creation, column qualifiers are mutable
and may vary greatly from row to row.

Cell: A cell stores data and is uniquely identified by the combination of row key, column family,
and column qualifier. The data stored in a cell is called its value, and the value is always treated as
an array of bytes.

Timestamp: A timestamp is written alongside each value and is the identifier for a given version
of the value. By default the timestamp reflects the time when the data was written on the Region
Server, but when we put data into the cell, we can assign a different timestamp value.
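The Java client API (mentioned earlier under the HBase features) exposes this model directly. A minimal sketch, assuming a table named users with a column family info already exists on the cluster:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseDataModelExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) {

                // a cell is addressed by row key + column family + column qualifier (+ timestamp)
                Put put = new Put(Bytes.toBytes("row1"));                 // row key
                put.addColumn(Bytes.toBytes("info"),                      // column family
                              Bytes.toBytes("email"),                     // column qualifier
                              Bytes.toBytes("user1@example.com"));        // value (always bytes)
                table.put(put);

                // read the cell back; the newest version is returned by default
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
                System.out.println(Bytes.toString(value));
            }
        }
    }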

 The various client options for Interaction with an HBase Cluster



 Difference between RDBMS and HBase

1. SQL: An RDBMS requires SQL (Structured Query Language); SQL is not required in HBase.
2. Schema: An RDBMS has a fixed schema; HBase does not have a fixed schema and allows columns to
be added on the fly.
3. Database type: An RDBMS is a row-oriented database; HBase is a column-oriented database.
4. Scalability: An RDBMS scales up – rather than adding new servers, the current server is upgraded to a
more capable one whenever more memory, processing power or disc space is required. HBase scales
out – when extra memory and disc space are needed, new servers are added to the cluster rather than
upgrading the existing ones.
5. Nature: An RDBMS is static in nature; HBase is dynamic in nature.
6. Data retrieval: Retrieval of data is slower in an RDBMS and faster in HBase.
7. Rule: An RDBMS follows the ACID (Atomicity, Consistency, Isolation, Durability) properties; HBase
follows the CAP (Consistency, Availability, Partition-tolerance) theorem.
8. Type of data: An RDBMS handles structured data; HBase can handle structured, unstructured as well
as semi-structured data.
9. Sparse data: An RDBMS cannot handle sparse data; HBase can handle sparse data.
10. Volume of data: In an RDBMS the amount of data is determined by the server's configuration; in HBase
the amount of data depends on the number of machines deployed rather than on a single machine.
11. Transaction integrity: In an RDBMS there is mostly a guarantee associated with transaction integrity;
in HBase there is no such guarantee.
12. Referential integrity: Referential integrity is supported by an RDBMS; HBase has no built-in support
for it.
13. Normalization: In an RDBMS you can normalize the data; the data in HBase is not normalized, which
means there is no logical relationship or connection between distinct tables of data.
14. Table size: An RDBMS is designed to accommodate small tables and is difficult to scale; HBase is
designed to accommodate large tables and can scale horizontally.

 HBase Schema Design



An HBase table can scale to billions of rows and any number of columns, based on your requirements. Such
a table allows you to store terabytes of data in it. The HBase table supports high read and write throughput
at low latency. A single value in each row is indexed; this value is known as the row key.

 HBase Table Schema Design General Concepts

The HBase schema design is very different from relational database schema design. Below are
some general concepts that should be followed while designing a schema in HBase:

 Row key: Each table in HBase is indexed on the row key, and data is sorted lexicographically by
this row key. There are no secondary indexes available on an HBase table.
 Atomicity: Avoid designing tables that require atomicity across all rows. All operations on
HBase rows are atomic at the row level.
 Even distribution: Reads and writes should be uniformly distributed across all nodes available in
the cluster. Design the row key in such a way that related entities are stored in adjacent rows to
increase read efficiency.

HBase Schema Row Key, Column Family, Column Qualifier, Individual Value and Row Size Limits

Consider the size limits below when designing a schema in HBase:

 Row keys: 4 KB per key


 Column families: not more than 10 column families per table
 Column qualifiers: 16 KB per qualifier
 Individual values: less than 10 MB per cell
 All values in a single row: max 10 MB

 HBase Row Key Design

When choosing a row key for an HBase table, you should design the table in such a way that there is no
hot-spotting. To get the best performance out of an HBase cluster, you should design a row key that allows
the system to write evenly across all the nodes. A poorly designed row key can cause a full table scan
when you request some data out of the table.

 Advance Indexing in HBase

Advanced indexing is triggered when obj is :

 an ndarray of type integer or Boolean


 or a tuple with at least one sequence object
 is a non-tuple sequence object

Advanced indexing returns a copy of data rather than a view of it.

There are two types of advanced indexing − Integer and Boolean.

 Integer Indexing

This mechanism helps in selecting any arbitrary item in an array based on its Ndimensional index. Each
integer array represents the number of indexes into that dimension. When the index consists of as many
integer arrays as the dimensions of the target ndarray, it becomes straightforward.



 Boolean Array Indexing

This type of advanced indexing is used when the resultant object is meant to be the result of Boolean
operations, such as comparison operators.

Zookeeper

 Zookeeper

ZooKeeper is an open-source Apache project that provides a centralized service for maintaining
configuration information, naming, synchronization and group services over large clusters in distributed
systems. ZooKeeper is used in distributed systems to coordinate distributed processes and services. It
provides a simple, tree-structured data model, a simple API, and a distributed protocol to ensure data
consistency and availability. ZooKeeper is designed to be highly reliable and fault-tolerant, and it can
handle high levels of read and write throughput.

Zookeeper is implemented in Java and is widely used in distributed systems, particularly in the Hadoop
ecosystem. The goal is to make these systems easier to manage with improved, more reliable propagation
of changes.

 ZooKeeper Characteristics

 ZooKeeper is simple

ZooKeeper is, at its core, a stripped-down filesystem that exposes a few simple operations and
some extra abstractions, such as ordering and notifications.

 ZooKeeper is expressive

The ZooKeeper primitives are a rich set of building blocks that can be used to build a large class
of coordination data structures and protocols. Examples include distributed queues, distributed
locks, and leader election among a group of peers.

 ZooKeeper is highly available

ZooKeeper runs on a collection of machines and is designed to be highly available, so
applications can depend on it. ZooKeeper can help you avoid introducing single points of failure
into your system, so you can build a reliable application.

 ZooKeeper facilitates loosely coupled interactions

ZooKeeper interactions support participants that do not need to know about one another. For
example, ZooKeeper can be used as a rendezvous mechanism so that processes that otherwise
don’t know of each other’s existence (or network details) can discover and interact with one
another. Coordinating parties may not even be contemporaneous, since one process may leave a
message in ZooKeeper that is read by another after the first has shut down.

 ZooKeeper is a library

ZooKeeper provides an open source, shared repository of implementations and recipes of common
coordination patterns. Individual programmers are spared the burden of writing common protocols
themselves (which is often difficult to get right). Over time, the community can add to and improve
the libraries, which is to everyone’s benefit.



 Zookeeper Helps in Monitoring a Cluster

Apache ZooKeeper is an open-source server that reliably coordinates distributed processes and
applications. It allows distributed processes to coordinate with each other through a shared hierarchal
namespace which is organized similarly to a standard file system. Apache ZooKeeper provides a
hierarchical file system (with ZNodes as the system files) that helps with the discovery, registration,
configuration, locking, leader selection, queueing, etc. of services working in different machines.
ZooKeeper server maintains configuration information, naming, providing distributed synchronization, and
providing group services, used by distributed applications.

Applications Manager's ZooKeeper monitoring aims to help administrators manage their ZooKeeper
server - collect all the metrics that can help with troubleshooting, display performance graphs and be alerted
automatically of potential issues. In order to keep track of your ZooKeeper server's overall operation
efficiency, monitor key performance metrics such as:

 Resource utilization details


 Thread and JVM usage
 Performance statistics
 Cluster and configuration details

 Resource utilization details

Automatically discover ZooKeeper clusters, monitor memory (heap and non-heap) on the Znode, and get
alerts of changes in resource consumption. Automatically collect, graph and get alerts on garbage collection
iterations, heap size, system usage, and threads. ZooKeeper hosts are deployed in a cluster and, as long as
a majority of hosts are up, the service will be available. Applications Manager's ZooKeeper monitoring
helps make sure the total node count inside the ZooKeeper tree is consistent.

 Thread and JVM usage

Analyze JVM thread dumps with Apache ZooKeeper monitoring to pinpoint the root cause of performance
issues for troubleshooting. Track thread usage with ZooKeeper monitoring metrics like daemon, peak and
live thread count. Ensure that started threads don't overload the server's memory.

 Performance statistics

With our ZooKeeper monitor, gauge the amount of time it takes for the server to respond to a client
request, queued requests and connections in the server and performance degradation due to network usage
(client packets sent and received). Get a consistent view of ZooKeeper performance, regardless of
whether hosts change roles from Follower to Leader or back.

 Cluster and configuration details

Track the number of Znodes, the number of watchers setup over the nodes, and the number of followers
within the ensemble. Keep an eye on the leader selection stats and client session times. Know where the
Leader is for a quorum, and when there is a change in Leaders. Get alerts on the number of active, connected
sessions, and measure the growth rate over a specific time period.

 Build Applications with Zookeeper

 A Configuration Service



One of the most basic services that a distributed application needs is a configuration service, so that
common pieces of configuration information can be shared by machines in a cluster. At the simplest level,
ZooKeeper can act as a highly available store for configuration, allowing application participants to
retrieve or update configuration files. Using ZooKeeper watches, it is possible to create an active
configuration service, where interested clients are notified of changes in configuration.

The contract of the write() method is that a key with the given value is written to ZooKeeper. It hides the
difference between creating a new znode and updating an existing znode with a new value by testing first
for the znode using the exists operation and then performing the appropriate operation.
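A minimal sketch of such a write() method is shown below; the class name and the already-connected zk handle are assumptions for the example, and retry logic is omitted:

    import java.nio.charset.StandardCharsets;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ActiveKeyValueStore {
        private final ZooKeeper zk;   // assumed to be already connected

        public ActiveKeyValueStore(ZooKeeper zk) {
            this.zk = zk;
        }

        // Write a value to the given znode, creating the znode if it does not exist yet.
        public void write(String path, String value) throws KeeperException, InterruptedException {
            byte[] data = value.getBytes(StandardCharsets.UTF_8);
            Stat stat = zk.exists(path, false);        // test first for the znode ...
            if (stat == null) {
                zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } else {
                zk.setData(path, data, -1);            // -1 means "match any version"
            }
        }
    }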

 The Resilient ZooKeeper Application

The first of the Fallacies of Distributed Computing states that “the network is reliable.” As they stand, our
programs so far have been assuming a reliable network, so when they run on a real network, they can fail
in several ways. Below we consider some possible failure modes and what we can do to correct them so that
our programs are resilient in the face of failure. Every ZooKeeper operation in the Java API declares two
types of exception in its throws clause: InterruptedException and KeeperException.

 InterruptedException

An InterruptedException is thrown if the operation is interrupted. There is a standard Java mechanism for
canceling blocking methods, which is to call interrupt() on the thread from which the blocking method was
called. A successful cancellation will result in an InterruptedException. ZooKeeper adheres to this standard,
so you can cancel a ZooKeeper operation in this way. Classes or libraries that use ZooKeeper usually should
propagate the InterruptedException so that their clients can cancel their operations.

 KeeperException

A KeeperException is thrown if the ZooKeeper server signals an error or if there is a communication
problem with the server. For different error cases, there are various subclasses of KeeperException. For
example, KeeperException.NoNodeException is a subclass of KeeperException that is thrown if you try to
perform an operation on a znode that doesn't exist.

There are two ways, then, to handle KeeperException: either catch KeeperException and test its code to
determine what remedying action to take, or catch the equivalent KeeperException subclasses and perform
the appropriate action in each catch block. KeeperExceptions fall into three broad categories.

 State exceptions

A state exception occurs when the operation fails because it cannot be applied to the znode tree. State
exceptions usually happen because another process is mutating a znode at the same time. For example, a
setData operation with a version number will fail with a KeeperException.BadVersionException if the
znode is updated by another process first because the version number does not match.

 Recoverable exceptions

Recoverable exceptions are those from which the application can recover within the same ZooKeeper
session. A recoverable exception is manifested by KeeperException.ConnectionLossException, which
means that the connection to ZooKeeper has been lost. ZooKeeper will try to reconnect, and in most cases
the reconnection will succeed and ensure that the session is intact.

 Unrecoverable exceptions

In some cases, the ZooKeeper session becomes invalid, perhaps because of a timeout or because the session
was closed (both of these scenarios get a KeeperException.SessionExpiredException), or perhaps because
authentication failed (KeeperException.AuthFailedException). In any case, all ephemeral nodes



associated with the session will be lost, so the application needs to rebuild its state before reconnecting to
ZooKeeper.

 A Lock Service
A distributed lock is a mechanism for providing mutual exclusion between a collection of processes. At
any one time, only a single process may hold the lock. Distributed locks can be used for leader election in
a large distributed system, where the leader is the process that holds the lock at any point in time.

To implement a distributed lock using ZooKeeper, we use sequential znodes to impose an order on the
processes vying for the lock. The idea is simple: first, designate a lock znode, typically describing the entity
being locked on (say, /leader); then, clients that want to acquire the lock create sequential ephemeral znodes
as children of the lock znode.

 The herd effects

Although this algorithm is correct, there are some problems with it. The first problem is that this
implementation suffers from the herd effect. Consider hundreds or thousands of clients, all trying to acquire
the lock. Each client places a watch on the lock znode for changes in its set of children. Every time the lock
is released or another process starts the lock acquisition process, the watch fires, and every client receives
a notification. The “herd effect” refers to a large number of clients being notified of the same event when
only a small number of them can actually proceed.

In this case, only one client will successfully acquire the lock, and the process of maintaining and sending
watch events to all clients causes traffic spikes, which put pressure on the ZooKeeper servers. To avoid the
herd effect, the condition for notification needs to be refined. The key observation for implementing locks
is that a client needs to be notified only when the child znode with the previous sequence number goes
away, not when any child znode is deleted (or created).
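A compact sketch of this refined approach is shown below; the lock path /leader comes from the earlier description, while the class name and the connected zk handle are assumptions, and handling of connection loss (discussed next) is omitted:

    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class SimpleLock {
        private static final String LOCK = "/leader";
        private final ZooKeeper zk;   // assumed to be already connected

        public SimpleLock(ZooKeeper zk) {
            this.zk = zk;
        }

        public void acquire() throws KeeperException, InterruptedException {
            // create our ephemeral sequential child of the lock znode
            String path = zk.create(LOCK + "/lock-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            String node = path.substring(path.lastIndexOf('/') + 1);
            while (true) {
                List<String> children = zk.getChildren(LOCK, false);
                Collections.sort(children);
                int index = children.indexOf(node);
                if (index == 0) {
                    return;                            // lowest sequence number: we hold the lock
                }
                // watch only the znode immediately before ours, not the whole set of children
                String predecessor = LOCK + "/" + children.get(index - 1);
                CountDownLatch latch = new CountDownLatch(1);
                if (zk.exists(predecessor, event -> latch.countDown()) != null) {
                    latch.await();                     // woken when the predecessor goes away
                }
            }
        }
    }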

 Recoverable exceptions

Another problem with the lock algorithm as it stands is that it doesn’t handle the case when the create
operation fails due to connection loss. Recall that in this case we do not know whether the operation
succeeded or failed. Creating a sequential znode is a nonidempotent operation, so we can’t simply retry,
because if the first create had succeeded we would have an orphaned znode that would never be deleted
(until the client session ended, at least). Deadlock would be the unfortunate result.

The problem is that after reconnecting, the client can’t tell whether it created any of the child znodes. By
embedding an identifier in the znode name, if it suffers a connection loss, it can check to see whether any
of the children of the lock node have its identifier in their names. If a child contains its identifier, it knows
that the create operation succeeded, and it shouldn’t create another child znode. If no child has the identifier
in its name, the client can safely create a new sequential child znode.

 Unrecoverable exceptions

If a client’s ZooKeeper session expires, the ephemeral znode created by the client will be deleted,
effectively relinquishing the lock (or at least forfeiting the client’s turn to acquire the lock). The application
using the lock should realize that it no longer holds the lock, clean up its state, and then start again by
creating a new lock object and trying to acquire it. Notice that it is the application that controls this process,
not the lock implementation, since it cannot second-guess how the application needs to clean up its state.

