BDA Unit-5
Basics of Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
It resides on top of Hadoop to summarize Big Data and makes querying and analysing easy.
It is a platform used to develop SQL-type scripts to perform MapReduce operations.
Initially Hive was developed by Facebook; later the Apache Software Foundation took it up
and developed it further as an open source project under the name Apache Hive.
It is used by different companies. For example, Amazon uses it in Amazon Elastic
MapReduce.
Hive is not:
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive
It stores the schema in a database and the processed data in HDFS.
It is designed for OLAP.
It provides SQL type language for querying called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.
Hive Architecture
Hive Components
Shell
Meta Store
Compiler
Execution Engine
Driver
Hive Components
Hive is data warehouse infrastructure software that enables interaction between the user and HDFS.
The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HD
Insight (In Windows server).
Hive uses a database server to store the schema or metadata of tables, databases, columns in a table, their data types, and their HDFS mapping.
HiveQL is similar to SQL for querying schema information in the metastore.
It is a replacement for the traditional approach of writing MapReduce programs: instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.
The component that connects the HiveQL process engine and MapReduce is the Hive execution engine. The execution engine processes the query and generates the same results as MapReduce would.
The Hadoop Distributed File System (HDFS) or HBase is the data storage technique used to store data in the file system.
Hive Architecture
Hive Command-Line Interface (Hive CLI): The most commonly used interface to interact
with Hive.
Hive Web Interface: A simple graphical user interface to interact with Hive and to execute queries.
Hive Server: This is an optional server. This can be used to submit Hive Jobs from a
remote client.
JDBC/ODBC: Jobs can be submitted from a JDBC Client. One can write a Java code to
connect to Hive and submit jobs on it.
Driver: Hive queries are sent to the driver for compilation, optimization and execution.
Meta-store: Hive table definitions and mappings to the data are stored in a Meta-store.
A Meta-store consists of the following:
Meta-store Service: Offers an interface to Hive.
Database: Stores data definitions, mappings to the data, and other information.
Embedded Metastore
This metastore is mainly used for unit tests, and it is the default metastore for Hive.
Only one process is allowed to connect to the metastore at a time.
It uses an Apache Derby database. In this mode, both the database and the metastore service run embedded in the main Hive Server process.
Local Metastore
Metadata can be stored in any RDBMS component like MySQL.
Local meta-store allows multiple connections at a time. In this mode, the Hive meta-
store service runs in the main Hive Server process, but the meta-store database runs
in a separate process, and can be on a separate host.
Remote Metastore
In this, the Hive driver and the meta-store interface run on different JVMs (which can
run on different machines as well).
This way the database can be firewalled off from the Hive user, and the database credentials are completely isolated from the users of Hive.
HiveQL:
Hive's SQL dialect is known as HiveQL; it is a combination of SQL-92, Oracle's SQL dialect, and MySQL.
HiveQL provides some improved features from later SQL standards, such as analytic functions from SQL:2003.
Some Hive extensions, like multitable inserts and the TRANSFORM, MAP, and REDUCE clauses, are inspired by MapReduce.
Tables
A Hive table is logically made up of the data being stored and the associated metadata describing the layout of the data in the table. The data normally resides in HDFS, although it may reside in any Hadoop filesystem, including the local filesystem or S3.
A relational database is used to store Hive metadata.
Two types of table can be created in Hive: managed table and external table.
When a table is created in Hive, by default Hive manages the data, which means that Hive moves the data into its warehouse directory. Such a table is known as a managed table.
You can also create an external table, which tells Hive that the data is stored at a location outside the warehouse directory.
The difference between these two types of table can be seen in the LOAD and DROP semantics.
First, let's take the example of a managed table.
When data is loaded into a managed table, it is moved into Hive's warehouse directory.
For example:
CREATE TABLE managed_tbl (dummy STRING);
LOAD DATA INPATH '/user/jerry/test.txt' INTO table managed_tbl;
The above query moves the file hdfs://user/jerry/test.txt into Hive's warehouse directory for the managed_tbl table, which is hdfs://user/hive/warehouse/managed_tbl.
In Hive, the directory name in the warehouse is the same as the managed table name.
To drop the table, the following query is used:
DROP TABLE managed_tbl;
After the above query is executed, the table, along with its data and metadata, is deleted.
The LOAD operation performs a move and the DROP performs a delete, after which the data no longer exists. This is what it means for Hive to manage the data.
An external table behaves differently: you control the creation and deletion of the data. The location of the external data is specified at table creation time, as follows:
CREATE EXTERNAL TABLE external_tbl (dummy STRING)
LOCATION '/user/jerry/external_tbl';
LOAD DATA INPATH '/user/jerry/test.txt' INTO TABLE external_tbl;
The keyword EXTERNAL tells Hive that it is not managing the data; hence the data is not moved to the warehouse directory. Hive does not even check whether the external location exists at the time the table is defined.
This is a useful feature, because it means the data can be created after the table is created.
When the external table is dropped, Hive only deletes the metadata and leaves the data
untouched.
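For example, dropping the external table defined above removes only its definition from the metastore, and the files under /user/jerry/external_tbl are left in place:
DROP TABLE external_tbl;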
Importing Data
Data can be imported using LOAD DATA, INSERT, and CREATE TABLE ... AS SELECT.
1. LOAD DATA
It imports the data by moving or copying files to the table's directory.
LOAD DATA is used as follows to import data :
CREATE TABLE test (dummy STRING);
LOAD DATA INPATH '/user/jerry/demo.txt' INTO TABLE test;
2. Inserts
INSERT statements are used as follows to import data into a table:
INSERT OVERWRITE TABLE target_tbl
SELECT col1, col2
FROM source_tbl;
In the above query, the OVERWRITE keyword specifies that the contents of target_tbl are replaced by the results of the SELECT statement.
To append records to an already created table, the INSERT INTO TABLE statement can be used, as shown below.
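A minimal sketch of such an append, reusing the table and column names from the example above:
INSERT INTO TABLE target_tbl
SELECT col1, col2
FROM source_tbl;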
In Hive we can do Multitable insert. For multitable insert, the statement will start as follows:
FROM source_table
INSERT OVERWRITE TABLE target_table
SELECT col1, col2;
Additional INSERT clauses can be added to the statement above; because a single source table feeds multiple target tables, it is called a multitable insert.
Multitable Insert statement is more efficient than multiple INSERT statements because the
source table needs to be scanned only once to produce the multiple disjoint outputs.
Following is an example of Multitable Insert:
FROM records2_tbl
INSERT OVERWRITE TABLE stations_by_year
SELECT year, COUNT(DISTINCT station)
GROUP BY year
INSERT OVERWRITE TABLE records_by_year
SELECT year, COUNT(1)
GROUP BY year
INSERT OVERWRITE TABLE good_records_by_year
SELECT year, COUNT(1)
WHERE temperature != 9999 AND quality IN (0, 1, 4, 5, 9)
GROUP BY year;
The above shows that records2_tbl is the only source table, but three different tables store the results of the different SELECT clauses.
Altering table
In Hive, a table definition can be changed after the table is created, since Hive follows a schema-on-read approach.
A table can be renamed using the ALTER TABLE statement:
ALTER TABLE source_name RENAME TO target_name;
Along with updating the table metadata, the ALTER TABLE statement also moves the table directory so that it reflects the new name. For the above query, the directory /user/hive/warehouse/source_name is renamed to /user/hive/warehouse/target_name.
Hive also permits adding new columns, changing the definition of a column, or replacing the existing columns with a new set of columns.
A new column can be added to an existing table as follows :
ALTER TABLE target_name ADD COLUMNS (col3 STRING);
The new column col3 is added after the existing columns. Since the data files are not updated, queries on col3 will return NULL; Hive does not rewrite existing records.
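As a sketch of the other two forms mentioned above (the column names here are illustrative), a column definition can be changed with CHANGE and the complete column set can be replaced with REPLACE COLUMNS:
ALTER TABLE target_name CHANGE col3 col3_new STRING;
ALTER TABLE target_name REPLACE COLUMNS (col1 STRING, col2 STRING);
Like ADD COLUMNS, these statements change only the table metadata, not the underlying data files.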
Dropping Table
To delete the data and metadata of table the DROP TABLE statement is used. If the table is
external, then only metadata is deleted and the data is left as it is.
The DROP TABLE query can be used as follows :
DROP TABLE test_table;
In case you want to keep the table definition and delete only the data, the TRUNCATE statement is used. The query is as follows:
TRUNCATE TABLE test_table;
Querying Data
MapReduce Scripts
To call an external program or script from Hive, the TRANSFORM, MAP, and REDUCE clauses can be used, following the Hadoop Streaming approach.
The following Python script filters out poor-quality temperature readings:
import re
import sys

for line in sys.stdin:
    (y, t, q) = line.strip().split()
    if t != "9999" and re.match("[01459]", q):
        print("%s\t%s" % (y, t))
The above script can be used in Hive as follows:
hive> ADD FILE /user/jerry/hive-sample/src/main/python/good_quality.py;
hive> FROM records2
> SELECT TRANSFORM(year, temp, quality)
> USING 'good_quality.py'
> AS year, temp;
1950 0
1950 22
1950 -11
1949 111
The above query streams the year, temp, and quality fields as tab-separated lines to the good_quality.py script, and the tab-separated output is parsed into year and temp fields to form the result of the query.
The above example doesn't have any reducers. To specify map and reduce functions, a nested form of the query can be used.
The following example shows a Python script for the map function, which filters out poor-quality temperature readings, together with a Python script for the reduce function, which finds the maximum temperature; the MAP and REDUCE keywords are used in the Hive query (see the sketch after the map script below).
Map function for filtering out poor-quality temperature readings, in Python:
import re
import sys

for line in sys.stdin:
    (y, t, q) = line.strip().split()
    if t != "9999" and re.match("[01459]", q):
        print("%s\t%s" % (y, t))
1. Inner Join
An inner join is the simplest type of join: only rows that match between the input tables appear in the output.
Consider two tables: things, which contains item IDs and their names, and sales, which contains the names of people and the IDs of the items they bought.
The following queries show the contents of both tables:
hive> SELECT * FROM sales;
Jay 3
Paresh 5
Avinash 0
Shri 4
Harshal 3
hive> SELECT * FROM things;
3 Tie
5 Coat
4 Hat
1 Scarf
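The inner join of these two tables can be written as follows (the same query appears again later in the map join discussion); the output shown is what the table contents above would produce:
hive> SELECT sales.*, things.*
> FROM sales JOIN things ON (sales.id = things.id);
Jay 3 3 Tie
Harshal 3 3 Tie
Paresh 5 5 Coat
Shri 4 4 Hat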
By using the predicate in the ON clause, the table in the FROM clause is joined with the table in the JOIN clause.
In Hive, only equality conditions are allowed in the join predicate; in the above example the join matches on the id column of both tables.
2. Outer Join
An outer join allows you to find non-matches in the tables being joined.
With a LEFT OUTER JOIN, the query returns a row for every row in the left table, even if there is no matching row in the table it is being joined to.
In the following example the left table is sales and the right table is things. The LEFT OUTER JOIN can be performed as follows:
hive> SELECT sales.*, things.*
> FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);
Jay 3 3 Tie
Paresh 5 5 Coat
Avinash 0 NULL NULL
Shri 4 4 Hat
Harshal 3 3 Tie
When we performed the inner join, the row for Avinash was not returned; in this example the row for Avinash is returned, and the columns from the things table are NULL because there is no match.
A RIGHT OUTER JOIN is also supported by Hive. Compared to the left outer join, the roles of the tables are reversed: all rows from the things table are included, even if they have no matching row in the left table.
In our example, Scarf is included in the output even though nobody bought it. The query looks like this:
hive> SELECT sales.*, things.*
> FROM sales RIGHT OUTER JOIN things ON (sales.id = things.id);
Jay 3 3 Tie
Harshal 3 3 Tie
Paresh 5 5 Coat
Shri 4 4 Hat
NULL NULL 1 Scarf
Hive also supports a full outer join, in which the output contains rows from both tables, including those that have no match on the other side.
The query looks like this:
hive> SELECT sales.*, things.*
> FROM sales FULL OUTER JOIN things ON (sales.id = things.id);
Avinash 0 NULL NULL
NULL NULL 1 Scarf
Harshal 3 3 Tie
Jay 3 3 Tie
Shri 4 4 Hat
Paresh 5 5 Coat
3. Semi Join
To retrieve all items from the things table that also appear in the sales table, a SEMI JOIN can be used. The query looks like this:
hive> SELECT *
> FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);
3 Tie
4 Hat
5 Coat
Hive supports only the LEFT SEMI JOIN form of semi join; the right-hand table (sales) may appear only in the ON clause, not in the SELECT expression.
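For comparison, the LEFT SEMI JOIN above expresses what an IN subquery would express in conventional SQL; a sketch (recent Hive versions also accept this form directly):
SELECT *
FROM things
WHERE things.id IN (SELECT id FROM sales);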
4. Map Join
Consider the following inner join query :
SELECT sales.*, things.*
FROM sales JOIN things ON (sales.id = things.id);
If one table is small enough to fit in memory, as things is in our example, Hive can load it into memory to perform the join in each mapper. This is known as a map join.
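In older Hive versions a map join had to be requested explicitly with a query hint, as in the sketch below; newer versions can convert a join into a map join automatically when hive.auto.convert.join is enabled:
SELECT /*+ MAPJOIN(things) */ sales.*, things.*
FROM sales JOIN things ON (sales.id = things.id);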
Sub Queries
If a SELECT statement is embedded inside another SQL statement, it is known as a subquery.
Hive has limited support for subqueries: a subquery may appear in the FROM clause of a SELECT statement, or in the WHERE clause in some cases.
Following query retrieves the mean maximum temperature for each year and weather
station:
SELECT station, year, AVG(max_temp)
FROM (
SELECT station, year, MAX(temp) AS max_temp
FROM records2
WHERE temp != 9999 AND quality IN (0, 1, 4, 5, 9)
GROUP BY station, year
) mt
GROUP BY station, year;
In the above query, the FROM subquery finds the maximum temperature for every station and year combination, and the outer query uses the AVG function to find the mean maximum temperature for every station and year.
Because the outer query uses the output of the subquery, the subquery must be given an alias (mt in the above query). The columns of the subquery must have unique names so that the outer query can refer to them.
Basics of Pig
Apache Pig is an abstraction over MapReduce.
It is a tool/platform for analysing large data sets and representing them as data flows.
Pig is commonly used with Hadoop; we can use Apache Pig to perform all data manipulation operations in Hadoop.
Pig provides a high-level language called Pig Latin for writing data analysis programs.
The language offers a variety of operators that programmers can use to develop their own
functions for reading, writing, and processing data.
To analyze data using Apache Pig, programmers must write scripts in Pig Latin.
All of these scripts are converted internally into Map and Reduce tasks.
Apache Pig has a component called Pig Engine that takes Pig Latin scripts as input and
converts these scripts into MapReduce jobs.
Need of Pig in Bigdata
Programmers who are not very good at Java often encounter difficulties using Hadoop, especially when writing MapReduce programs.
Apache Pig is a good option for such programmers: Pig Latin allows them to perform MapReduce tasks easily without writing complex Java code.
Apache Pig uses a multi-query approach, which reduces the length of code.
For example, an operation that requires 200 lines of code (LoC) in Java can be performed by entering as few as 10 lines in Apache Pig, reducing development time by almost a factor of 16.
Pig Latin is a SQL-like language, so it is easy to learn Apache Pig if you are already familiar with SQL.
It has many built-in operators to support data operations such as join, filter, sort, etc.
It also provides nested data types such as tuples, bags, and maps, which MapReduce lacks.
Features & Application of Pig
Rich set of operators
It provides many operators to perform operations like join, sort, filter, etc.
Ease of programming
Pig Latin is similar to SQL and it is easy to write a Pig script if you are good at SQL.
Optimization opportunities
The tasks in Apache Pig optimize their execution automatically, so programmers need to focus only on the semantics of the language.
Extensibility
Using the existing operators, users can develop their own functions to read, process,
and write data.
UDF’s
Pig provides the facility to create User-defined Functions in other programming
languages such as Java and invoke or embed them in Pig Scripts.
Handles all kinds of data
Apache Pig analyzes all kinds of data, both structured as well as unstructured. It
stores the results in HDFS.
Application of Pig
Apache Pig is generally used by data scientists for performing tasks involving ad-hoc
processing and quick prototyping.
Apache Pig is used:
To process huge data sources such as web logs.
To perform data processing for search platforms.
To process time sensitive data loads.
Pig Run Modes
Pig can run in two modes: Local Mode and MapReduce Mode.
Pig Architecture
Parser
Initially, Pig scripts are handled by the Parser. It checks the syntax of the script, does type checking, and performs other miscellaneous checks.
The parser output is a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are represented as nodes and the data flows are represented as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in sorted order, and these MapReduce jobs are executed on Hadoop to produce the desired results.
Pig Latin Data Model
Basics of HBase
HBase is a data model similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data.
HBase is a distributed column-oriented database based on the Hadoop file system.
It is an open source project and can be scaled horizontally.
It uses the fault tolerance of the Hadoop File System (HDFS).
It is part of the Hadoop ecosystem and provides real-time random read/write access to data
in the Hadoop file system.
You can save data in HDFS directly or through HBase. Data consumers randomly read data
in HDFS or access it through HBase.
HBase resides on the Hadoop file system and provides read and write access.
Example: (figure of an HBase table layout with columns col1, col2, and col3.)
Features of HBase
HBase is linearly scalable.
It has automatic failure support.
It provides consistent reads and writes.
It integrates with Hadoop, both as a source and a destination.
It has an easy-to-use Java API for clients.
It provides data replication across clusters.
HBase Architecture
Zookeeper in HBase
Zookeeper is an open-source project that provides services like maintaining configuration
information, naming, providing distributed synchronization, etc.
Zookeeper has temporary nodes representing different region servers. Master servers use
these nodes to discover available servers.
In addition to availability, the nodes are also used to track server failures or network
partitions.
Clients use ZooKeeper to locate region servers and then communicate with them.
In pseudo-distributed and standalone modes, HBase itself takes care of ZooKeeper.
Basics of Zookeeper
Zookeeper is a distributed co-ordination service to manage large set of hosts.
Coordinating and managing a service in a distributed environment is a complicated process.
Zookeeper solves this issue with its simple architecture and API.
Zookeeper allows developers to focus on core application logic without worrying about the
distributed nature of the application.
The ZooKeeper framework was originally built at Yahoo! to make accessing their applications easy and robust. Later, Apache ZooKeeper became a standard for coordination services used by Hadoop, HBase, and other distributed frameworks.
For example, Apache HBase uses ZooKeeper to track the status of distributed data.
Service of Zookeeper
Apache Zookeeper is a service used by a cluster (group of nodes) to coordinate between
themselves and maintain shared data with robust synchronization techniques.
Naming service
Identifying the nodes in a cluster by name. It is similar to DNS, but for nodes.
Configuration Management
Latest and up-to-date configuration information of the system for a joining node.
Cluster management
Joining / leaving of a node in a cluster and node status at real time.
Leader election
Electing a node as leader for coordination purpose.
Locking and synchronization service
Locking the data while modifying it. This mechanism helps in automatic failure recovery while connecting to other distributed applications such as Apache HBase.
Highly reliable data registry
Availability of data even when one or a few nodes are down.
Benefits of Zookeeper
Simple Distributed Coordination Process
Synchronization
Mutual exclusion and co-operation between server processes. This process helps in
Apache HBase for configuration management.
Ordered Messages
Serialization
Encode the data according to specific rules to ensure that your application runs consistently. This approach can be used in MapReduce to coordinate the queue of running threads.
Reliability
Atomicity
Data transfer either succeeds or fails completely; no transaction is partial.
Zookeeper Architecture
Client: Clients, one of the nodes in the distributed application cluster, access information from the server. For a particular time interval, every client sends a message to the server to let the server know that the client is alive. Similarly, the server sends an acknowledgement when a client connects. If there is no response from the connected server, the client automatically redirects the message to another server.
Server: A server, one of the nodes in the ZooKeeper ensemble, provides all the services to clients and gives an acknowledgement to the client to inform it that the server is alive.