Hive
File Formats, Hive Query Language, Hive Partitions, Bucketing, Views, RCFile
Implementation, Hive User Defined Function, SerDe, UDF
__________________________________________________________________________
The Hadoop ecosystem contains different sub-projects (tools), such as Sqoop, Pig, and Hive, that are used to complement the Hadoop modules.
Sqoop: It is used to import and export data between HDFS and RDBMS.
Pig: It is a procedural language platform used to develop a script for MapReduce
operations.
Hive: It is a platform used to develop SQL type scripts to do MapReduce operations.
Note: There are various ways to execute MapReduce operations:
The traditional approach using Java MapReduce program for structured, semi-structured,
and unstructured data.
The scripting approach for MapReduce to process structured and semi-structured data using
Pig.
The Hive Query Language (HiveQL or HQL) for MapReduce to process structured data
using Hive.
What is Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Hive was initially developed by Facebook; later, the Apache Software Foundation took it up and developed it further as an open-source project under the name Apache Hive. It is used by many companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive
It stores schema in a database and processed data into HDFS.
It is designed for OLAP.
It provides an SQL-type language for querying, called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.
Architecture of Hive
The Hive component diagram contains different units. The following describes each unit:
User Interface: Hive is a data warehouse infrastructure software that creates interaction between the user and HDFS. The user interfaces that Hive supports are Hive Web UI, the Hive command line, and Hive HD Insight (on Windows Server).
Meta Store: Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and HDFS mappings.
HiveQL Process Engine: HiveQL is similar to SQL for querying schema information in the Metastore. It is one of the replacements for the traditional approach of writing a MapReduce program in Java; instead, we can write a query for the MapReduce job and process it.
Execution Engine: The conjunction of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates the same results as MapReduce.
HDFS or HBase: The Hadoop Distributed File System or HBase are the data storage techniques used to store data in the file system.
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
The following steps define how Hive interacts with the Hadoop framework:

Step 1 - Execute Query: The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
Step 2 - Get Plan: The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan, or the requirement of the query.
Step 3 - Get Metadata: The compiler sends a metadata request to the Metastore.
Step 4 - Send Metadata: The Metastore sends the metadata as a response to the compiler.
Step 5 - Send Plan: The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
Step 6 - Execute Plan: The driver sends the execute plan to the execution engine.
Step 7 - Execute Job: Internally, the execution process is a MapReduce job. The execution engine sends the job to the JobTracker, which is on the Name node, and it assigns this job to the TaskTracker, which is on a Data node. Here, the query executes the MapReduce job.
Step 8 - Fetch Result: The execution engine receives the results from the Data nodes.
Step 9 - Send Results: The execution engine sends those resultant values to the driver.
Step 10 - Send Results: The driver sends the results to the Hive interfaces.
Hive Commands
Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and user-defined functions (UDFs).
create database
drop database
create table
drop table
alter table
create index
create view
Select
Where
Group By
Order By
Load Data
Join:
   - Inner Join
   - Left Outer Join
   - Right Outer Join
   - Full Outer Join
Hive DDL Commands

Create Database Statement:
hive> CREATE DATABASE [IF NOT EXISTS] userdb;

Drop Database Statement:
hive> DROP DATABASE [IF EXISTS] userdb;
Create a table called Sonoo with two columns, the first being an integer and the other a string.
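A minimal statement matching that description (the column names foo and bar are illustrative):

hive> CREATE TABLE Sonoo (foo INT, bar STRING);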
Create a table called HIVE_TABLE with two columns and a partition column called ds. The partition column is a virtual column. It is not part of the data itself but is derived from the partition that a particular dataset is loaded into. By default, tables are assumed to be of text input format and the delimiters are assumed to be ^A (Ctrl-A).

hive> CREATE TABLE HIVE_TABLE (foo INT, bar STRING) PARTITIONED BY (ds STRING);
To understand the Hive DML commands, let's look at the employee and employee_department tables first.
LOAD DATA

hive> LOAD DATA LOCAL INPATH './usr/Desktop/kv1.txt' OVERWRITE INTO TABLE Employee;
GROUP BY
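A sketch of a GROUP BY query against the employee table used later in this chapter (it counts employees per department):

hive> SELECT Dept, count(*) FROM employee GROUP BY Dept;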
Adding a Partition
We can add partitions to a table by altering the table. Let us assume we have a table
called employee with fields such as Id, Name, Salary, Designation, Dept, and yoj.
Syntax:
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec
[LOCATION 'location1'] partition_spec [LOCATION 'location2'] ...;
partition_spec:
: (p_column = p_col_value, p_column = p_col_value, ...)
The following query is used to add a partition to the employee table.
hive> ALTER TABLE employee
    > ADD PARTITION (year='2012')
    > location '/2012/part2012';
Renaming a Partition
Syntax:
ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION
partition_spec;
The following query is used to rename a partition:
hive> ALTER TABLE employee PARTITION (year='1203')
    > RENAME TO PARTITION (Yoj='1203');
Dropping a Partition
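The standard syntax, with an example that drops the partition renamed above:

ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec, PARTITION partition_spec, ...;

hive> ALTER TABLE employee DROP [IF EXISTS] PARTITION (year='1203');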
The Hive Query Language (HiveQL) is a query language for Hive to process and analyze structured data in a Metastore. This chapter explains how to use the SELECT statement with the WHERE clause.
The SELECT statement is used to retrieve data from a table. The WHERE clause works like a condition: it filters the data using the condition and gives you a finite result. The built-in operators and functions generate an expression that fulfils the condition.
Syntax
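The general form of the statement (standard HiveQL syntax):

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];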
Example
Let us take an example for the SELECT...WHERE clause. Assume we have the employee table as given below, with fields named Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the details of employees who earn a salary of more than Rs 30000.
+------+-------------+--------+-------------------+-------+
| ID   | Name        | Salary | Designation       | Dept  |
+------+-------------+--------+-------------------+-------+
| 1201 | Gopal       | 45000  | Technical manager | TP    |
| 1202 | Manisha     | 45000  | Proofreader       | PR    |
| 1203 | Masthanvali | 40000  | Technical writer  | TP    |
| 1204 | Krian       | 40000  | Hr Admin          | HR    |
| 1205 | Kranthi     | 30000  | Op Admin          | Admin |
+------+-------------+--------+-------------------+-------+
The following query retrieves the employee details using the above scenario:
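The query (it is the same query the JDBC program below submits):

hive> SELECT * FROM employee WHERE salary>30000;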
On successful execution of the query, you get to see the following response:
+------+-------------+--------+-------------------+------+
| ID   | Name        | Salary | Designation       | Dept |
+------+-------------+--------+-------------------+------+
| 1201 | Gopal       | 45000  | Technical manager | TP   |
| 1202 | Manisha     | 45000  | Proofreader       | PR   |
| 1203 | Masthanvali | 40000  | Technical writer  | TP   |
| 1204 | Krian       | 40000  | Hr Admin          | HR   |
+------+-------------+--------+-------------------+------+
JDBC Program
The JDBC program to apply where clause for the given example is as follows.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveQLWhere {
   public static void main(String[] args) throws SQLException {
      // Register the Hive JDBC driver (the HiveServer1 driver matching the jdbc:hive:// URL below)
      try {
         Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
      } catch (ClassNotFoundException e) {
         e.printStackTrace();
         System.exit(1);
      }

      // get connection
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");

      // create statement
      Statement stmt = con.createStatement();

      // execute statement
      ResultSet res = stmt.executeQuery("SELECT * FROM employee WHERE salary>30000;");
      System.out.println("Result:");
      System.out.println(" ID \t Name \t Salary \t Designation \t Dept ");

      while (res.next()) {
         System.out.println(res.getInt(1) + " " + res.getString(2) + " " + res.getDouble(3) + " "
               + res.getString(4) + " " + res.getString(5));
      }
      con.close();
   }
}
Save the program in a file named HiveQLWhere.java. Use the following commands to compile
and execute this program.
$ javac HiveQLWhere.java
$ java HiveQLWhere
Output:
1201  Gopal        45000  Technical manager  TP
1202  Manisha      45000  Proofreader        PR
1203  Masthanvali  40000  Technical writer   TP
1204  Krian        40000  Hr Admin           HR
The ORDER BY clause is used to retrieve details based on one column, sorting the result set in ascending or descending order.
Syntax
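The general form (standard HiveQL syntax):

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[ORDER BY col_list [ASC | DESC]]
[LIMIT number];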
Example
Let us take an example for the SELECT...ORDER BY clause. Assume the employee table as given below, with the fields named Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the employee details ordered by department name.
+------+-------------+--------+-------------------+-------+
| ID   | Name        | Salary | Designation       | Dept  |
+------+-------------+--------+-------------------+-------+
| 1201 | Gopal       | 45000  | Technical manager | TP    |
| 1202 | Manisha     | 45000  | Proofreader       | PR    |
| 1203 | Masthanvali | 40000  | Technical writer  | TP    |
| 1204 | Krian       | 40000  | Hr Admin          | HR    |
| 1205 | Kranthi     | 30000  | Op Admin          | Admin |
+------+-------------+--------+-------------------+-------+
The following query retrieves the employee details using the above scenario:
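The query (it is the same query the JDBC program below submits):

hive> SELECT * FROM employee ORDER BY DEPT;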
On successful execution of the query, you get to see the following response:
+------+-------------+--------+-------------------+-------+
| ID   | Name        | Salary | Designation       | Dept  |
+------+-------------+--------+-------------------+-------+
| 1205 | Kranthi     | 30000  | Op Admin          | Admin |
| 1204 | Krian       | 40000  | Hr Admin          | HR    |
| 1202 | Manisha     | 45000  | Proofreader       | PR    |
| 1201 | Gopal       | 45000  | Technical manager | TP    |
| 1203 | Masthanvali | 40000  | Technical writer  | TP    |
+------+-------------+--------+-------------------+-------+
JDBC Program
Here is the JDBC program to apply Order By clause for the given example.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveQLOrderBy {
   public static void main(String[] args) throws SQLException {
      // Register the Hive JDBC driver (the HiveServer1 driver matching the jdbc:hive:// URL below)
      try {
         Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
      } catch (ClassNotFoundException e) {
         e.printStackTrace();
         System.exit(1);
      }

      // get connection
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");

      // create statement
      Statement stmt = con.createStatement();

      // execute statement
      ResultSet res = stmt.executeQuery("SELECT * FROM employee ORDER BY DEPT;");
      System.out.println(" ID \t Name \t Salary \t Designation \t Dept ");

      while (res.next()) {
         System.out.println(res.getInt(1) + " " + res.getString(2) + " " + res.getDouble(3) + " "
               + res.getString(4) + " " + res.getString(5));
      }
      con.close();
   }
}
Save the program in a file named HiveQLOrderBy.java. Use the following commands to compile
and execute this program.
$ javac HiveQLOrderBy.java
$ java HiveQLOrderBy
Output:
1205  Kranthi      30000  Op Admin           Admin
1204  Krian        40000  Hr Admin           HR
1202  Manisha      45000  Proofreader        PR
1201  Gopal        45000  Technical manager  TP
1203  Masthanvali  40000  Technical writer   TP
The GROUP BY clause is used to group all the records in a result set using a particular collection
column. It is used to query a group of records.
Syntax

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];
Example
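Generate a query to count the number of employees in each department (this is the query the JDBC program below submits):

hive> SELECT Dept, count(*) FROM employee GROUP BY DEPT;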
On successful execution of the query, you get to see the following response:
+-------+----------+
| Dept  | Count(*) |
+-------+----------+
| Admin | 1        |
| PR    | 2        |
| TP    | 3        |
+-------+----------+
JDBC Program
Given below is the JDBC program to apply the Group By clause for the given example.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveQLGroupBy {
   public static void main(String[] args) throws SQLException {
      // Register the Hive JDBC driver (the HiveServer1 driver matching the jdbc:hive:// URL below)
      try {
         Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
      } catch (ClassNotFoundException e) {
         e.printStackTrace();
         System.exit(1);
      }

      // get connection
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");

      // create statement
      Statement stmt = con.createStatement();

      // execute statement
      ResultSet res = stmt.executeQuery("SELECT Dept, count(*) FROM employee GROUP BY DEPT;");
      System.out.println(" Dept \t count(*)");

      while (res.next()) {
         System.out.println(res.getString(1) + " " + res.getInt(2));
      }
      con.close();
   }
}
Save the program in a file named HiveQLGroupBy.java. Use the following commands to compile
and execute this program.
$ javac HiveQLGroupBy.java
$ java HiveQLGroupBy
Output:
Dept Count(*)
Admin 1
PR 2
TP 3
JOIN is a clause that is used for combining specific fields from two tables by using values
common to each one. It is used to combine records from two or more tables in the database.
Syntax

join_table:
   table_reference JOIN table_factor [join_condition]
   | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
   | table_reference LEFT SEMI JOIN table_reference join_condition
Example
We will use the following two tables in this chapter. Consider the following table named CUSTOMERS.
+----+----------+-----+-----------+----------+
| ID | NAME     | AGE | ADDRESS   | SALARY   |
+----+----------+-----+-----------+----------+
| 1  | Ramesh   | 32  | Ahmedabad | 2000.00  |
| 2  | Khilan   | 25  | Delhi     | 1500.00  |
| 3  | kaushik  | 23  | Kota      | 2000.00  |
| 4  | Chaitali | 25  | Mumbai    | 6500.00  |
| 5  | Hardik   | 27  | Bhopal    | 8500.00  |
| 6  | Komal    | 22  | MP        | 4500.00  |
| 7  | Muffy    | 24  | Indore    | 10000.00 |
+----+----------+-----+-----------+----------+
Consider another table ORDERS as follows:
+-----+---------------------+-------------+--------+
| OID | DATE                | CUSTOMER_ID | AMOUNT |
+-----+---------------------+-------------+--------+
| 102 | 2009-10-08 00:00:00 | 3           | 3000   |
| 100 | 2009-10-08 00:00:00 | 3           | 1500   |
| 101 | 2009-11-20 00:00:00 | 2           | 1560   |
| 103 | 2008-05-20 00:00:00 | 4           | 2060   |
+-----+---------------------+-------------+--------+
There are different types of joins given as follows:
JOIN
LEFT OUTER JOIN
RIGHT OUTER JOIN
FULL OUTER JOIN
JOIN
The JOIN clause is used to combine and retrieve records from multiple tables. JOIN in HiveQL behaves like an INNER JOIN in SQL: only rows that match in both tables are returned. A JOIN condition is typically expressed using the primary and foreign keys of the tables.
The following query executes a JOIN on the CUSTOMERS and ORDERS tables, and retrieves the records:
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+----+----------+-----+--------+
| ID | NAME     | AGE | AMOUNT |
+----+----------+-----+--------+
| 3  | kaushik  | 23  | 3000   |
| 3  | kaushik  | 23  | 1500   |
| 2  | Khilan   | 25  | 1560   |
| 4  | Chaitali | 25  | 2060   |
+----+----------+-----+--------+
LEFT OUTER JOIN
The HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if there are no
matches in the right table. This means, if the ON clause matches 0 (zero) records in the right table,
the JOIN still returns a row in the result, but with NULL in each column from the right table.
A LEFT JOIN returns all the values from the left table, plus the matched values from the right
table, or NULL in case of no matching JOIN predicate.
The following query demonstrates LEFT OUTER JOIN between the CUSTOMERS and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
LEFT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+----+----------+--------+---------------------+
| ID | NAME     | AMOUNT | DATE                |
+----+----------+--------+---------------------+
| 1  | Ramesh   | NULL   | NULL                |
| 2  | Khilan   | 1560   | 2009-11-20 00:00:00 |
| 3  | kaushik  | 3000   | 2009-10-08 00:00:00 |
| 3  | kaushik  | 1500   | 2009-10-08 00:00:00 |
| 4  | Chaitali | 2060   | 2008-05-20 00:00:00 |
| 5  | Hardik   | NULL   | NULL                |
| 6  | Komal    | NULL   | NULL                |
| 7  | Muffy    | NULL   | NULL                |
+----+----------+--------+---------------------+
RIGHT OUTER JOIN
The HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even if there are no
matches in the left table. If the ON clause matches 0 (zero) records in the left table, the JOIN still
returns a row in the result, but with NULL in each column from the left table.
A RIGHT JOIN returns all the values from the right table, plus the matched values from the left
table, or NULL in case of no matching join predicate.
The following query demonstrates RIGHT OUTER JOIN between the CUSTOMERS and ORDERS tables:

hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
   FROM CUSTOMERS c
   RIGHT OUTER JOIN ORDERS o
   ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+----+----------+--------+---------------------+
| ID | NAME     | AMOUNT | DATE                |
+----+----------+--------+---------------------+
| 3  | kaushik  | 3000   | 2009-10-08 00:00:00 |
| 3  | kaushik  | 1500   | 2009-10-08 00:00:00 |
| 2  | Khilan   | 1560   | 2009-11-20 00:00:00 |
| 4  | Chaitali | 2060   | 2008-05-20 00:00:00 |
+----+----------+--------+---------------------+
FULL OUTER JOIN
The HiveQL FULL OUTER JOIN combines the records of both the left and the right outer tables
that fulfil the JOIN condition. The joined table contains either all the records from both the tables,
or fills in NULL values for missing matches on either side.
The following query demonstrates FULL OUTER JOIN between the CUSTOMERS and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
FULL OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+----+----------+--------+---------------------+
| ID | NAME     | AMOUNT | DATE                |
+----+----------+--------+---------------------+
| 1  | Ramesh   | NULL   | NULL                |
| 2  | Khilan   | 1560   | 2009-11-20 00:00:00 |
| 3  | kaushik  | 3000   | 2009-10-08 00:00:00 |
| 3  | kaushik  | 1500   | 2009-10-08 00:00:00 |
| 4  | Chaitali | 2060   | 2008-05-20 00:00:00 |
| 5  | Hardik   | NULL   | NULL                |
| 6  | Komal    | NULL   | NULL                |
| 7  | Muffy    | NULL   | NULL                |
| 3  | kaushik  | 3000   | 2009-10-08 00:00:00 |
| 3  | kaushik  | 1500   | 2009-10-08 00:00:00 |
| 2  | Khilan   | 1560   | 2009-11-20 00:00:00 |
| 4  | Chaitali | 2060   | 2008-05-20 00:00:00 |
+----+----------+--------+---------------------+
Bucketing
• The bucket a record goes to is determined by (hash function on the bucketed column) mod (total number of buckets). The hash_function depends on the type of the bucketing column.
• Records with the same bucketed column value will always be stored in the same bucket.
• Physically, each bucket is just a file in the table directory, and bucket numbering is 1-based.
• Bucketing can be done along with partitioning on Hive tables, or even without partitioning.
• Bucketed tables create almost equally distributed data file parts, unless there is skew in the data (see the DDL sketch after this list).
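A minimal DDL sketch of a bucketed table (the table and column names are illustrative, loosely modeled on the ORDERS table above):

CREATE TABLE orders_bucketed (
   oid INT,
   customer_id INT,
   amount DOUBLE
)
CLUSTERED BY (customer_id) INTO 4 BUCKETS;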
Advantages
• Bucketed tables offer more efficient sampling than non-bucketed tables. With sampling, we can try out queries on a fraction of the data for testing and debugging purposes when the original data sets are very huge.
• As the data files are equal-sized parts, map-side joins will be faster on bucketed tables than on non-bucketed tables.
• The bucketing concept also provides the flexibility to keep the records in each bucket sorted by one or more columns. This makes map-side joins even more efficient, since the join of each bucket becomes an efficient merge-sort.
Bucketing Vs Partitioning
• Partitioning helps in the elimination of data, if used in a WHERE clause, whereas bucketing helps in organizing the data in each partition into multiple files, so that the same set of data is always written to the same bucket.
• A Hive bucket is nothing but another technique of decomposing data into more manageable, roughly equal parts.
Sampling
• TABLESAMPLE() gives more disordered and random records from a table as compared to LIMIT.
• We can sample using the rand() function, which returns a random number; here rand() refers to any random column.
• The denominator in the bucket clause represents the number of buckets into which data will be hashed; the numerator is the bucket number selected.
• If the columns specified in the TABLESAMPLE clause match the columns in the CLUSTERED BY clause, TABLESAMPLE queries only scan the required hash partitions of the table (see the sketch after this list).
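A sampling sketch against the hypothetical orders_bucketed table above, selecting bucket 1 of 4 hashed on customer_id:

SELECT * FROM orders_bucketed
TABLESAMPLE(BUCKET 1 OUT OF 4 ON customer_id);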
Joins and Types
Reduce-Side Join
Map-Side Join
• In case one of the datasets is small, a map-side join takes place. In a map-side join, a local job runs to create a hash table from the contents of the HDFS file and sends it to every node.
• The data must be bucketed on the keys used in the ON clause, and the number of buckets for one table must be a multiple of the number of buckets for the other table. When these conditions are met, Hive can join individual buckets between tables in the map phase, because it does not have to fetch the entire content of one table to match against each bucket in the other table. The relevant settings:
SET hive.optimize.bucketmapjoin = true;
SET hive.auto.convert.join = true;
SMBM Join
• SMB (sort-merge-bucket) joins are used wherever the tables are sorted and bucketed.
• The join boils down to just merging the already sorted tables, allowing this operation to be faster than an ordinary map-join.
LEFT SEMI JOIN
• A left semi-join returns records from the left-hand table if records are found in the right-hand table that satisfy the ON predicates.
• SELECT and WHERE clauses can't reference columns from the right-hand table.
• The reason semi-joins are more efficient than the more general inner join is as follows: for a given record in the left-hand table, Hive can stop looking for matching records in the right-hand table as soon as any match is found. At that point, the selected columns from the left-hand table record can be projected (see the sketch below).
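A sketch using the CUSTOMERS and ORDERS tables from the join examples above:

hive> SELECT c.ID, c.NAME
   FROM CUSTOMERS c
   LEFT SEMI JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);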
File Formats
• As we are dealing with structured data, each record has to have its own structure.
• File formats mainly vary in data encoding, compression rate, usage of space, and disk I/O.
• Hive does not verify whether the data that you are loading matches the schema for the table. However, it does verify whether the file format matches the table definition.
SerDe in Hive
• The SerDe interface allows you to instruct Hive as to how a record should be processed.
• The Deserializer interface takes a string or binary representation of a record and translates it into a Java object that Hive can manipulate.
• The Serializer, however, takes a Java object that Hive has been working with and turns it into something that Hive can write to HDFS or another supported system.
• Commonly, Deserializers are used at query time to execute SELECT statements, and Serializers are used when writing data, such as through an INSERT-SELECT statement.
Commonly used SerDes include:
CSVSerDe
JSONSerDe
RegexSerDe ('org.apache.hadoop.hive.contrib.serde2.RegexSerDe')
For example:
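A minimal sketch using the CSV SerDe that ships with recent Hive releases (the table and column names are illustrative):

CREATE TABLE csv_table (name STRING, city STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"")
STORED AS TEXTFILE;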
Partitioning
• Partitioning a table stores data in sub-directories categorized by table location, which allows Hive to exclude unnecessary data from queries without reading all the data every time a new query is made.
• Hive does support Dynamic Partitioning (DP), where partition column values are only known at EXECUTION TIME. To enable dynamic partitioning:
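SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;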
• Another situation we want to protect against with dynamic-partition inserts is a user accidentally specifying all partitions to be dynamic without specifying one static partition, while the original intention is just to overwrite the sub-partitions of one root partition; the default "strict" mode guards against this.
To enable bucketing:
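SET hive.enforce.bucketing = true;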
Optimizations in Hive
• Use denormalisation, filtering, and projection as early as possible to reduce the data before a join.
• A join is a costly affair and requires an extra MapReduce phase to accomplish the query job. With denormalisation, the data is present in the same table, so there is no need for any joins; hence the selects are very fast.
• As a join requires data to be shuffled across nodes, use filtering and projection as early as possible to reduce the data before the join.
TUNE CONFIGURATIONS
• Parallel execution: applies to MapReduce jobs that can run in parallel, for example, jobs processing different source tables before a join.
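To enable it:

SET hive.exec.parallel = true;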
USE ORCFILE
• Hive supports ORCFile, a table storage format that sports fantastic speed improvements through techniques like predicate push-down, compression, and more.
• Using ORCFile for every Hive table is extremely beneficial for getting fast response times for your Hive queries.
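A minimal sketch that creates an ORC-backed copy of the employee table (the target table name is illustrative):

CREATE TABLE employee_orc STORED AS ORC AS SELECT * FROM employee;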
USE TEZ
• With Hadoop 2 and Tez, the cost of job submission and scheduling is minimized.
• Also, Tez does not restrict the job to be only a Map followed by a Reduce; this implies that all the query execution can be done in a single job without having to cross job boundaries.
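To switch the execution engine:

SET hive.execution.engine = tez;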
• Consider a clicks table where each record represents a click event, and we would like to find the latest URL for each sessionID:

SELECT clicks.sessionID, clicks.url
FROM clicks
INNER JOIN (
   SELECT sessionID, max(timestamp) AS max_ts
   FROM clicks
   GROUP BY sessionID
) latest
ON clicks.sessionID = latest.sessionID
AND clicks.timestamp = latest.max_ts;
• In the above query, we build a sub-query to collect the timestamp of the latest event in each session, and then use an inner join to filter out the rest.
• While the query is a reasonable solution from a functional point of view, it turns out there is a better way to re-write this query, as follows:
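A sketch of the windowed rewrite (the rnk alias is illustrative):

SELECT sessionID, url
FROM (
   SELECT sessionID, url,
          RANK() OVER (PARTITION BY sessionID ORDER BY timestamp DESC) AS rnk
   FROM clicks
) ranked_clicks
WHERE ranked_clicks.rnk = 1;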
• Here, we use Hive's OLAP functionality (OVER and RANK) to achieve the same thing without a join.
• Clearly, removing an unnecessary join will almost always result in better performance, and when using big data this is more important than ever.
• Hive has a special syntax for producing multiple aggregations from a single pass through a source of data, rather than rescanning it for each aggregation.
• This change can save considerable processing time for large input data sets.
• For example, each of the following two queries creates a table from the same source table, history:
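A sketch of what the two separate queries might look like (the sales and credits tables and the action column are illustrative):

INSERT OVERWRITE TABLE sales
SELECT * FROM history WHERE action = 'purchased';

INSERT OVERWRITE TABLE credits
SELECT * FROM history WHERE action = 'returned';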
• The following rewrite achieves the same thing, but using a single pass through the source history
table:
FROM history
INSERT OVERWRITE TABLE sales SELECT * WHERE action = 'purchased'
INSERT OVERWRITE TABLE credits SELECT * WHERE action = 'returned';