Unit-VI HIVE: Hadoop & Big Data
Applying Structure to Hadoop Data with Hive: Saying Hello to Hive, Seeing How the Hive is Put Together,
Getting Started with Apache Hive, Examining the Hive Clients, Working with Hive Data Types, Creating and
Managing Databases and Tables, Seeing How the Hive Data Manipulation Language Works, Querying and
Analyzing Data.
HIVE Introduction
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It
resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
The term ‘Big Data’ is used for collections of large datasets characterized by huge volume, high velocity, and a wide variety of data that grows day by day. Such data is difficult to process using traditional data management systems. Therefore, the Apache Software Foundation introduced a framework called Hadoop to solve Big Data management and processing challenges.
Hadoop
Hadoop is an open-source framework for storing and processing Big Data in a distributed environment. It contains two core modules: MapReduce and the Hadoop Distributed File System (HDFS).
MapReduce: It is a parallel programming model for processing large amounts of structured, semi-
structured, and unstructured data on large clusters of commodity hardware.
HDFS: The Hadoop Distributed File System is the storage layer of the Hadoop framework, used to store the datasets. It provides a fault-tolerant file system that runs on commodity hardware.
The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and Hive that are used to support the Hadoop modules.
Sqoop: It is used to import and export data between HDFS and RDBMS.
Pig: It is a procedural language platform used to develop scripts for MapReduce operations.
Hive: It is a platform used to develop SQL-type scripts to do MapReduce operations.
Initially Hive was developed by Facebook, later the Apache Software Foundation took it
up and developed it further as an open source under the name Apache Hive. It is used by
different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive
It stores schema in a database and processed data into HDFS.
It is designed for OLAP.
It provides an SQL-type language for querying, called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.
Architecture of Hive
The main components of Hive are the user interfaces (such as the Hive Web UI and the command line), the metastore, the HiveQL process engine, the execution engine, and the underlying storage (HDFS or HBase). The following component diagram depicts the architecture of Hive:
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
The following steps describe how Hive interacts with the Hadoop framework:

1. Execute Query: The Hive interface, such as the command line or Web UI, sends the query to the driver (any database driver such as JDBC or ODBC) for execution.
2. Get Plan: The driver takes the help of the query compiler, which parses the query to check the syntax and build the query plan.
3. Get Metadata: The compiler sends a metadata request to the metastore (any database).
4. Send Metadata: The metastore sends the metadata as a response to the compiler.
5. Send Plan: The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
6. Execute Plan: The driver sends the execution plan to the execution engine.
7. Execute Job: Internally, the execution of the plan is a MapReduce job. The execution engine sends the job to the JobTracker, which resides in the Name node, and the JobTracker assigns the job to the TaskTracker, which resides in the Data node. Here, the query runs as a MapReduce job.
7.1 Metadata Ops: Meanwhile, during execution, the execution engine can perform metadata operations with the metastore.
8. Fetch Result: The execution engine receives the results from the Data nodes.
9. Send Results: The execution engine sends those result values to the driver.
10. Send Results: The driver sends the results to the Hive interfaces.
Hive - Data Types
All the data types in Hive are classified into four types, given as follows:
Column Types
Literals
Null Values
Complex Types
Column Types
Column types are used as the column data types of Hive. They are as follows:
Integral Types
Integer-type data can be specified using the integral data types; INT is the default. When the data range exceeds the range of INT, you need to use BIGINT, and if the data range is smaller than that of INT, you use SMALLINT. TINYINT is smaller than SMALLINT.

Type       Postfix   Example
TINYINT    Y         10Y
SMALLINT   S         10S
INT        -         10
BIGINT     L         10L
String Types
String-type data can be specified using single quotes (' ') or double quotes (" "). Hive contains two string data types: VARCHAR and CHAR. Hive follows C-style escape characters.

Data Type   Length
VARCHAR     1 to 65535
CHAR        255
Timestamp
It supports traditional UNIX timestamp with optional nanosecond precision. It supports
java.sql.Timestamp format “YYYY-MM-DD HH:MM:SS.fffffffff” and format “yyyy-mm-dd
hh:mm:ss.ffffffffff”.
Dates
DATE values are described in year/month/day format in the form YYYY-MM-DD.
Decimals
The DECIMAL type in Hive is the same as the BigDecimal format of Java. It is used for representing immutable arbitrary-precision values. The syntax and an example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
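As a quick illustration of these column types in use, the following sketch creates a table; the table name and columns here are purely illustrative:
hive> CREATE TABLE sales_sketch (
        item_id    BIGINT,          -- integral type
        item_name  VARCHAR(100),    -- string type
        quantity   SMALLINT,
        sold_at    TIMESTAMP,
        sale_date  DATE,
        price      DECIMAL(10,2)    -- precision 10, scale 2
      );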
Union Types
A union is a collection of heterogeneous data types. You can create an instance using create_union. The syntax and an example are as follows:
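A minimal sketch of a UNIONTYPE column declaration (the table and column names are illustrative):
hive> CREATE TABLE union_test (
        foo UNIONTYPE<int, double, array<string>, struct<a:int, b:string>>
      );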
Literals
The following literals are used in Hive:
Decimal Type
Decimal-type data is nothing but a floating-point value with a higher range than the DOUBLE data type. The range of the decimal type is approximately -10^308 to 10^308.
Null Value
Missing values are represented by the special value NULL.
Complex Types
The Hive complex data types are as follows:
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive are similar to Java maps.
Syntax: MAP<primitive_type, data_type>
Structs
Structs in Hive are similar to using complex data with comments.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
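The following sketch shows these complex types in a table definition; the table name employee_contacts and its columns are illustrative:
hive> CREATE TABLE employee_contacts (
        name          STRING,
        phone_numbers ARRAY<STRING>,                -- list of phone numbers
        dept_salary   MAP<STRING, FLOAT>,           -- department name -> salary
        address       STRUCT<city:STRING, zip:INT>  -- nested address record
      );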
Hive - Create Database
The syntax for this statement is as follows:
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
Here, IF NOT EXISTS is an optional clause, which suppresses the error if a database with the same name already exists. We can use SCHEMA in place of DATABASE in this command. The following query is executed to create a database named userdb:
hive> CREATE DATABASE IF NOT EXISTS userdb;
or
hive> CREATE SCHEMA userdb;
Hive - Drop Database
The following query is used to drop a database. Let us assume that the database name is userdb.
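hive> DROP DATABASE IF EXISTS userdb;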
The following query drops the database using CASCADE, which means dropping the respective tables before dropping the database:
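hive> DROP DATABASE IF EXISTS userdb CASCADE;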
Hive - Create Table
Syntax
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]
Example
Let us assume you need to create a table named employee using the CREATE TABLE statement. The following table lists the fields and their data types in the employee table:
Sr.No   Field Name    Data Type
1       Eid           int
2       Name          String
3       Salary        Float
4       Designation   String
The following query creates a table named employee using the above data.
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String, salary Float, designation String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
If you add the option IF NOT EXISTS, Hive ignores the statement in case the table already exists. On successful creation of the table, you get to see the following response:
OK
Time taken: 5.905 seconds
hive>
Hive - Load Data
While inserting data into Hive, it is better to use LOAD DATA to store bulk records. There are two ways to load data: one is from the local file system and the other is from the Hadoop file system.
Syntax
The syntax for LOAD DATA is as follows:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]
Here, LOCAL is an optional identifier to specify the local path, and OVERWRITE optionally overwrites the existing data in the table. The following query loads the given text into the table.
hive> LOAD DATA LOCAL INPATH
'/home/user/sample.txt' OVERWRITE INTO TABLE employee;
OK
Time taken: 15.905 seconds
hive>
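Loading from the Hadoop file system looks the same, just without the LOCAL keyword; the HDFS path below is illustrative:
hive> LOAD DATA INPATH '/user/hive/sample.txt' OVERWRITE INTO TABLE employee;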
Hive - Alter Table
Syntax
The statement takes any of the following syntaxes based on what attributes we wish to modify in a table.
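ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])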
Change Statement
The following table contains the fields of the employee table and shows the fields to be changed (in bold).
The following queries rename a column and change a column's data type using the above data:
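For instance, queries of the following form rename the name column to ename and change the type of the salary column to Double (these particular columns are illustrative, since the table of changed fields is not reproduced here):
hive> ALTER TABLE employee CHANGE name ename String;
hive> ALTER TABLE employee CHANGE salary salary Double;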
Replace Statement
The following query deletes all the columns from the employee table and replaces them with empid and name columns:
hive> ALTER TABLE employee REPLACE COLUMNS (empid Int, name String);
Hive - Drop Table
The following query drops a table named employee:
hive> DROP TABLE IF EXISTS employee;
OK
Time taken: 5.3 seconds
hive>
Operators in HIVE:
Relational Operators
Arithmetic Operators
Logical Operators
Complex Operators
Relational Operators: These operators are used to compare two operands. The following table describes the relational operators available in Hive:
Operator    Operand types   Description
A IS NULL   all types       TRUE if expression A evaluates to NULL, otherwise FALSE.
Example
Let us assume the employee table is composed of fields named Id, Name, Salary,
Designation, and Dept as shown below. Generate a query to retrieve the employee details
whose Id is 1205.
+------+--------------+--------+--------------------+--------+
| Id   | Name         | Salary | Designation        | Dept   |
+------+--------------+--------+--------------------+--------+
| 1201 | Gopal        | 45000  | Technical manager  | TP     |
| 1202 | Manisha      | 45000  | Proofreader        | PR     |
| 1203 | Masthanvali  | 40000  | Technical writer   | TP     |
| 1204 | Krian        | 40000  | Hr Admin           | HR     |
| 1205 | Kranthi      | 30000  | Op Admin           | Admin  |
+------+--------------+--------+--------------------+--------+
The following query is executed to retrieve the employee details using the above table:
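hive> SELECT * FROM employee WHERE Id=1205;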
+------+--------------+--------+--------------------+--------+
| ID   | Name         | Salary | Designation        | Dept   |
+------+--------------+--------+--------------------+--------+
| 1205 | Kranthi      | 30000  | Op Admin           | Admin  |
+------+--------------+--------+--------------------+--------+
The following query is executed to retrieve the employee details whose salary is
more than or equal to Rs 40000.
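hive> SELECT * FROM employee WHERE Salary>=40000;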
+------+--------------+--------+--------------------+--------+
| ID   | Name         | Salary | Designation        | Dept   |
+------+--------------+--------+--------------------+--------+
| 1201 | Gopal        | 45000  | Technical manager  | TP     |
| 1202 | Manisha      | 45000  | Proofreader        | PR     |
| 1203 | Masthanvali  | 40000  | Technical writer   | TP     |
| 1204 | Krian        | 40000  | Hr Admin           | HR     |
+------+--------------+--------+--------------------+--------+
Arithmetic Operators
These operators support various common arithmetic operations on the operands. All of them return number types. The following table describes the arithmetic operators available in Hive:

Operator   Operand            Description
A % B      all number types   Gives the remainder resulting from dividing A by B.
A & B      all number types   Gives the result of bitwise AND of A and B.
A ^ B      all number types   Gives the result of bitwise XOR of A and B.
Example
The following query adds two numbers, 20 and 30.
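For instance, on a reasonably recent Hive release (which allows SELECT without a FROM clause), the sum can be computed directly; the column alias sum_value is illustrative:
hive> SELECT 20 + 30 AS sum_value;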
Logical Operators
These operators are logical expressions. All of them return either TRUE or FALSE.

Operator   Operand   Description
A || B     boolean   Same as A OR B.
Example
The following query is used to retrieve employee details whose Department is TP and
Salary is more than Rs 40000.
hive> SELECT * FROM employee WHERE Salary>40000 && Dept='TP';
Complex Operators
These operators provide an expression to access the elements of complex types.

Operator   Operand                                Description
A[n]       A is an Array and n is an int          Returns the nth element in the array A. The first element has index 0.
M[key]     M is a Map<K, V> and key has type K    Returns the value corresponding to the key in the map.
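A short illustration, using the hypothetical employee_contacts table sketched earlier:
hive> SELECT name, phone_numbers[0], dept_salary['HR'], address.city
      FROM employee_contacts;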
HiveQL - Select-Where
The Hive Query Language (HiveQL) is a query language for Hive to process and analyze structured data in a Metastore. This chapter explains how to use the SELECT statement with the WHERE clause.
The SELECT statement is used to retrieve data from a table. The WHERE clause works like a condition: it filters the data using the condition and gives you a finite result. The built-in operators and functions generate an expression which fulfils the condition.
Syntax
Given below is the syntax of the SELECT query:
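SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];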
Example
Let us take an example for SELECT…WHERE clause. Assume we have the employee table
as given below, with fields named Id, Name, Salary, Designation, and Dept. Generate
a query to retrieve the employee details who earn a salary of more than Rs 30000.
+------+--------------+--------+--------------------+--------+
| ID   | Name         | Salary | Designation        | Dept   |
+------+--------------+--------+--------------------+--------+
| 1201 | Gopal        | 45000  | Technical manager  | TP     |
| 1202 | Manisha      | 45000  | Proofreader        | PR     |
| 1203 | Masthanvali  | 40000  | Technical writer   | TP     |
| 1204 | Krian        | 40000  | Hr Admin           | HR     |
| 1205 | Kranthi      | 30000  | Op Admin           | Admin  |
+------+--------------+--------+--------------------+--------+
The following query retrieves the employee details using the above scenario:
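hive> SELECT * FROM employee WHERE Salary>30000;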
On successful execution of the query, you get to see the following response:
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
+------+--------------+-------------+-------------------+--------+
HiveQL - Select-Order By
This chapter explains how to use the ORDER BY clause in a SELECT statement.
The ORDER BY clause is used to retrieve the details based on one column and sort
the result set by ascending or descending order.
Syntax
Given below is the syntax of the ORDER BY clause:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];
Example
Let us take an example for the SELECT...ORDER BY clause. Assume the employee table as given below, with the fields named Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the employee details ordered by department name.
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
|1205 | Kranthi | 30000 | Op Admin | Admin |
+------+--------------+-------------+-------------------+--------+
The following query retrieves the employee details using the above
scenario:
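hive> SELECT * FROM employee ORDER BY Dept;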
HiveQL - Select-Group By
This chapter explains the details of GROUP BY clause in a SELECT statement.
The GROUP BY clause is used to group all the records in a result set using a
particular collection column. It is used to query a group of records.
The syntax of the GROUP BY clause is as follows:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];
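For example, the following query counts the employees in each department of the employee table used above:
hive> SELECT Dept, count(*) FROM employee GROUP BY Dept;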
HiveQL - Select-Joins
JOIN is a clause that is used for combining specific fields from two tables by using values
common to each one. It is used to combine records from two or more tables in the
database. It is more or less similar to SQL JOIN.
Syntax
join_table:
    table_reference JOIN table_factor [join_condition]
  | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
  | table_reference LEFT SEMI JOIN table_reference join_condition
  | table_reference CROSS JOIN table_reference [join_condition]
Example
We will use the following two tables in this chapter. Consider the following table named CUSTOMERS:
+----+----------+-----+-----------+----------+
| ID | NAME | AGE | ADDRESS | SALARY |
+----+----------+-----+-----------+----------+
| 1 | Ramesh | 32 | Ahmedabad | 2000.00 |
| 2 | Khilan | 25 | Delhi | 1500.00 |
| 3 | kaushik | 23 | Kota | 2000.00 |
| 4 | Chaitali | 25 | Mumbai | 6500.00 |
| 5 | Hardik | 27 | Bhopal | 8500.00 |
| 6 | Komal | 22 | MP | 4500.00 |
| 7 | Muffy | 24 | Indore | 10000.00 |
+----+----------+-----+-----------+----------+
And consider another table named ORDERS:
+-----+---------------------+-------------+--------+
| OID | DATE                | CUSTOMER_ID | AMOUNT |
+-----+---------------------+-------------+--------+
| 102 | 2009-10-08 00:00:00 | 3           | 3000   |
| 100 | 2009-10-08 00:00:00 | 3           | 1500   |
| 101 | 2009-11-20 00:00:00 | 2           | 1560   |
| 103 | 2008-05-20 00:00:00 | 4           | 2060   |
+-----+---------------------+-------------+--------+
JOIN
The JOIN clause is used to combine and retrieve the records from multiple tables. JOIN is the same as INNER JOIN in SQL. A JOIN condition is raised using the primary keys and foreign keys of the tables.
The following query executes JOIN on the CUSTOMERS and ORDERS tables, and retrieves the records:
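A representative query (the table aliases c and o and the selected columns are illustrative):
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
      FROM CUSTOMERS c JOIN ORDERS o
      ON (c.ID = o.CUSTOMER_ID);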
A LEFT OUTER JOIN returns all the rows from the left table, plus the matched values from the right table, or NULL in case of no matching join predicate. The following query demonstrates LEFT OUTER JOIN between the CUSTOMERS and ORDERS tables:
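Again, the aliases and selected columns here are illustrative:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
      FROM CUSTOMERS c LEFT OUTER JOIN ORDERS o
      ON (c.ID = o.CUSTOMER_ID);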
A RIGHT JOIN returns all the values from the right table, plus the matched values from
the left table, or NULL in case of no matching join predicate.
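A representative RIGHT OUTER JOIN query (aliases and selected columns are illustrative):
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
      FROM CUSTOMERS c RIGHT OUTER JOIN ORDERS o
      ON (c.ID = o.CUSTOMER_ID);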
+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 3 | kaushik | 3000 | 2009-10-08 |
| 3 | kaushik | 1500 | 2009-10-08 |
| 2 | Khilan | 1560 | 2009-11-20 |
| 4 | Chaitali | 2060 | 2008-05-20 |
+------+----------+--------+---------------------+
+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
+------+----------+--------+---------------------+
IMP Questions