Unit5 Notes
Hive, data types and file formats, HiveQL data definition, HiveQL data manipulation, HiveQL
queries. Case study on analysing different phases of data analytics.
HIVE
The term ‘Big Data’ is used for collections of large datasets that include
huge volume, high velocity, and a variety of data that is increasing day by day.
Using traditional data management systems, it is difficult to process Big Data.
Therefore, the Apache Software Foundation introduced a framework called
Hadoop to solve Big Data management and processing challenges.
Hadoop
What is Hive
Hive is a data warehouse infrastructure tool used to process structured data in Hadoop. It resides on top of Hadoop and makes querying and analyzing Big Data easy.
Architecture of Hive
This component diagram contains different units. The following table describes
each unit:
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
The following table defines how Hive interacts with Hadoop framework:
Step No. | Operation
1 Execute Query
The Hive interface such as Command Line or Web UI sends query to
Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2 Get Plan
The driver takes the help of the query compiler, which parses the query to
check the syntax and build the query plan.
3 Get Metadata
The compiler sends metadata request to Metastore (any database).
4 Send Metadata
Metastore sends metadata as a response to the compiler.
5 Send Plan
The compiler checks the requirement and resends the plan to the
driver. Up to here, the parsing and compiling of a query is complete.
6 Execute Plan
The driver sends the execute plan to the execution engine.
7 Execute Job
Internally, the execution of the job is a MapReduce job. The
execution engine sends the job to the JobTracker, which is in the Name
node, and it assigns this job to TaskTrackers, which are in the Data nodes.
Here, the query executes as a MapReduce job.
8 Fetch Result
The execution engine receives the results from Data nodes.
9 Send Results
The execution engine sends those resultant values to the driver.
10 Send Results
The driver sends the results to Hive Interfaces.
DATA TYPES AND FILE FORMATS
There are different data types in Hive, which are involved in the table
creation. All the data types in Hive are classified into four types, given as
follows:
Column Types
Literals
Null Values
Complex Types
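Only the column types are elaborated below, so here is a brief sketch of how the complex types (arrays, maps, and structs) appear as column types. The table and column names are illustrative assumptions, not from these notes:

```sql
-- Hypothetical table mixing primitive and complex column types
CREATE TABLE employee_contacts (
  name      STRING,                          -- primitive type
  phones    ARRAY<STRING>,                   -- complex: ordered list of values
  addresses MAP<STRING, STRING>,             -- complex: key-value pairs
  job       STRUCT<title:STRING, grade:INT>  -- complex: named fields
);

-- Complex fields are accessed by index, key, or field name:
SELECT phones[0], addresses['home'], job.title FROM employee_contacts;
```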
Column Types
Column types are used as the column data types of Hive tables. They are as follows:
Integral Types
Integer type data can be specified using the integral data types; INT is the
default. When the data range exceeds the range of INT, you need to use BIGINT,
and if the data range is smaller than that of INT, you use SMALLINT. TINYINT is
smaller than SMALLINT.
The following table depicts various INT data types:
Type | Postfix | Example
TINYINT | Y | 10Y
SMALLINT | S | 10S
INT | - | 10
BIGINT | L | 10L
String Types
String type data can be specified using single quotes (' ') or double
quotes (" "). Hive contains two string data types: VARCHAR and CHAR. Hive follows
C-style escape characters.
Data Type | Length
VARCHAR | 1 to 65535
CHAR | 255
Timestamp
It supports the traditional UNIX timestamp with optional nanosecond
precision. It supports the java.sql.Timestamp format “YYYY-MM-DD
HH:MM:SS.fffffffff” and the format “yyyy-mm-dd hh:mm:ss.ffffffffff”.
Dates
DATE values are described in year/month/day format in the form
YYYY-MM-DD.
Decimals
The DECIMAL type in Hive is the same as the Big Decimal format of Java. It
is used for representing immutable arbitrary-precision decimal values. The syntax
and an example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
Union Types
Union is a collection of heterogeneous data types. You can create an
instance using the create_union UDF. The syntax and an example are as follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
FILE FORMATS
Text File
The Hive text file format is the default storage format. You can use the text
format to interchange data with other client applications. The text file format
is very common; most applications use it. Data is stored in lines, with each line
being a record. Each line is terminated by a newline character (\n).
The text format is a simple plain-text file format. You can use compression
(for example, BZIP2) on a text file to reduce the storage space.
Create a text-file table by adding the storage option ‘STORED AS
TEXTFILE’ at the end of a Hive CREATE TABLE command.
Syntax:
Create table textfile_table(column_specs)
stored as textfile;
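As a concrete sketch of this syntax (the table and column names here are illustrative assumptions, not from these notes):

```sql
-- Hypothetical text-format table: one record per line, fields comma-separated
CREATE TABLE emp_text (
  id     INT,
  name   STRING,
  salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```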
Sequence files are Hadoop flat files that store values in binary key-
value pairs. The sequence files are in binary format, and these files are
splittable. The main advantage of using sequence files is the ability to merge
two or more files into one file.
Create a sequence-file table by adding the storage option ‘STORED AS
SEQUENCEFILE’ at the end of a Hive CREATE TABLE command.
Syntax
Create table sequencefile_table (column_specs)
stored as sequencefile;
RCFile stands for Record Columnar File. This is another Hive file
format, which offers high row-level compression rates. If you have a requirement
to perform operations on multiple rows at a time, then you can use the RCFile format.
The RCFile format is very similar to the sequence file format; this file format
also stores the data as key-value pairs.
Create an RCFile table by specifying the ‘STORED AS RCFILE’ option at the end
of a CREATE TABLE command:
Syntax:
Create table RCfile_table(column_specs)
stored as rcfile;
ORC stands for Optimized Row Columnar file format. The ORC
file format provides a highly efficient way to store data in Hive tables. This file
format was designed to overcome limitations of the other Hive file
formats. The use of ORC files improves performance when Hive is reading,
writing, and processing data from large tables.
Syntax:
Create table orc_table(column_specs) stored as orc;
DDL
Command | Use With
CREATE | Database, Table
SHOW | Databases, Tables, Table Properties, Partitions, Functions, Index
DESCRIBE | Database, Table, View
USE | Database
DROP | Database, Table
ALTER | Database, Table
TRUNCATE | Table
1.Create Database:
The CREATE DATABASE statement is used to create a database in the
Hive. The DATABASE and SCHEMA are interchangeable. We can use either
DATABASE or SCHEMA.
Syntax:
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name;
Example :
CREATE DATABASE IF NOT EXISTS bda;
2.Show Database:
The SHOW DATABASES statement lists all the databases present in the
Hive.
Syntax:
SHOW (DATABASES|SCHEMAS);
Example:
SHOW DATABASES;
3.Describe Database:
The DESCRIBE DATABASE statement in Hive shows the name of
Database in Hive, and its location on the file system.
The EXTENDED keyword can be used to get the database properties.
Syntax:
DESCRIBE (DATABASE|SCHEMA) [EXTENDED] db_name;
Example
DESCRIBE DATABASE bda;
4.Use Database
The USE statement in Hive is used to select the specific database for a
session on which all subsequent HiveQL statements would be executed.
Syntax:
USE database_name;
Example
USE bda;
5.Drop Database
The DROP DATABASE statement in Hive is used to Drop (delete) the
database.
The default behavior is RESTRICT which means that the database is
dropped only when it is empty. To drop the database with tables, we can use
CASCADE.
Syntax:
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name
[RESTRICT|CASCADE];
Example :
DROP DATABASE IF EXISTS bda CASCADE;
1. CREATE TABLE
Syntax:
CREATE TABLE [IF NOT EXISTS] [db_name.] table_name [(col_name
data_type [COMMENT col_comment], ... [COMMENT col_comment])]
[COMMENT table_comment] [ROW FORMAT row_format] [STORED AS
file_format] [LOCATION hdfs_path];
Example
CREATE TABLE IF NOT EXISTS Employee(
Emp_ID STRING COMMENT ‘this is employee id’,
Emp_designation STRING COMMENT ‘this is employee post’);
2. SHOW TABLES
The SHOW TABLES statement in Hive lists all the base tables
and views in the current database.
Syntax:
SHOW TABLES [IN database_name];
Example
SHOW TABLES;
3. DESCRIBE TABLE in Hive
The DESCRIBE statement in Hive shows the lists of columns for the
specified table.
Syntax:
DESCRIBE [EXTENDED|FORMATTED] [db_name.]
table_name[.col_name ( [.field_name])];
Example:
DESCRIBE Employee;
It describes the column name, data type, and comment of each column.
4. DROP TABLE
The DROP TABLE statement in Hive deletes the data for a particular
table and removes all metadata associated with it from the Hive metastore.
If PURGE is not specified then the data is actually moved to the
.Trash/current directory. If PURGE is specified, then data is lost completely.
Syntax:
DROP TABLE [IF EXISTS] table_name [PURGE];
Example:
DROP TABLE IF EXISTS Employee PURGE;
5. TRUNCATE TABLE
TRUNCATE TABLE statement in Hive removes all the rows from the
table or partition.
Syntax:
TRUNCATE TABLE table_name;
Example
TRUNCATE TABLE Comp_Emp;
It removes all the rows in Comp_Emp Table.
DML
1.Load Command:
The LOAD statement in Hive is used to move data files into the locations
corresponding to Hive tables.
If a LOCAL keyword is specified, then the LOAD command will look
for the file path in the local filesystem.
If the LOCAL keyword is not specified, then the Hive will need the
absolute URI of the file.
In case the keyword OVERWRITE is specified, then the contents of the
target table/partition will be deleted and replaced by the files referred by
filepath.
If the OVERWRITE keyword is not specified, then the files referred by
filepath will be appended to the table.
Syntax:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO
TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)];
Example:
LOAD DATA LOCAL INPATH ‘/home/user/dab’ INTO TABLE emp_data;
2. SELECT COMMAND
The SELECT statement in Hive is similar to the SELECT statement in
SQL used for retrieving data from the database.
Syntax:
SELECT col1,col2 FROM tablename;
Example:
SELECT Id, Name FROM employee;
3. INSERT Command
The INSERT command in Hive loads the data into a Hive table. We can
do insert to both the Hive table or partition.
a. INSERT INTO
The INSERT INTO statement appends the data into existing data in the
table or partition. INSERT INTO statement works from Hive version 0.8.
Syntax:
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1,
partcol2=val2 ...)] select_statement1 FROM from_statement;
Example
CREATE TABLE IF NOT EXISTS example(id STRING, name STRING,
dep STRING, state STRING, salary STRING, year STRING);
INSERT INTO TABLE example SELECT * FROM dummy;
b. INSERT OVERWRITE
The INSERT OVERWRITE statement overwrites the existing data in the table or
partition.
Example:
Here we are overwriting the existing data of the table ‘example’ with the
data of table ‘dummy’ using an INSERT OVERWRITE statement.
INSERT OVERWRITE TABLE example SELECT * FROM dummy;
By using the SELECT statement we can verify whether the existing data of the
table ‘example’ is overwritten by the data of table ‘dummy’ or not.
SELECT * FROM EXAMPLE;
c. INSERT .. VALUES
The INSERT .. VALUES statement in Hive inserts data into the table directly
from SQL.
Example:
Inserting data into the ‘student’ table using INSERT ..VALUES statement.
INSERT INTO TABLE student VALUES(101,’Callen’,’IT’,’7.8’),
(103,’joseph’,’CS’,’8.2’),
(105,’Alex’,’IT’,’7.9’);
SELECT * FROM student;
4. DELETE command
The DELETE statement in Hive deletes the table data. If the WHERE
clause is specified, then it deletes the rows that satisfy the condition in where
clause.
Syntax:
DELETE FROM tablename [WHERE expression];
Example:
In the below example, we are deleting the data of the student from table
‘student’ whose roll_no is 105.
DELETE FROM student WHERE roll_no=105;
SELECT * FROM student;
5. UPDATE Command
The UPDATE statement in Hive updates the table data. If the WHERE
clause is specified, then it updates the columns of the rows that satisfy the
condition in the WHERE clause.
Syntax:
UPDATE tablename SET column = value [, column = value ...] [WHERE
expression];
Example:
In this example, we are updating the branch of the student whose roll_no is 103
in the ‘student’ table using an UPDATE statement.
UPDATE student SET branch=’IT’ WHERE roll_no=103;
SELECT * FROM student;
6. EXPORT Command
The Hive EXPORT statement exports the table or partition data along
with the metadata to the specified output location in the HDFS.
Example:
Here in this example, we are exporting the student table to the HDFS
directory “export_from_hive”.
EXPORT TABLE student TO ‘export_from_hive’;
The table is successfully exported. You can check for the _metadata file and
the data sub-directory using the ls command.
7. IMPORT Command
The Hive IMPORT command imports the data from a specified location
to a new table or already existing table.
Example:
Here in this example, we are importing the data exported in the above
example into a new Hive table ‘imported_table’.
IMPORT TABLE imported_table FROM ‘export_from_hive’;
Verify whether the data is imported using a Hive SELECT statement.
SELECT * from imported_table;
HIVEQL QUERIES
Like SQL, HiveQL includes:
HiveQL operators,
HiveQL functions,
HiveQL GROUP BY & HAVING,
HiveQL ORDER BY & SORT BY,
HiveQL joins.
HiveQL - Operators
HiveQL operators facilitate performing various arithmetic and relational operations.
Here, we are going to execute such operations on the records.
Let's create a table and load the data into it using the following steps:
hive> create table employee (Id int, Name string, Salary float);
Now, load the data into the table.
hive> load data local inpath '/home/codegyani/hive/emp_data' into table employee;
Operator | Description
A / B | Divides A by B and returns the quotient of the operands.
Let's see an example to find out the 10% salary of each employee.
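A sketch of such a query against the employee table loaded above (the alias ten_percent is an illustrative name):

```sql
-- Arithmetic operators in the SELECT list: 10% of each salary
SELECT Id, Name, (Salary * 10) / 100 AS ten_percent FROM employee;
```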
HIVEQL Functions:
count(*) - It returns the count of the number of rows present in the file.
min(col) - It compares the values and returns the minimum one from them.
max(col) - It compares the values and returns the maximum one from them.
Suppose the employee_data table contains the following records:
+------+--------------+-------------+-------------------+--------+
| ID   | Name         | Salary      | Designation       | Dept   |
+------+--------------+-------------+-------------------+--------+
|1201  | Gopal        | 45000       | Technical manager | TP     |
|1202  | Manisha      | 45000       | Proofreader       | PR     |
|1203  | Masthanvali  | 40000       | Technical writer  | TP     |
|1204  | Krian        | 40000       | Hr Admin          | HR     |
|1205  | Kranthi      | 30000       | Op Admin          | Admin  |
+------+--------------+-------------+-------------------+--------+
Example:
hive> select max(Salary) from employee_data;
The query returns 45000, the maximum salary in the table.
hive> SELECT Id, Name, Salary, Designation, Dept FROM employee ORDER BY Dept;
+------+--------------+-------------+-------------------+--------+
| ID   | Name         | Salary      | Designation       | Dept   |
+------+--------------+-------------+-------------------+--------+
|1205 | Kranthi | 30000 | Op Admin | Admin |
|1204 | Krian | 40000 | Hr Admin | HR |
|1202 | Manisha | 45000 | Proofreader | PR |
|1201 | Gopal | 45000 | Technical manager | TP |
|1203 | Masthanvali | 40000 | Technical writer | TP |
+------+--------------+-------------+-------------------+--------+
hive> SELECT Dept, count(*) FROM employee GROUP BY Dept;
+-------+----------+
| Dept  | Count(*) |
+-------+----------+
| Admin | 1        |
| HR    | 1        |
| PR    | 1        |
| TP    | 2        |
+-------+----------+
HAVING CLAUSE
The HiveQL HAVING clause is used with the GROUP BY clause. Its purpose is to apply
constraints on the groups of data produced by the GROUP BY clause, so it returns only
the groups for which the condition is TRUE.
Example:
hive> SELECT Dept, sum(Salary) FROM employee GROUP BY Dept
having sum(Salary) >= 45000;
Output:
+-------+-------------+
| Dept  | sum(Salary) |
+-------+-------------+
| TP    | 85000       |
| PR    | 45000       |
+-------+-------------+
Without the HAVING clause, the query would also return HR (40000) and Admin (30000);
the HAVING condition filters those groups out.
Limitations of Hive
o Hive is not capable of handling real-time data.
o It is not designed for online transaction processing.
o Hive queries have high latency.