
UNIT 5

Hive, data types and file formats, HiveQL data definition, HiveQL data manipulation, HiveQL
queries. Case study on analysing different phases of data analytics.


HIVE
The term 'Big Data' refers to collections of large datasets characterized
by huge volume, high velocity, and a wide variety of data, all of which
grow day by day. It is difficult to process Big Data using traditional
data management systems, so the Apache Software Foundation introduced a
framework called Hadoop to solve Big Data management and processing
challenges.
Hadoop

Hadoop is an open-source framework to store and process Big Data in a
distributed environment. It contains two modules: MapReduce and the Hadoop
Distributed File System (HDFS).
 MapReduce: A parallel programming model for processing large amounts
of structured, semi-structured, and unstructured data on large clusters
of commodity hardware.
 HDFS: The Hadoop Distributed File System is part of the Hadoop
framework, used to store the datasets. It provides a fault-tolerant file
system that runs on commodity hardware.
The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop,
Pig, and Hive that support the core Hadoop modules.
 Sqoop: Used to import and export data between HDFS and relational
databases (RDBMS).
 Pig: A procedural-language platform used to develop scripts for
MapReduce operations.
 Hive: A platform used to develop SQL-type scripts for MapReduce
operations.

Note: There are various ways to execute MapReduce operations:

 The traditional approach, using a Java MapReduce program for structured,
semi-structured, and unstructured data.
 The scripting approach, using Pig to process structured and
semi-structured data.
 The Hive Query Language (HiveQL or HQL), using Hive to process
structured data.

What is Hive

Hive is a data warehouse infrastructure tool for processing structured data
in Hadoop. It resides on top of Hadoop to summarize Big Data, and it makes
querying and analysis easy.
Hive was initially developed by Facebook; later, the Apache Software
Foundation took it up and developed it further as open source under the
name Apache Hive.
Features of Hive

 It stores the schema in a database and the processed data in HDFS.
 It is designed for OLAP (Online Analytical Processing).
 It provides an SQL-type query language called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.

Architecture of Hive

The following component diagram depicts the architecture of Hive:

This component diagram contains different units. The following table describes
each unit:

Unit Name           Operation

User Interface      Hive is data warehouse infrastructure software that
                    mediates interaction between the user and HDFS. The
                    user interfaces that Hive supports are the Hive Web UI,
                    the Hive command line, and Hive HDInsight (on Windows
                    Server).

Metastore           Hive uses a database server to store the schema or
                    metadata of tables, databases, columns in a table,
                    their data types, and the HDFS mapping. The metadata is
                    stored in a traditional RDBMS format.

HiveQL Process      HiveQL is similar to SQL and queries the schema
Engine              information held in the Metastore. It is one
                    replacement for the traditional MapReduce programming
                    approach: instead of writing a MapReduce program in
                    Java, we write a query for the MapReduce job and let
                    Hive process it.

Execution Engine    The conjunction of the HiveQL Process Engine and
                    MapReduce is the Hive Execution Engine. The execution
                    engine processes the query and generates the same
                    results as MapReduce, using the MapReduce framework
                    underneath.

HDFS or HBase       HDFS (the Hadoop Distributed File System) or HBase is
                    the data storage technique used to store the data in
                    the file system.

Working of Hive

The following diagram depicts the workflow between Hive and Hadoop.

The following table defines how Hive interacts with Hadoop framework:

Step No.   Operation

1 Execute Query
The Hive interface, such as the Command Line or Web UI, sends the query
to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.

2 Get Plan
The driver takes the help of the query compiler, which parses the query
to check the syntax and build the query plan.

3 Get Metadata
The compiler sends metadata request to Metastore (any database).

4 Send Metadata
Metastore sends metadata as a response to the compiler.

5 Send Plan
The compiler checks the requirement and resends the plan to the
driver. Up to here, the parsing and compiling of a query is complete.

6 Execute Plan
The driver sends the execute plan to the execution engine.

7 Execute Job
Internally, the execution process is a MapReduce job. The execution
engine sends the job to the JobTracker, which resides in the Name node,
and the JobTracker assigns it to a TaskTracker, which resides in a Data
node. Here, the query executes as a MapReduce job.

7.1 Metadata Ops
Meanwhile, during execution, the execution engine can execute metadata
operations with the Metastore.

8 Fetch Result
The execution engine receives the results from Data nodes.

9 Send Results
The execution engine sends those resultant values to the driver.

10 Send Results
The driver sends the results to Hive Interfaces.
DATA TYPES AND FILE FORMATS
Hive provides different data types, which are used in table creation. All
the data types in Hive are classified into four categories, given as
follows:
 Column Types
 Literals
 Null Values
 Complex Types

Column Types

Column types are used as the column data types of Hive tables. They are as follows:

Integral Types
Integer type data can be specified using the integral data types, chiefly
INT. When the data range exceeds the range of INT, you need to use BIGINT,
and if the data range is smaller than that of INT, you use SMALLINT.
TINYINT is smaller than SMALLINT.
The following table depicts various INT data types:

Type Postfix Example

TINYINT Y 10Y

SMALLINT S 10S

INT - 10

BIGINT L 10L

String Types
String type data can be specified using single quotes (' ') or double
quotes (" "). Hive contains two such data types: VARCHAR and CHAR. Hive
follows C-style escape characters.

The following table depicts the CHAR-family data types:

Data Type    Length

VARCHAR      1 to 65535

CHAR         255

Timestamp
It supports the traditional UNIX timestamp with optional nanosecond
precision, in the java.sql.Timestamp format "yyyy-mm-dd hh:mm:ss.fffffffff".

Dates
DATE values are described in year/month/day format, in the form
YYYY-MM-DD.

Decimals
The DECIMAL type in Hive is the same as Java's BigDecimal format. It is
used for representing immutable arbitrary-precision numbers. The syntax and
an example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
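
As an illustrative sketch (the table and column names here are assumptions,
not from the source), several of the column types above can appear together
in one table definition:

CREATE TABLE accounts (
acct_id BIGINT,
owner VARCHAR(50),
branch CHAR(3),
opened DATE,
balance DECIMAL(10,2));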

Union Types
Union is a collection of heterogeneous data types. You can create an
instance using the create_union UDF. The syntax and an example are as
follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}

Literals

The following literals are used in Hive:


Floating Point Types
Floating point types are numbers with decimal points. Generally, this type
of data is represented by the DOUBLE data type.
Decimal Type
Decimal type data is a floating point value with a higher range than the
DOUBLE data type. The range of the decimal type is approximately -10^308
to 10^308.
Null Value

Missing values are represented by the special value NULL.


Complex Types

The Hive complex data types are as follows:


Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>
Structs
Structs in Hive are similar to C structs: a group of named fields, each of which can carry a comment.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
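
A short illustrative sketch (table and column names are hypothetical)
showing how the complex types appear in a table definition and how their
elements are accessed:

CREATE TABLE emp_contacts (
name STRING,
skills ARRAY<STRING>,
phone MAP<STRING, BIGINT>,
address STRUCT<street:STRING, city:STRING, zip:INT>);

-- array index, map key, and struct field access
SELECT name, skills[0], phone['home'], address.city FROM emp_contacts;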

HIVE FILE FORMATS:


Following are the different Apache Hive file formats:
 Text File
 Sequence File
 RC File
 AVRO File
 ORC File
 Parquet File

Hive Text File Format:

The text file format is Hive's default storage format. You can use the
text format to interchange data with other client applications. The text
file format is very common in most applications. Data is stored in lines,
with each line being a record. Each line is terminated by a newline
character (\n).
The text format is a simple plain-text file format. You can apply
compression (e.g., BZIP2) to a text file to reduce storage space.
Create a TEXT file table by adding the storage option 'STORED AS
TEXTFILE' at the end of a Hive CREATE TABLE command.
Syntax:
Create table textfile_table(column_specs)
stored as textfile;

Hive Sequence File Format:

Sequence files are Hadoop flat files that store values as binary key-value
pairs. Sequence files are in binary format, and they are splittable. The
main advantage of using sequence files is the ability to merge two or more
files into one file.
Create a sequence file table by adding the storage option 'STORED AS
SEQUENCEFILE' at the end of a Hive CREATE TABLE command.

Syntax
Create table sequencefile_table (column_specs)
stored as sequencefile;

Hive RC File Format

RCFile (Record Columnar File) is a row-columnar file format. It is another
Hive file format, offering high row-level compression rates. If you have a
requirement to process multiple rows at a time, you can use the RCFile
format. RCFiles are very similar to the sequence file format; this file
format also stores the data as key-value pairs.
Create an RCFile by specifying the 'STORED AS RCFILE' option at the end
of a CREATE TABLE command:
Syntax:
Create table RCfile_table(column_specs)
stored as rcfile;

Hive AVRO File Format


Avro stores the data definition (schema) in JSON format, making it easy to
read and interpret by any program. The data itself is stored in binary
format, making it compact and efficient.
Syntax:
Create table avro_table(column_specs) stored as avro;
Hive ORC File Format

ORC stands for Optimized Row Columnar. The ORC file format provides a
highly efficient way to store data in Hive tables. This file format was
designed to overcome limitations of the other Hive file formats. Using ORC
files improves performance when Hive is reading, writing, and processing
data from large tables.
Syntax:
Create table orc_table(column_specs) stored as orc;

Hive Parquet File Format

Parquet is a column-oriented binary file format. Parquet is highly
efficient for large-scale queries, and it is especially good for queries
that scan particular columns within a table. Parquet tables support Snappy
and gzip compression; Snappy is currently the default.
Create a Parquet table by specifying the 'STORED AS PARQUET' option at
the end of a CREATE TABLE command.
Syntax:
Create table parquet_table(column_specs) stored as parquet;
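
As a hedged illustration (the table names sales_text and sales_parquet are
assumptions), a common pattern is to define a Parquet table and populate it
from an existing text-format table with INSERT ... SELECT:

Create table sales_parquet(item STRING, qty INT, price DECIMAL(8,2))
stored as parquet;

INSERT INTO TABLE sales_parquet
SELECT item, qty, price FROM sales_text;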

HIVEQL DATA DEFINITION


Hive DDL commands are the statements used for defining and changing the
structure of a table or database in Hive. They are used to build or modify
tables and other objects in a database.

The several types of Hive DDL commands are:


1. CREATE
2. SHOW
3. DESCRIBE
4. USE
5. DROP
6. ALTER
7. TRUNCATE

Table-1 Hive DDL commands

DDL Command    Use With

CREATE         Database, Table
SHOW           Databases, Tables, Table Properties, Partitions, Functions, Index
DESCRIBE       Database, Table, View
USE            Database
DROP           Database, Table
ALTER          Database, Table
TRUNCATE       Table

Note that the Hive commands are case-insensitive.

1.Create Database:
The CREATE DATABASE statement is used to create a database in Hive. The
keywords DATABASE and SCHEMA are interchangeable; we can use either one.
Syntax:
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name;
Example :
CREATE DATABASE IF NOT EXISTS bda;
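
CREATE DATABASE also accepts an optional comment and HDFS location; a
minimal sketch (the comment text and path are assumptions):

CREATE DATABASE IF NOT EXISTS bda
COMMENT 'big data analytics unit 5'
LOCATION '/user/hive/warehouse/bda.db';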

2.Show Database:
The SHOW DATABASES statement lists all the databases present in Hive.
Syntax:
SHOW (DATABASES|SCHEMAS);
Example:
SHOW DATABASES;

3.Describe Database:
The DESCRIBE DATABASE statement in Hive shows the name of the database and
its location on the file system.
The EXTENDED keyword can be used to also show the database properties.
Syntax:
DESCRIBE (DATABASE|SCHEMA) [EXTENDED] db_name;
Example
DESCRIBE DATABASE bda;
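
For instance, to also see any DBPROPERTIES set on the database:

DESCRIBE DATABASE EXTENDED bda;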

4.Use Database
The USE statement in Hive is used to select the specific database for a
session on which all subsequent HiveQL statements would be executed.
Syntax:
USE database_name;
Example
USE bda;

5.Drop Database
The DROP DATABASE statement in Hive is used to drop (delete) a database.
The default behavior is RESTRICT, which means that the database is
dropped only when it is empty. To drop a database that contains tables, we
can use CASCADE.
Syntax:
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name
[RESTRICT|CASCADE];

Example:
DROP DATABASE IF EXISTS bda CASCADE;

After dropping, the database can be recreated if required:

CREATE DATABASE bda;
USE bda;
CREATE TABLE mca (name STRING, dob DATE);
6.Alter Database:
The ALTER DATABASE statement in Hive is used to change the metadata
associated with a database.
Syntax:
ALTER (DATABASE|SCHEMA) database_name SET
DBPROPERTIES (property_name=property_value, ...);
Example:
ALTER DATABASE bda SET DBPROPERTIES ('createdfor'='mca');

In this example, we are setting the database properties of the 'bda'
database after its creation by using the ALTER command.

Syntax for changing the database owner:

ALTER (DATABASE|SCHEMA) database_name SET OWNER
[USER|ROLE] user_or_role;
Example:
ALTER DATABASE bda SET OWNER ROLE admin;

In this example, we are changing the owner role of the 'bda' database
using the ALTER statement.

1. CREATE TABLE

The CREATE TABLE statement in Hive is used to create a table with the
given name. If a table or view with the same name already exists, an error
is thrown. We can use IF NOT EXISTS to skip the error.

Syntax:
CREATE TABLE [IF NOT EXISTS] [db_name.]table_name [(col_name
data_type [COMMENT col_comment], ...)]
[COMMENT table_comment] [ROW FORMAT row_format] [STORED AS
file_format] [LOCATION hdfs_path];

Example
CREATE TABLE IF NOT EXISTS Employee(
Emp_ID STRING COMMENT 'this is the employee id',
Emp_designation STRING COMMENT 'this is the employee post');
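
A slightly fuller sketch (the extra column, delimiter, and table name are
assumptions) that also sets a row format and storage format:

CREATE TABLE IF NOT EXISTS Employee_csv(
Emp_ID STRING,
Emp_designation STRING,
Emp_salary FLOAT)
COMMENT 'employee details loaded from a CSV file'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;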

2. SHOW TABLES in Hive

The SHOW TABLES statement in Hive lists all the base tables
and views in the current database.
Syntax:
SHOW TABLES [IN database_name];
Example
SHOW TABLES;
3. DESCRIBE TABLE in Hive

The DESCRIBE statement in Hive shows the list of columns for the
specified table.
Syntax:
DESCRIBE [EXTENDED|FORMATTED] [db_name.]table_name
[.col_name [.field_name]];
Example:
DESCRIBE Employee;
It describes each column's name, data type, and comment.

4. DROP TABLE in Hive

The DROP TABLE statement in Hive deletes the data for a particular table
and removes all metadata associated with it from the Hive metastore.
If PURGE is not specified, the data is moved to the .Trash/Current
directory. If PURGE is specified, the data is lost completely.
Syntax:
DROP TABLE [IF EXISTS] table_name [PURGE];
Example:
DROP TABLE IF EXISTS Employee PURGE;

5. ALTER TABLE in Hive

The ALTER TABLE statement in Hive enables you to change the


structure of an existing table. Using the ALTER TABLE statement we can
rename the table, add columns to the table, change the table properties, etc.

Syntax to Rename a table:


ALTER TABLE table_name RENAME TO new_table_name;
Example
ALTER TABLE Employee RENAME TO Comp_Emp;

In this example, we are renaming the 'Employee' table to 'Comp_Emp'
using the ALTER statement.
Syntax to add columns to a table:
ALTER TABLE table_name ADD COLUMNS (col_name data_type, ...);
Example
ALTER TABLE Comp_Emp ADD COLUMNS (emp_dob STRING,
emp_contact STRING);
In this example, we are adding the two columns 'emp_dob' and
'emp_contact' to the 'Comp_Emp' table using the ALTER command.
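
ALTER TABLE can also rename or retype a single column with the CHANGE
clause; a short sketch (the new column name emp_birthdate is an
assumption):

ALTER TABLE Comp_Emp CHANGE emp_dob emp_birthdate STRING;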

6. TRUNCATE TABLE

TRUNCATE TABLE statement in Hive removes all the rows from the
table or partition.
Syntax:
TRUNCATE TABLE table_name;
Example
TRUNCATE TABLE Comp_Emp;
It removes all the rows in Comp_Emp Table.

HIVEQL DATA MANIPULATION


Hive DML (Data Manipulation Language) commands are used to insert, update,
retrieve, and delete data from a Hive table once the table and database
schema have been defined using Hive DDL commands.
The various Hive DML commands are:
1. LOAD
2. SELECT
3. INSERT
4. DELETE
5. UPDATE
6. EXPORT
7. IMPORT

1.Load Command:
The LOAD statement in Hive is used to move data files into the locations
corresponding to Hive tables.
 If a LOCAL keyword is specified, then the LOAD command will look
for the file path in the local filesystem.
 If the LOCAL keyword is not specified, then Hive will need the
absolute URI of the file.
 In case the keyword OVERWRITE is specified, then the contents of the
target table/partition will be deleted and replaced by the files referred by
filepath.
 If the OVERWRITE keyword is not specified, then the files referred by
filepath will be appended to the table.

Syntax:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO
TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)];
Example:
LOAD DATA LOCAL INPATH '/home/user/dab' INTO TABLE emp_data;
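
For comparison, a load from HDFS with OVERWRITE (the HDFS path here is
hypothetical) replaces the table's existing contents instead of appending:

LOAD DATA INPATH '/user/hive/input/emp_data' OVERWRITE INTO TABLE emp_data;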

2. SELECT COMMAND
The SELECT statement in Hive is similar to the SELECT statement in
SQL used for retrieving data from the database.
Syntax:
SELECT col1,col2 FROM tablename;

Example:

SELECT * FROM emp_data;

It will display all the rows in emp_data

3. INSERT Command
The INSERT command in Hive loads data into a Hive table. We can insert
into either a Hive table or a partition.
a. INSERT INTO

The INSERT INTO statement appends data to the existing data in the table
or partition. The INSERT INTO statement is available from Hive version 0.8.
Syntax:
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1,
partcol2=val2 ...)] select_statement1 FROM from_statement;
Example
CREATE TABLE IF NOT EXISTS example(id STRING, name STRING,
dep STRING, state STRING, salary STRING, year STRING);

Use an INSERT statement to load data into table "example".


Example:
INSERT INTO TABLE example SELECT emp.emp_id, emp.emp_name,
emp.emp_dep, emp.state, emp.salary, emp.year_of_joining FROM emp_data
emp;

SELECT * FROM example;


b. INSERT OVERWRITE

The INSERT OVERWRITE statement overwrites the existing data in the table
or partition.

Example:
Here we are overwriting the existing data of the table ‘example’ with the
data of table ‘dummy’ using INSERT OVERWRITE statement.

INSERT OVERWRITE TABLE example SELECT dmy.enroll, dmy.name,


dmy.department, dmy.salary, dmy.year from dummy dmy;

By using the SELECT statement we can verify whether the existing data of the
table ‘example’ is overwritten by the data of table ‘dummy’ or not.
SELECT * FROM EXAMPLE;
c. INSERT .. VALUES

The INSERT ... VALUES statement in Hive inserts data into a table directly
as literal rows specified in SQL.
Example:
Inserting data into the 'student' table using an INSERT ... VALUES statement.
INSERT INTO TABLE student VALUES (101,'Callen','IT','7.8'),
(103,'joseph','CS','8.2'),
(105,'Alex','IT','7.9');
SELECT * FROM student;

4. DELETE command
The DELETE statement in Hive deletes table data. If a WHERE clause is
specified, it deletes only the rows that satisfy the condition in the
WHERE clause. (Note: DELETE and UPDATE work only on tables that support
ACID transactions.)
Syntax:
DELETE FROM tablename [WHERE expression];
Example:
In the below example, we are deleting the data of the student from table
‘student’ whose roll_no is 105.
DELETE FROM student WHERE roll_no=105;
SELECT * FROM student;
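
Since DELETE and UPDATE require an ACID (transactional) table, a minimal
sketch of such a table definition is shown below; the bucketing column and
bucket count are assumptions for illustration:

CREATE TABLE student (roll_no INT, name STRING, branch STRING, cgpa STRING)
CLUSTERED BY (roll_no) INTO 2 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');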
5. UPDATE Command
The UPDATE statement in Hive updates table data. If a WHERE clause is
specified, it updates the columns of the rows that satisfy the condition
in the WHERE clause.
Syntax:
UPDATE tablename SET column = value [, column = value ...] [WHERE
expression];
Example:
In this example, we are updating the branch of the student whose roll_no is 103
in the ‘student’ table using an UPDATE statement.
UPDATE student SET branch='IT' WHERE roll_no=103;
SELECT * FROM student;

6. EXPORT Command
The Hive EXPORT statement exports the table or partition data along
with the metadata to the specified output location in the HDFS.
Example:
Here in this example, we are exporting the student table to the HDFS
directory “export_from_hive”.
EXPORT TABLE student TO 'export_from_hive';

The table is exported successfully. You can check for the _metadata file
and data sub-directory using the ls command.
7. IMPORT Command
The Hive IMPORT command imports the data from a specified location
to a new table or already existing table.
Example:
Here in this example, we are importing the data exported in the above
example into a new Hive table 'imported_table':
IMPORT TABLE imported_table FROM 'export_from_hive';

Verify whether the data was imported using a Hive SELECT statement:
SELECT * FROM imported_table;

HIVEQL QUERIES.
Like SQL, HiveQL contains:
 HiveQL operators,
 HiveQL functions,
 HiveQL GROUP BY & HAVING,
 HiveQL ORDER BY & SORT BY,
 HiveQL joins.

HiveQL - Operators
The HiveQL operators facilitate performing various arithmetic and
relational operations. Here, we are going to execute such operations on
the records:

Let's create a table and load the data into it by using the following steps:

Select the database in which we want to create a table.

hive> use hql;

Create a Hive table using the following command:

hive> create table employee (Id int, Name string, Salary float);
Now, load the data into the table.
hive> load data local inpath '/home/codegyani/hive/emp_data' into table employee;

Let's fetch the loaded data by using the following command:

hive> select * from employee;

Arithmetic Operators in Hive


In Hive, the arithmetic operators accept any numeric type. The commonly
used arithmetic operators are:

Operators   Description

A+B This is used to add A and B.

A-B This is used to subtract B from A.

A*B This is used to multiply A and B.

A/B This is used to divide A and B and returns the quotient of the
operands.

A%B This returns the remainder of A / B.

A|B This is used to determine the bitwise OR of A and B.

A&B This is used to determine the bitwise AND of A and B.

A^B This is used to determine the bitwise XOR of A and B.


~A This is used to determine the bitwise NOT of A.

Let's see an example to increase the salary of each employee by 50.

hive> select id, name, salary + 50 from employee;

Let's see an example to decrease the salary of each employee by 50.

hive> select id, name, salary - 50 from employee;

Let's see an example to find out the 10% salary of each employee.

hive> select id, name, (salary * 10) /100 from employee;

Relational Operators in Hive

Relational operators (such as =, !=, <, <=, >, >=) compare two operands.
For example, to filter employees by salary:

hive> select * from employee where salary >= 25000;

hive> select * from employee where salary < 25000;

HIVEQL Functions:

Hive provides many mathematical functions, such as
round(num), floor(num), sqrt(num), and abs(num).
hive> select Id, Name, sqrt(Salary) from employee_data;
Aggregate Functions in Hive
An aggregate function returns a single value resulting from a computation
over many rows. Let's see some commonly used aggregate functions:

count(*) - It returns the count of the number of rows present in the file.

sum(col) - It returns the sum of values.

sum(DISTINCT col) - It returns the sum of distinct values.

avg(col) - It returns the average of values.

min(col) - It compares the values and returns the minimum one.
max(col) - It compares the values and returns the maximum one.
Example:
hive> select max(Salary) from employee_data;

hive> select min(Salary) from employee_data;
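
Aggregate functions are often combined in a single query; a small sketch
against the same employee_data table:

hive> select count(*), avg(Salary), sum(Salary) from employee_data;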

HiveQL - GROUP BY and HAVING Clause

The GROUP BY clause groups the records in a result set by one or more
columns and is typically used with aggregate functions. Assume the
employee table contains the following data:

+------+--------------+-------------+-------------------+--------+
| ID   | Name         | Salary      | Designation       | Dept   |
+------+--------------+-------------+-------------------+--------+
|1201  | Gopal        | 45000       | Technical manager | TP     |
|1202  | Manisha      | 45000       | Proofreader       | PR     |
|1203  | Masthanvali  | 40000       | Technical writer  | TP     |
|1204  | Krian        | 40000       | Hr Admin          | HR     |
|1205  | Kranthi      | 30000       | Op Admin          | Admin  |
+------+--------------+-------------+-------------------+--------+
First, ordering the rows by department with ORDER BY:
hive> SELECT * FROM employee ORDER BY Dept;
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1205 | Kranthi | 30000 | Op Admin | Admin |
|1204 | Krian | 40000 | Hr Admin | HR |
|1202 | Manisha | 45000 | Proofreader | PR |
|1201 | Gopal | 45000 | Technical manager | TP |
|1203 | Masthanvali | 40000 | Technical writer | TP |
+------+--------------+-------------+-------------------+--------+
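
ORDER BY produces one totally ordered result set; SORT BY, in contrast,
sorts rows only within each reducer, so the overall output may not be
fully ordered. A minimal sketch:

hive> SELECT * FROM employee SORT BY Dept;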
Now, counting employees per department with GROUP BY:
hive> SELECT Dept, count(*) FROM employee GROUP BY Dept;
+------+--------------+
| Dept | Count(*)     |
+------+--------------+
|Admin | 1            |
|HR    | 1            |
|PR    | 1            |
|TP    | 2            |
+------+--------------+

HAVING CLAUSE
The HQL HAVING clause is used with GROUP BY clause. Its purpose is to apply
constraints on the group of data produced by GROUP BY clause. Thus, it always returns
the data where the condition is TRUE.

hive> select Dept, sum(Salary) from employee group by Dept
having sum(Salary) >= 45000;

Output:
Dept      sum(Salary)
TP        85000
PR        45000

(HR with 40000 and Admin with 30000 are filtered out by the HAVING
condition.)
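
The topic list at the start of this section also mentions joins. A minimal
sketch of a HiveQL inner join, where the department table and its columns
are hypothetical:

hive> SELECT e.Id, e.Name, d.dept_name
FROM employee e JOIN department d ON (e.Dept = d.dept_id);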

Differences between Hive and Pig

Hive                                      Pig

Hive is commonly used by data             Pig is commonly used by
analysts.                                 programmers.

It follows SQL-like queries.              It follows a data-flow language.

It can handle structured data.            It can handle semi-structured data.

It works on the server side of an         It works on the client side of an
HDFS cluster.                             HDFS cluster.

Hive is slower than Pig.                  Pig is comparatively faster than Hive.

Limitations of Hive
o Hive is not capable of handling real-time data.
o It is not designed for online transaction processing.
o Hive queries have high latency.
