Hive Tutorial

Hive tutorial provides basic and advanced concepts of Hive. Our Hive tutorial is designed for beginners and professionals.

Apache Hive is a data warehouse system for Hadoop that runs SQL-like queries, called HQL (Hive Query Language), which are internally converted to MapReduce jobs. Hive was developed by Facebook. It supports Data Definition Language (DDL), Data Manipulation Language (DML), and user-defined functions (UDFs).

Our Hive tutorial covers all major Apache Hive topics, including Hive installation, Hive data types, Hive table partitioning, Hive DDL commands, Hive DML commands, Hive sort by vs. order by, Hive table joins, and more.

Prerequisite
Before learning Hive, you should have a working knowledge of Hadoop and Java.

Audience
Our Hive tutorial is designed to help beginners and professionals.

Problem
We have tried to make this Hive tutorial error-free, but if you do find a mistake, please report it through the contact form.

Hive is a data warehouse system used to analyze structured data. It is built on top of Hadoop and was developed by Facebook.

Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage. It runs SQL-like queries, called HQL (Hive Query Language), which are internally converted to MapReduce jobs.

Using Hive, we can avoid the traditional approach of writing complex MapReduce programs by hand. Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and User-Defined Functions (UDFs).
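As a minimal sketch, a single HQL statement replaces what would otherwise be a hand-written MapReduce program; Hive compiles the group by below into map and reduce stages automatically (the demo.employee table used here is the one created later in this tutorial):

hive> select Salary, count(*) from demo.employee group by Salary;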

Features of Hive
The following are the features of Hive:
o Hive is fast and scalable.
o It provides SQL-like queries (i.e., HQL) that are implicitly transformed to
MapReduce or Spark jobs.
o It is capable of analyzing large datasets stored in HDFS.
o It allows different storage types such as plain text, RCFile, and HBase.
o It uses indexing to accelerate queries.
o It can operate on compressed data stored in the Hadoop ecosystem.
o It supports user-defined functions (UDFs), which let users plug in their own logic (a registration sketch follows this list).
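For example, once a UDF has been compiled and packaged into a JAR, it can be registered and called from HQL. A minimal sketch, where the JAR path and class name are hypothetical:

hive> add jar /path/to/my_udfs.jar;
hive> create temporary function my_upper as 'com.example.udf.MyUpper';
hive> select my_upper(Name) from demo.employee;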

Limitations of Hive
o Hive is not capable of handling real-time data.
o It is not designed for online transaction processing.
o Hive queries have high latency.

Differences between Hive and Pig

Hive                                                Pig

Hive is commonly used by data analysts.             Pig is commonly used by programmers.
It follows SQL-like queries.                        It follows a data-flow language.
It can handle structured data.                      It can handle semi-structured data.
It works on the server side of an HDFS cluster.     It works on the client side of an HDFS cluster.
Hive is comparatively slower than Pig.              Pig is comparatively faster than Hive.


Hive Architecture

The following architecture explains the flow of query submission into Hive.


Hive Client
Hive allows writing applications in various languages, including Java, Python, and C++. It supports different types of clients, such as: -

o Thrift Server - It is a cross-language service provider platform that serves requests from all programming languages that support Thrift.
o JDBC Driver - It is used to establish a connection between Hive and Java applications. The JDBC driver is provided by the class org.apache.hadoop.hive.jdbc.HiveDriver.
o ODBC Driver - It allows applications that support the ODBC protocol to connect to Hive.

Hive Services
The following are the services provided by Hive:-
o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute
Hive queries and commands.
o Hive Web User Interface - The Hive Web UI is an alternative to the Hive CLI. It provides a web-based GUI for executing Hive queries and commands.
o Hive MetaStore - It is a central repository that stores the structure information of the tables and partitions in the warehouse, including column and type metadata, the serializers and deserializers needed to read and write data, and the locations of the corresponding HDFS files where the data is stored.
o Hive Server - It is also referred to as the Apache Thrift Server. It accepts requests from different clients and forwards them to the Hive Driver.
o Hive Driver - It receives queries from different sources such as the Web UI, CLI, Thrift, and JDBC/ODBC drivers, and transfers the queries to the compiler.
o Hive Compiler - The purpose of the compiler is to parse the query and perform semantic analysis on the different query blocks and expressions. It converts HiveQL statements into MapReduce jobs.
o Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG of MapReduce tasks and HDFS tasks. In the end, the execution engine executes the incoming tasks in the order of their dependencies.


Hive Data Types

Hive data types are categorized into numeric types, string types, date/time types, and complex types. A list of Hive data types is given below.

Integer Types

Type       Size                     Range

TINYINT    1-byte signed integer    -128 to 127
SMALLINT   2-byte signed integer    -32,768 to 32,767
INT        4-byte signed integer    -2,147,483,648 to 2,147,483,647
BIGINT     8-byte signed integer    -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

Decimal Types

Type      Size      Description

FLOAT     4-byte    Single-precision floating-point number
DOUBLE    8-byte    Double-precision floating-point number

Date/Time Types
TIMESTAMP

o It supports the traditional UNIX timestamp with optional nanosecond precision.
o An integer numeric value is interpreted as a UNIX timestamp in seconds.
o A floating-point numeric value is interpreted as a UNIX timestamp in seconds with decimal precision.
o A string value follows the java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" (9 decimal places of precision).
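For example (a quick sketch; from_unixtime formats a seconds-since-epoch value as a date/time string):

hive> select cast('2019-06-03 10:15:30.123' as timestamp);
hive> select from_unixtime(1559556930);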

DATES

The Date value is used to specify a particular year, month, and day, in the form YYYY-MM-DD. However, it does not provide the time of day. The range of the Date type lies between 0000-01-01 and 9999-12-31.
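A date value can be produced by casting a correctly formatted string, for example:

hive> select cast('2019-06-03' as date);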


String Types
STRING
The string is a sequence of characters. Its values can be enclosed within single quotes (') or double quotes (").

Varchar

The varchar is a variable-length type whose length can range from 1 to 65535; the declared length specifies the maximum number of characters allowed in the character string.

CHAR

The char is a fixed-length type whose maximum length is fixed at 255.
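As a small sketch, the three string types can be combined in one table definition (the table and columns here are hypothetical):

hive> create table contacts (name string, city varchar(50), state char(2));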

Complex Types

Type     Description                                                         Example

Struct   Similar to a C struct or an object; fields are accessed using      struct('James','Roy')
         "dot" notation.
Map      A collection of key-value tuples; fields are accessed using        map('first','James','last','Roy')
         array notation.
Array    A collection of values of the same type, indexable using           array('James','Roy')
         zero-based integers.
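A minimal sketch of a table that uses all three complex types, and how each is accessed (table and column names are hypothetical):

hive> create table complex_demo (
    emp struct<first:string, last:string>,
    names map<string,string>,
    children array<string>);
hive> select emp.first, names['first'], children[0] from complex_demo;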


Hive - Create Database


In Hive, the database is considered a catalog or namespace of tables. So, we can maintain multiple tables within a database, where a unique name is assigned to each table. Hive also provides a default database named default.

o Initially, we check the default database provided by Hive. To see the list of existing databases, use the following command: -

hive> show databases;

Here, we can see the existence of a default database provided by Hive.

o Let's create a new database by using the following command: -

hive> create database demo;  

So, a new database is created.

o Let's check the existence of a newly created database.

hive> show databases;  
o Each database must have a unique name. If we create two databases with the same name, the following error is generated: -

o If we want to suppress the error generated by Hive when creating a database that already exists, use the if not exists clause: -

hive> create database if not exists demo;

o Hive also allows assigning properties to a database in the form of key-value pairs.

hive> create database demo
    > WITH DBPROPERTIES ('creator' = 'Gaurav Chawla', 'date' = '2019-06-03');

o Let's retrieve the information associated with the database.

hive> describe database extended demo;

Hive - Create Table
In Hive, we can create a table using conventions similar to SQL. It supports a wide range of flexibility in where the data files for tables are stored. It provides two types of tables: -

o Internal table
o External table

Internal Table
The internal tables are also called managed tables, as the lifecycle of their data is controlled by Hive. By default, these tables are stored in a subdirectory under the directory defined by hive.metastore.warehouse.dir (i.e., /user/hive/warehouse). The internal tables are not flexible enough to share with other tools like Pig. If we drop an internal table, Hive deletes both the table schema and the data.

o Let's create an internal table by using the following command:-

hive> create table demo.employee (Id int, Name string, Salary float)
row format delimited
fields terminated by ',';

Here, the command also specifies that the fields in the data are separated by ','.

o Let's see the metadata of the created table by using the following command:-

hive> describe demo.employee;
o Let's see the result when we try to create the existing table again.

In such a case, an exception occurs. If we want to avoid this type of exception, we can use the if not exists clause while creating the table.


hive> create table if not exists demo.employee (Id int, Name string, Salary float)
row format delimited
fields terminated by ',';
o While creating a table, we can add comments to the columns and can also define table properties.

hive> create table demo.new_employee (Id int comment 'Employee Id', Name string comment 'Employee Name', Salary float comment 'Employee Salary')
comment 'Table Description'
TBLPROPERTIES ('creator'='Gaurav Chawla', 'created_at' = '2019-06-06 11:00:00');

o Let's see the metadata of the created table by using the following command: -

hive> describe new_employee;

o Hive allows creating a new table by using the schema of an existing table.

hive> create table if not exists demo.copy_employee
like demo.employee;
Here, the new table is a schema-only copy of the existing table; no data is copied.

External Table
The external table allows us to create a table and access data stored outside the Hive warehouse. The external keyword is used to specify an external table, whereas the location keyword is used to determine the location of the loaded data.

As the table is external, the data is not stored in the Hive warehouse directory. Therefore, if we try to drop the table, only the metadata of the table is deleted; the data still exists.

To create an external table, follow the steps below: -

o Let's create a directory on HDFS by using the following command: -

hdfs dfs -mkdir /HiveDirectory

o Now, store the file in the created directory.

hdfs dfs -put hive/emp_details /HiveDirectory

o Let's create an external table using the following command: -

hive> create external table emplist (Id int, Name string, Salary float)
row format delimited
fields terminated by ','
location '/HiveDirectory';

o Now, we can use the following command to retrieve the data: -

select * from emplist;
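Because emplist is an external table, dropping it later removes only its definition from the metastore; the files under /HiveDirectory remain in HDFS:

hive> drop table emplist;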


Hive - Load Data


Once the internal table has been created, the next step is to load data into it. In Hive, we can easily load data from a file into a table.

o Let's load the data of the file into the table by using the following command: -

load data local inpath '/home/codegyani/hive/emp_details' into table demo.employee;

Here, emp_details is the file name that contains the data.

o Now, we can use the following command to retrieve the data from the table.

select * from demo.employee;

o If we want to add more data to the current table, execute the same query again with the new file name.

load data local inpath '/home/codegyani/hive/emp_details1' into table demo.employee;
o Let's check the data of an updated table: -

o In Hive, if we try to load mismatched data (i.e., data in one or more columns doesn't match the data type of the corresponding table columns), Hive will not throw an exception. Instead, it stores NULL at the positions of the mismatched values.
o Let's add one more file to the current table. This file contains mismatched data: the third column contains string data, while the table expects float data, so this produces a mismatched-data situation.

o Now, load the data into the table.

load data local inpath '/home/codegyani/hive/emp_details2' into table demo.employee;

Here, the data is loaded successfully.

o Let's fetch the records of the table.

select * from demo.employee;

Here, we can see NULL values at the positions of the mismatched data.


Hive - Drop Table


Hive allows us to drop a table by using the SQL drop table command. Let's follow the steps below to drop a table from the database.

o Let's check the list of existing databases by using the following command: -

hive> show databases;

o Now select the database from which we want to delete the table by using the following command: -

hive> use demo;

o Let's check the list of existing tables in the corresponding database.

hive> show tables;

o Now, drop the table by using the following command: -

hive> drop table new_employee;

o Let's check whether the table is dropped or not.

hive> show tables;

As we can see, the table new_employee is not present in the list. Hence, the table has been dropped successfully.

Hive - Alter Table


In Hive, we can modify an existing table by changing the table name, column names, comments, and table properties. Hive provides SQL-like commands to alter a table.

Rename a Table
If we want to change the name of an existing table, we can rename that table by using
the following signature: -

Alter table old_table_name rename to new_table_name;
o Let's see the existing tables present in the current database.

o Now, change the name of the table by using the following command: -

Alter table emp rename to employee_data;
o Let's check whether the name has changed or not.

Here, we got the desired output.

Adding column
In Hive, we can add one or more columns in an existing table by using the following
signature: -

Alter table table_name add columns(column_name datatype);
o Let's see the schema of the table.
o Let's see the existing data of the table.

o Now, add a new column to the table by using the following command: -

Alter table employee_data add columns (age int);

o Let's see the updated schema of the table.


o Let's see the updated data of the table.

As we didn't add any data to the new column, Hive considers NULL as its value.

Change Column
In Hive, we can rename a column, change its type and position. Here, we are changing
the name of the column by using the following signature: -

Alter table table_name change old_column_name new_column_name datatype;
o Let's see the existing schema of the table.
o Now, change the name of the column by using the following command: -

Alter table employee_data change name first_name string;

o Let's check whether the column name has changed or not.

Delete or Replace Column


Hive allows us to delete one or more columns by replacing them with new columns. Thus, we cannot drop a column directly.
o Let's see the existing schema of the table.

o Now, drop a column from the table.

alter table employee_data replace columns( id string, first_name string, age int);

o Let's check whether the column has dropped or not.

Here, we got the desired output.


Partitioning in Hive
Partitioning in Hive means dividing a table into parts based on the values of a particular column, such as date, course, city, or country. The advantage of partitioning is that, since the data is stored in slices, queries that filter on the partition column respond faster.

As Hadoop is used to handle huge amounts of data, it is always important to use the best approach to deal with it. Partitioning in Hive is a prime example of this.
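Under the hood, each partition becomes its own subdirectory of the table's directory in HDFS, so Hive reads only the directories a query needs. A sketch of the layout for the student table used below (assuming the default warehouse path):

/user/hive/warehouse/student/course=java/
/user/hive/warehouse/student/course=hadoop/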

Let's assume we have data for 10 million students studying at an institute, and we have to fetch the students of a particular course. With a traditional approach, we would have to scan the entire dataset, which degrades performance. Instead, we can adopt a better approach, i.e., partitioning in Hive, and divide the data into different datasets based on particular columns.

The partitioning in Hive can be executed in two ways -

o Static partitioning
o Dynamic partitioning

Static Partitioning
In static or manual partitioning, we must pass the values of the partitioned columns manually while loading the data into the table. Hence, the data file doesn't contain the partitioned columns.

Example of Static Partitioning

o First, select the database in which we want to create a table.

hive> use test;
o Create the table and provide the partitioned columns by using the following command: -

hive> create table student (id int, name string, age int, institute string)
partitioned by (course string)
row format delimited
fields terminated by ',';

o Let's retrieve the information associated with the table.

hive> describe student;

o Load the data into the table and pass the values of partition columns with it by using the
following command: -

hive> load data local inpath '/home/codegyani/hive/student_details1' into table student
partition(course = "java");
Here, we are partitioning the students of an institute based on courses.

o Load the data of another file into the same table and pass the values of partition
columns with it by using the following command: -

hive> load data local inpath '/home/codegyani/hive/student_details2' into table student
partition(course = "hadoop");

Here, we can see that the table student is divided into two partitions, one per course value.
o Let's retrieve the entire data of the table by using the following command: -

hive> select * from student;
o Now, try to retrieve the data based on partitioned columns by using the following
command: -

hive> select * from student where course = "java";

In this case, we are not examining the entire data. Hence, this approach improves query
response time.
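To confirm that only one partition is scanned, we can inspect the query plan; the exact output varies by Hive version:

hive> explain select * from student where course = "java";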
o Let's also retrieve the data of another partitioned dataset by using the following
command: -

hive> select * from student where course = "hadoop";


Dynamic Partitioning
In dynamic partitioning, the values of partitioned columns exist within the table. So, it is
not required to pass the values of partitioned columns manually.

o First, select the database in which we want to create a table.

hive> use show;

o Enable the dynamic partition by using the following commands: -

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
o Create a dummy table to store the data.
hive> create table stud_demo (id int, name string, age int, institute string, course string)
row format delimited
fields terminated by ',';

o Now, load the data into the table.

hive> load data local inpath '/home/codegyani/hive/student_details' into table stud_demo;

o Create a partition table by using the following command: -

hive> create table student_part (id int, name string, age int, institute string)
partitioned by (course string)
row format delimited
fields terminated by ',';
o Now, insert the data of dummy table into the partition table.

hive> insert into student_part
partition(course)
select id, name, age, institute, course
from stud_demo;
o Here, we can see that the table student_part is divided into two partitions, one per course value.
o Let's retrieve the entire data of the table by using the following command: -

hive> select * from student_part;
o Now, try to retrieve the data based on partitioned columns by using the following
command: -

hive> select * from student_part where course = "java";

In this case, we are not examining the entire data. Hence, this approach improves query
response time.

o Let's also retrieve the data of another partitioned dataset by using the following
command: -

hive> select * from student_part where course = "hadoop";

Bucketing in Hive
Bucketing in Hive is a data-organizing technique. It is similar to partitioning, with the added functionality that it divides large datasets into more manageable parts known as buckets. So, we can use bucketing in Hive when partitioning alone becomes impractical. We can also divide partitions further into buckets.

Working of Bucketing in Hive

o The concept of bucketing is based on the hashing technique.
o Here, the modulus of the current column value and the number of required buckets is calculated (say, hash(x) % 3).
o Now, based on the resulting value, the data is stored in the corresponding bucket.
Example of Bucketing in Hive
o First, select the database in which we want to create a table.

hive> use showbucket;

o Create a dummy table to store the data.

hive> create table emp_demo (Id int, Name string, Salary float)
row format delimited
fields terminated by ',';

o Now, load the data into the table.

hive> load data local inpath '/home/codegyani/hive/emp_details' into table emp_demo;
o Enable the bucketing by using the following command: -

hive> set hive.enforce.bucketing = true;
o Create a bucketing table by using the following command: -

hive> create table emp_bucket (Id int, Name string, Salary float)
clustered by (Id) into 3 buckets
row format delimited
fields terminated by ',';

o Now, insert the data of dummy table into the bucketed table.

hive> insert overwrite table emp_bucket select * from emp_demo;
o Here, we can see that the data is divided into three buckets.
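An individual bucket can also be sampled directly from HQL with TABLESAMPLE; note that its bucket numbering is 1-based, so bucket 1 below corresponds to bucket 0 on disk:

hive> select * from emp_bucket tablesample(bucket 1 out of 3 on Id);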

o Let's retrieve the data of bucket 0.


According to the hash function:
6 % 3 = 0
3 % 3 = 0
So, these rows are stored in bucket 0.

o Let's retrieve the data of bucket 1.

According to the hash function:
7 % 3 = 1
4 % 3 = 1
1 % 3 = 1
So, these rows are stored in bucket 1.

o Let's retrieve the data of bucket 2.

According to the hash function:
8 % 3 = 2
5 % 3 = 2
2 % 3 = 2
So, these rows are stored in bucket 2.
