Hive Tutorial
Apache Hive is a data warehouse system for Hadoop that runs SQL-like queries called
HQL (Hive Query Language), which are internally converted to MapReduce jobs. Hive
was developed by Facebook. It supports Data Definition Language (DDL), Data
Manipulation Language (DML), and user-defined functions (UDFs).
Our Hive tutorial covers all major topics of Apache Hive, including Hive installation, Hive
data types, Hive table partitioning, Hive DDL commands, Hive DML commands, Hive sort
by vs. order by, Hive table joins, and more.
Prerequisite
Before learning Hive, you must have basic knowledge of Hadoop and Java.
Audience
Our Hive tutorial is designed to help beginners and professionals.
Problem
We assure you that you will not find any problem in this Hive tutorial. But if there is any
mistake, please report it through the contact form.
Hive is a data warehouse system used to analyze structured data. It is built on top of
Hadoop and was developed by Facebook.
Hive provides the functionality of reading, writing, and managing large datasets residing
in distributed storage. It runs SQL-like queries called HQL (Hive Query Language), which
are internally converted to MapReduce jobs.
Using Hive, we can skip the traditional approach of writing complex MapReduce
programs. Hive supports Data Definition Language (DDL), Data Manipulation Language
(DML), and User Defined Functions (UDFs).
Features of Hive
The following are the features of Hive:
o Hive is fast and scalable.
o It provides SQL-like queries (i.e., HQL) that are implicitly transformed to
MapReduce or Spark jobs.
o It is capable of analyzing large datasets stored in HDFS.
o It allows different storage types such as plain text, RCFile, and HBase.
o It uses indexing to accelerate queries.
o It can operate on compressed data stored in the Hadoop ecosystem.
o It supports user-defined functions (UDFs), which let users plug in their own processing logic (see the sketch below).
Limitations of Hive
o Hive is not capable of handling real-time data.
o It is not designed for online transaction processing.
o Hive queries have high latency.
Hive Services
The following are the services provided by Hive:
o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute
Hive queries and commands.
o Hive Web User Interface - The Hive Web UI is an alternative to the Hive CLI. It
provides a web-based GUI for executing Hive queries and commands.
o Hive MetaStore - It is a central repository that stores all the structure information
of the various tables and partitions in the warehouse. It also includes metadata
about columns and their types, the serializers and deserializers used to read and
write data, and the corresponding HDFS files where the data is stored.
o Hive Server - It is also referred to as the Apache Thrift Server. It accepts requests
from different clients and forwards them to the Hive Driver.
o Hive Driver - It receives queries from different sources like web UI, CLI, Thrift, and
JDBC/ODBC driver. It transfers the queries to the compiler.
o Hive Compiler - The purpose of the compiler is to parse the query and perform
semantic analysis on the different query blocks and expressions. It converts
HiveQL statements into MapReduce jobs.
o Hive Execution Engine - The optimizer generates the logical plan in the form of a
DAG of MapReduce tasks and HDFS tasks. In the end, the execution engine runs
the incoming tasks in the order of their dependencies.
Hive Data Types
Integer Types
Hive provides four signed integer types of increasing width: TINYINT (1 byte), SMALLINT
(2 bytes), INT (4 bytes), and BIGINT (8 bytes).
Decimal Type
For fractional numbers, Hive provides FLOAT (single precision), DOUBLE (double
precision), and DECIMAL (exact numeric with configurable precision and scale).
Date/Time Types
TIMESTAMP
The TIMESTAMP type stores both a date and a time of day, in the form yyyy-MM-dd
HH:mm:ss, with optional fractional seconds up to nanosecond precision.
DATE
The DATE value is used to specify a particular year, month, and day, in the form
YYYY-MM-DD. However, it does not provide the time of day. The range of the DATE type
lies between 0000-01-01 and 9999-12-31.
String Types
STRING
A string is a sequence of characters. Its values can be enclosed within single quotes (')
or double quotes (").
Varchar
Varchar is a variable-length type. Its declared length, which lies between 1 and 65535,
specifies the maximum number of characters allowed in the character string.
CHAR
Char is a fixed-length type with a maximum length of 255. Values shorter than the
declared length are padded with trailing spaces.
Complex Types
o Struct - It is similar to a C struct or an object, where fields are accessed using "dot" notation. Example: struct('James','Roy')
o Map - It contains key-value tuples, where the fields are accessed using array notation. Example: map('first','James','last','Roy')
o Array - It is a collection of values of the same type that are indexable using zero-based integers. Example: array('James','Roy')
Hive - Create Database
o Initially, we check the default database provided by Hive. To list the existing
databases, use the following command:
hive> show databases;
o Now, create a new database by using the following command:
hive> create database demo;
o Let's verify that the database was created:
hive> show databases;
o Each database must have a unique name. If we try to create two databases with the
same name, Hive generates an error.
o If we want to suppress the error generated by Hive when creating a database with an
existing name, use the following command:
hive> create database if not exists demo;
o Hive also allows assigning properties to a database in the form of key-value pairs.
hive> create database demo2
WITH DBPROPERTIES ('creator' = 'Gaurav Chawla', 'date' = '2019-06-03');
o Now we can check the properties by using the following command:
hive> describe database extended demo2;
Hive - Create Table
In Hive, we can create a table using conventions similar to SQL. Hive offers a great deal
of flexibility in where the data files for tables are stored. It provides two types of
table:
o Internal table
o External table
Internal Table
Internal tables are also called managed tables, as the lifecycle of their data is
controlled by Hive. By default, these tables are stored in a subdirectory under the
directory defined by hive.metastore.warehouse.dir (i.e., /user/hive/warehouse). Internal
tables are not flexible enough to share with other tools like Pig. If we drop an
internal table, Hive deletes both the table schema and the data.
o Let's create an internal table by using the following command:
hive> create table demo.employee (Id int, Name string, Salary float)
row format delimited
fields terminated by ',';
Here, the command also specifies that the fields in the data file are separated by ','.
o Let's see the metadata of the created table by using the following command:
hive> describe demo.employee;
o Let's see what happens when we try to create the existing table again.
In such a case, an exception occurs. If we want to ignore this type of exception, we can
use the if not exists clause while creating the table.
hive> create table if not exists demo.employee (Id int, Name string, Salary float)
row format delimited
fields terminated by ',';
o While creating a table, we can add comments to the columns and can also define table
properties.
hive> create table demo.new_employee (Id int comment 'Employee Id', Name string comment 'Employee Name', Salary float comment 'Employee Salary')
comment 'Table Description'
TBLPROPERTIES ('creator' = 'Gaurav Chawla', 'created_at' = '2019-06-06 11:00:00');
o Let's see the metadata of the created table by using the following command:
hive> describe new_employee;
o Hive allows creating a new table by using the schema of an existing table.
hive> create table if not exists demo.copy_employee
like demo.employee;
Here, the new table copies the schema of the existing table, but not its data.
External Table
An external table allows us to create a table over data that is stored externally.
The external keyword is used to specify an external table, whereas
the location keyword is used to specify the location of the loaded data.
As the table is external, the data is not stored under the Hive warehouse directory.
Therefore, if we drop the table, only the metadata of the table is deleted; the data
still exists.
o First, create a directory on HDFS by using the following command:
hdfs dfs -mkdir /HiveDirectory
o Now, put the data file into the created directory.
hdfs dfs -put hive/emp_details /HiveDirectory
o Let's create an external table using the following command:
hive> create external table emplist (Id int, Name string, Salary float)
row format delimited
fields terminated by ','
location '/HiveDirectory';
o Now, retrieve the data of the external table by using the following command:
hive> select * from emplist;
Hive - Load Data
o Let's load the data of a file into the table by using the following command:
hive> load data local inpath '/home/codegyani/hive/emp_details' into table demo.employee;
o Now, we can use the following command to retrieve the data from the table.
hive> select * from demo.employee;
o If we want to add more data to the current table, execute the load query again with
the new file name.
hive> load data local inpath '/home/codegyani/hive/emp_details1' into table demo.employee;
o Let's check the data of the updated table:
hive> select * from demo.employee;
o In Hive, if we try to load unmatched data (i.e., data in one or more columns doesn't
match the data type of the corresponding table columns), Hive does not throw an
exception. Instead, it stores NULL at the position of each unmatched value.
o Let's add one more file to the current table. This file contains unmatched data.
Here, the third column contains data of string type, while the table expects float type
data. So, this leads to an unmatched data situation.
load data local inpath '/home/codegyani/hive/emp_details2' into table demo.employee;
select * from demo.employee;
Here, we can see the Null values at the position of unmatched data.
Hive - Drop Table
o Let's check the list of existing databases by using the following command:
hive> show databases;
o Now, select the database that contains the table we want to delete by using the
following command:
hive> use demo;
o Let's see the list of tables present in this database:
hive> show tables;
o Now, drop the table by using the following command:
hive> drop table new_employee;
o Let's run show tables again to confirm:
hive> show tables;
As we can see, the table new_employee is no longer present in the list. Hence, the table
was dropped successfully.
Hive - Alter Table
Rename a Table
If we want to change the name of an existing table, we can rename it by using the
following syntax:
Alter table old_table_name rename to new_table_name;
o Let's see the existing tables present in the current database.
o Now, change the name of the table by using the following command:
Alter table emp rename to employee_data;
o Let's check whether the name has changed or not.
Adding column
In Hive, we can add one or more columns to an existing table by using the following
syntax:
Alter table table_name add columns(column_name datatype);
o Let's see the schema of the table.
o Let's see the data of the columns that exist in the table.
o Now, add a new column to the table by using the following command:
Alter table employee_data add columns (age int);
As we didn't add any data to the new column, Hive considers NULL as its value.
Change Column
In Hive, we can rename a column and change its type and position. Here, we are changing
the name of a column by using the following syntax:
Alter table table_name change old_column_name new_column_name datatype;
o Let's see the existing schema of the table.
o Now, change the name of the column by using the following command:
Alter table employee_data change name first_name string;
o Hive also allows replacing the entire column list of a table by using the following
command:
alter table employee_data replace columns( id string, first_name string, age int);
Partitioning in Hive
Since Hadoop is used to handle huge amounts of data, it is always important to use the
most efficient approach to deal with it. Partitioning in Hive is a good example of this:
it divides a table's data into parts based on the values of particular columns, so
queries that filter on those columns only scan the relevant parts. Hive supports two
types of partitioning:
o Static partitioning
o Dynamic partitioning
Static Partitioning
In static or manual partitioning, we must pass the values of the partitioned columns
manually while loading the data into the table. Hence, the data file doesn't contain the
partitioned columns.
o First, select the database in which we want to create the partitioned table.
hive> use test;
o Create the table and provide the partitioned columns by using the following command:
hive> create table student (id int, name string, age int, institute string)
partitioned by (course string)
row format delimited
fields terminated by ',';
o Let's check the schema of the table:
hive> describe student;
o Load the data into the table and pass the values of the partition columns with it by
using the following command:
hive> load data local inpath '/home/codegyani/hive/student_details1' into table student
partition(course = "java");
Here, we are partitioning the students of an institute based on courses.
o Load the data of another file into the same table and pass the values of the partition
columns with it by using the following command:
hive> load data local inpath '/home/codegyani/hive/student_details2' into table student
partition(course = "hadoop");
Now the table student is divided into two partitions, one per course.
o Let's retrieve the entire data of the table by using the following command:
hive> select * from student;
o Now, try to retrieve the data based on the partitioned column by using the following
command:
hive> select * from student where course = "java";
In this case, we are not examining the entire data. Hence, this approach improves query
response time.
o Let's also retrieve the data of the other partition by using the following command:
hive> select * from student where course = "hadoop";
Dynamic Partitioning
In dynamic partitioning, the values of the partitioned columns exist within the table
data itself. So, it is not required to pass the values of the partitioned columns
manually.
o First, select the database in which we want to create the partitioned table.
hive> use show;
o Enable dynamic partitioning by using the following settings (nonstrict mode allows all
partition columns to be determined dynamically):
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
o Create a dummy table to store the data.
hive> create table stud_demo(id int, name string, age int, institute string, course string)
row format delimited
fields terminated by ',';
o Now, load the data into the dummy table:
hive> load data local inpath '/home/codegyani/hive/student_details' into table stud_demo;
o Create a partitioned table by using the following command:
hive> create table student_part (id int, name string, age int, institute string)
partitioned by (course string)
row format delimited
fields terminated by ',';
o Now, insert the data of the dummy table into the partitioned table. Note that the
partition column course comes last in the select list; this is how Hive maps it to the
partition.
hive> insert into student_part
partition(course)
select id, name, age, institute, course
from stud_demo;
o The table student_part is now divided into two partitions, one per course.
o Let's retrieve the entire data of the table by using the following command:
hive> select * from student_part;
o Now, try to retrieve the data based on the partitioned column by using the following
command:
hive> select * from student_part where course = "java";
In this case, we are not examining the entire data. Hence, this approach improves query
response time.
o Let's also retrieve the data of the other partition by using the following command:
hive> select * from student_part where course = "hadoop";
Bucketing in Hive
Bucketing in Hive is a data organizing technique. It is similar to partitioning, with
the added functionality that it divides large datasets into more manageable parts known
as buckets. So, we can use bucketing when implementing partitioning becomes difficult,
for example when a column has too many distinct values. We can also divide partitions
further into buckets.
o First, select the database in which we want to create the table.
hive> use showbucket;
o Create a dummy table to store the data:
hive> create table emp_demo (Id int, Name string, Salary float)
row format delimited
fields terminated by ',';
o Now, load the data into the dummy table:
hive> load data local inpath '/home/codegyani/hive/emp_details' into table emp_demo;
o Enable bucketing by using the following command:
hive> set hive.enforce.bucketing = true;
o Create a bucketed table by using the following command:
hive> create table emp_bucket(Id int, Name string, Salary float)
clustered by (Id) into 3 buckets
row format delimited
fields terminated by ',';
o Now, insert the data of the dummy table into the bucketed table.
hive> insert overwrite table emp_bucket select * from emp_demo;
o Now the data is divided into three buckets, stored as three files under the table's
directory on HDFS (rows are assigned to buckets by hashing the Id column).