Hive Data Types and Data Models
HIVE
Hive is a data warehouse infrastructure tool to process structured data in
Hadoop. It resides on top of Hadoop to summarize Big Data, and makes
querying and analyzing easy.
Introduction
The term 'Big Data' is used for collections of large datasets that include
huge volume, high velocity, and a variety of data that is increasing day by
day. Using traditional data management systems, it is difficult to process
Big Data. Therefore, the Apache Software Foundation introduced a
framework called Hadoop to solve Big Data management and processing
challenges.
Hadoop
Hadoop is an open-source framework to store and process Big Data in a
distributed environment. It contains two modules: MapReduce and the
Hadoop Distributed File System (HDFS). Data in Hadoop can be processed
using the following approaches:
The traditional approach: writing Java MapReduce programs for structured,
semi-structured, and unstructured data.
The scripting approach: using Pig to process structured and semi-structured
data with MapReduce.
The Hive Query Language (HiveQL or HQL): using Hive to process structured
data with MapReduce.
What is Hive
Hive is a data warehouse infrastructure tool to process structured data in
Hadoop. It resides on top of Hadoop to summarize Big Data, and makes
querying and analyzing easy.
Hive is not
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive
It stores schema in a database and processed data in HDFS.
It is designed for OLAP (OnLine Analytical Processing).
Architecture of Hive
The following component diagram depicts the architecture of Hive:
HiveQL Process Engine: HiveQL is similar to SQL for querying on schema
information in the Metastore. It is one of the replacements for the
traditional MapReduce approach: instead of writing a MapReduce program in
Java, we can write a HiveQL query and have it processed as a MapReduce job.
HDFS or HBASE: The Hadoop Distributed File System or HBASE is the data
storage technique used to store data in the file system.
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
Hive - Data Types
This chapter takes you through the different data types in Hive, which are
involved in the table creation. All the data types in Hive are classified into
four types, given as follows:
Column Types
Literals
Null Values
Complex Types
Column Types
Column types are used as the column data types of Hive. They are as follows:
Integral Types
Integer type data can be specified using the integral data types. INT is the
default; when the data range exceeds the range of INT, use BIGINT, and if
the data range is smaller than INT, use SMALLINT. TINYINT is smaller than
SMALLINT. Each type has a literal postfix:
Type       Postfix   Example
TINYINT    Y         10Y
SMALLINT   S         10S
INT        -         10
BIGINT     L         10L
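A minimal HiveQL sketch of these types (the table and column names are hypothetical):

```sql
-- Hypothetical table using each integral type
CREATE TABLE readings (
  flag   TINYINT,   -- 1-byte signed integer
  port   SMALLINT,  -- 2-byte signed integer
  hits   INT,       -- 4-byte signed integer (the default)
  total  BIGINT     -- 8-byte signed integer
);

-- The postfixes mark the type of a literal explicitly
SELECT 10Y, 10S, 10, 10L FROM readings;
```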
String Types
String type data can be specified using single quotes (' ') or double
quotes (" "). Hive contains two string data types, VARCHAR and CHAR, and
follows C-style escape characters.
Data Type   Length
VARCHAR     1 to 65535
CHAR        255
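A short sketch of the two string types in a table definition (names hypothetical):

```sql
CREATE TABLE contacts (
  name    VARCHAR(100),  -- variable length, maximum 65535
  country CHAR(2)        -- fixed length, padded with spaces
);
```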
Timestamp
It supports the traditional UNIX timestamp with optional nanosecond precision.
It supports the java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff"
and the format "yyyy-mm-dd hh:mm:ss.ffffffffff".
Dates
DATE values are described in year/month/day format in the form YYYY-MM-DD.
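As an illustrative sketch, string literals in these formats can be cast to the corresponding types:

```sql
SELECT CAST('2014-02-27' AS DATE),
       CAST('2014-02-27 10:30:45.123' AS TIMESTAMP);
```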
Decimals
The DECIMAL type in Hive is the same as the BigDecimal format of Java. It is
used for representing immutable arbitrary-precision numbers. The syntax and
an example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
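A brief sketch of a DECIMAL column (the table name is hypothetical):

```sql
-- DECIMAL(10,0): up to 10 total digits, none after the decimal point
CREATE TABLE accounts (
  balance DECIMAL(10,0)
);
```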
Union Types
Union is a collection of heterogeneous data types. You can create an
instance using create_union. The syntax and example are as follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
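Values like those above come from a column declared with UNIONTYPE; a minimal sketch, with a hypothetical table name:

```sql
CREATE TABLE union_demo (
  col1 UNIONTYPE<INT, DOUBLE, ARRAY<STRING>, STRUCT<a:INT, b:STRING>>
);
-- Each stored value carries a tag (0 to 3 here) that selects the member
-- type, which is why query results print as {tag:value} pairs.
```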
Literals
The following literals are used in Hive:
Decimal Type
Decimal type data is a floating point value with a higher range than the
DOUBLE data type. The range of the decimal type is approximately
-10^308 to 10^308.
Null Value
Missing values are represented by the special value NULL.
Complex Types
The Hive complex data types are as follows:
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>
Structs
Structs in Hive group together named fields of possibly different data
types, each with an optional comment.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
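A combined sketch of the three complex types in one table definition (names hypothetical):

```sql
CREATE TABLE employee_profiles (
  name       STRING,
  skills     ARRAY<STRING>,                      -- ordered list of values
  phone_book MAP<STRING, STRING>,                -- key-value pairs
  address    STRUCT<street:STRING, city:STRING>  -- named fields
);
```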
Hive - Create Database
Create Database Statement
The syntax for this statement is as follows:
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
Here, IF NOT EXISTS is an optional clause, which notifies the user that a
database with the same name already exists. We can use SCHEMA in place
of DATABASE in this command. The following query is executed to create a
database named userdb:
hive> CREATE DATABASE [IF NOT EXISTS] userdb;
or
hive> CREATE SCHEMA userdb;
JDBC Program
The JDBC program to create a database is given below.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveCreateDb {
   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // register the Hive JDBC driver
      Class.forName(driverName);
      // get connection
      Connection con =
      DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
      Statement stmt = con.createStatement();
      stmt.executeQuery("CREATE DATABASE userdb");
      System.out.println("Database userdb created successfully.");
      con.close();
   }
}
Save the program in a file named HiveCreateDb.java. Use the following
commands to compile and execute this program.
$ javac HiveCreateDb.java
$ java HiveCreateDb
Output:
Database userdb created successfully.
Hive - Drop Database
Drop Database Statement
Drop Database is a statement that drops all the tables and deletes the
database. Its syntax is as follows:
DROP DATABASE [IF EXISTS] database_name [RESTRICT|CASCADE];
The following query is used to drop a database. Let us assume that the
database name is userdb.
hive> DROP DATABASE IF EXISTS userdb;
The following query drops the database using CASCADE. It means dropping
the respective tables before dropping the database.
hive> DROP DATABASE IF EXISTS userdb CASCADE;
JDBC Program
The JDBC program to drop a database is given below.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveDropDb {
   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // register the Hive JDBC driver
      Class.forName(driverName);
      // get connection
      Connection con =
      DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
      Statement stmt = con.createStatement();
      stmt.executeQuery("DROP DATABASE userdb");
      System.out.println("Drop userdb database successful.");
      con.close();
   }
}
Save the program in a file named HiveDropDb.java. Given below are the
commands to compile and execute this program.
$ javac HiveDropDb.java
$ java HiveDropDb
Output:
Drop userdb database successful.
Hive - Create Table
Create Table Statement
Syntax
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]
Example
Let us assume you need to create a table named employee using CREATE
TABLE statement. The following table lists the fields and their data types in
employee table:
Sr.No   Field Name    Data Type
1       Eid           int
2       Name          String
3       Salary        Float
4       Designation   String
The following query creates a table named employee using the above data.
hive> CREATE TABLE IF NOT EXISTS employee ( eid int, name String,
salary Float, designation String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
If you add the option IF NOT EXISTS, Hive ignores the statement in case
the table already exists.
OK
Time taken: 5.905 seconds
hive>
JDBC Program
The JDBC program to create a table is given below.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveCreateTable {
   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // register the Hive JDBC driver
      Class.forName(driverName);
      // get connection
      Connection con =
      DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");
      // create statement
      Statement stmt = con.createStatement();
      // execute statement
      stmt.executeQuery("CREATE TABLE IF NOT EXISTS "
         + " employee ( eid int, name String, "
         + " salary Float, designation String)"
         + " COMMENT 'Employee details'"
         + " ROW FORMAT DELIMITED"
         + " FIELDS TERMINATED BY '\t'"
         + " LINES TERMINATED BY '\n'"
         + " STORED AS TEXTFILE");
      System.out.println("Table employee created.");
      con.close();
   }
}
Save the program in a file named HiveCreateTable.java. Use the following
commands to compile and execute this program.
$ javac HiveCreateTable.java
$ java HiveCreateTable
Output
Table employee created.
While inserting data into Hive, it is better to use LOAD DATA to store bulk
records. There are two ways to load data: one is from local file system and
second is from Hadoop file system.
Syntax
The syntax for the load data statement is as follows:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2, ...)]
Here, LOCAL is an identifier to specify the local path, OVERWRITE is
optional to overwrite the data in the table, and PARTITION is optional.
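As a sketch of the optional PARTITION clause (the table, file, and partition names are hypothetical):

```sql
-- Load a file into a single partition of a partitioned table
LOAD DATA LOCAL INPATH '/home/user/sales_2014.txt'
OVERWRITE INTO TABLE sales PARTITION (year=2014);
```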
Example
We will insert data from a text file named sample.txt in the /home/user
directory into the table. The following query loads the given text into
the table:
hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt' OVERWRITE INTO TABLE employee;
OK
Time taken: 15.905 seconds
hive>
JDBC Program
Given below is the JDBC program to load given data into the table.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveLoadData {
   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // register the Hive JDBC driver
      Class.forName(driverName);
      // get connection
      Connection con =
      DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");
      // create statement
      Statement stmt = con.createStatement();
      // execute statement
      stmt.executeQuery("LOAD DATA LOCAL INPATH '/home/user/sample.txt' "
         + "OVERWRITE INTO TABLE employee");
      System.out.println("Load Data into employee successful");
      con.close();
   }
}
Save the program in a file named HiveLoadData.java. Use the following
commands to compile and execute this program.
$ javac HiveLoadData.java
$ java HiveLoadData
Output:
Load Data into employee successful
Hive - Alter Table
Alter Table Statement
Syntax
The statement takes any of the following syntaxes based on what attributes
we wish to modify in a table.
ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])
Rename To… Statement
The following query renames the table from employee to emp:
hive> ALTER TABLE employee RENAME TO emp;
JDBC Program
The JDBC program to rename a table is as follows.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveAlterRenameTo {
   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // register the Hive JDBC driver
      Class.forName(driverName);
      // get connection
      Connection con =
      DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");
      // create statement
      Statement stmt = con.createStatement();
      // execute statement
      stmt.executeQuery("ALTER TABLE employee RENAME TO emp");
      System.out.println("Table Renamed Successfully");
      con.close();
   }
}
Save the program in a file named HiveAlterRenameTo.java. Use the following
commands to compile and execute this program.
$ javac HiveAlterRenameTo.java
$ java HiveAlterRenameTo
Output:
Table renamed successfully.
Change Statement
The following table contains the fields of the employee table and shows
the fields to be changed (name and salary):
Field Name    Convert from Data Type   Change Field Name   Convert to Data Type
eid           int                      eid                 int
name          String                   ename               String
salary        Float                    salary              Double
designation   String                   designation         String
The following queries rename the column name and change the column data
type using the above data:
hive> ALTER TABLE employee CHANGE name ename String;
hive> ALTER TABLE employee CHANGE salary salary Double;
JDBC Program
Given below is the JDBC program to change a column.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveAlterChangeColumn {
   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // register the Hive JDBC driver
      Class.forName(driverName);
      // get connection
      Connection con =
      DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");
      // create statement
      Statement stmt = con.createStatement();
      // execute statement
      stmt.executeQuery("ALTER TABLE employee CHANGE name ename String");
      stmt.executeQuery("ALTER TABLE employee CHANGE salary salary Double");
      System.out.println("Change column successful.");
      con.close();
   }
}
Save the program in a file named HiveAlterChangeColumn.java. Use the
following commands to compile and execute this program.
$ javac HiveAlterChangeColumn.java
$ java HiveAlterChangeColumn
Output:
Change column successful.
Add Columns Statement
The following query adds a column named dept to the employee table:
hive> ALTER TABLE employee ADD COLUMNS ( dept STRING COMMENT 'Department name');
JDBC Program
The JDBC program to add a column to a table is given below.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveAlterAddColumn {
   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // register the Hive JDBC driver
      Class.forName(driverName);
      // get connection
      Connection con =
      DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");
      // create statement
      Statement stmt = con.createStatement();
      // execute statement
      stmt.executeQuery("ALTER TABLE employee ADD COLUMNS "
         + " (dept STRING COMMENT 'Department name')");
      System.out.println("Add column successful.");
      con.close();
   }
}
Save the program in a file named HiveAlterAddColumn.java. Use the following
commands to compile and execute this program.
$ javac HiveAlterAddColumn.java
$ java HiveAlterAddColumn
Output:
Add column successful.
Replace Statement
The following query deletes all the columns from the employee table and
replaces them with the empid and name columns:
hive> ALTER TABLE employee REPLACE COLUMNS ( empid Int, name String);
JDBC Program
Given below is the JDBC program to replace eid column
with empid andename column with name.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveAlterReplaceColumn {
   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // register the Hive JDBC driver
      Class.forName(driverName);
      // get connection
      Connection con =
      DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");
      // create statement
      Statement stmt = con.createStatement();
      // execute statement
      stmt.executeQuery("ALTER TABLE employee REPLACE COLUMNS "
         + " (empid Int, name String)");
      System.out.println("Replace column successful.");
      con.close();
   }
}
Save the program in a file named HiveAlterReplaceColumn.java. Use the
following commands to compile and execute this program.
$ javac HiveAlterReplaceColumn.java
$ java HiveAlterReplaceColumn
Output:
Replace column successful.
Hive - Drop Table
Drop Table Statement
The syntax is as follows:
DROP TABLE [IF EXISTS] table_name;
The following query drops a table named employee:
hive> DROP TABLE IF EXISTS employee;
On successful execution of the query, you get to see the following response:
OK
Time taken: 5.3 seconds
hive>
JDBC Program
The following JDBC program drops the employee table.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveDropTable {
   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // register the Hive JDBC driver
      Class.forName(driverName);
      // get connection
      Connection con =
      DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");
      // create statement
      Statement stmt = con.createStatement();
      // execute statement
      stmt.executeQuery("DROP TABLE IF EXISTS employee");
      System.out.println("Drop table successful.");
      con.close();
   }
}
Save the program in a file named HiveDropTable.java. Use the following
commands to compile and execute this program.
$ javac HiveDropTable.java
$ java HiveDropTable
Output:
Drop table successful