HIVE
Hive is a data warehouse system used to analyze structured data. It is built on
top of Hadoop.
Hive provides the functionality of reading, writing, and managing large datasets
residing in distributed storage. It runs SQL-like queries, called HQL (Hive Query
Language), which are internally converted into MapReduce jobs.
Using Hive, we can avoid the traditional approach of writing complex
MapReduce programs. Hive supports Data Definition Language (DDL),
Data Manipulation Language (DML), and User Defined Functions (UDFs).
Initially, Hive was developed by Facebook; later the Apache Software Foundation
took it up and developed it further as an open-source project under the name
Apache Hive. It is used by many companies; for example, Amazon uses it in
Amazon Elastic MapReduce.
Note that Hive is not a relational database.
Features of Hive
The following are the features of Hive:
o Hive is fast and scalable.
o It provides SQL-like queries (i.e., HQL) that are implicitly transformed to
MapReduce or Spark jobs.
o It is capable of analyzing large datasets stored in HDFS.
o It allows different storage types such as plain text, RCFile, and HBase.
o It uses indexing to accelerate queries.
o It can operate on compressed data stored in the Hadoop ecosystem.
o It supports user-defined functions (UDFs), through which users can plug in
their own functionality (see the sketch after this list).
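For instance, a UDF packaged in a JAR can be registered and then invoked from HQL
like any built-in function. A minimal sketch, in which the JAR path, class name,
and table are hypothetical placeholders:

hive> ADD JAR /path/to/my-udfs.jar;   -- hypothetical JAR containing the UDF class
hive> CREATE TEMPORARY FUNCTION to_upper AS 'com.example.hive.ToUpperUDF';  -- hypothetical class
hive> SELECT to_upper(name) FROM employee;   -- apply the UDF like a built-in function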
Limitations of Hive
o Hive is not capable of handling real-time data.
o It is not designed for online transaction processing.
o Hive queries have high latency.
Hive Datatypes
Primitive types
1. Integers: SMALLINT, INT, BIGINT
2. Boolean: BOOLEAN
3. Floating point: FLOAT, DOUBLE
4. String: STRING
Complex types
1. Structs: e.g. {a:int, b:int}
2. Maps: collections of key-value pairs
3. Arrays: e.g. ['a', 'b', 'c']
String Types
String-type data can be specified using single quotes (' ') or double quotes (" ").
Hive provides two character data types, VARCHAR and CHAR, and follows C-style
escape characters.
The following table depicts the character data types:
Data Type    Length
VARCHAR      1 to 65535
CHAR         255
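As an illustration of the difference, the following sketch declares a hypothetical
table with one column of each type (VARCHAR stores up to the declared maximum,
while CHAR pads values with spaces to its fixed length):

hive> CREATE TABLE chars_demo (
code CHAR(10),      -- fixed length, space-padded to 10 characters
label VARCHAR(50)   -- variable length, at most 50 characters
);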
Timestamp
It supports the traditional UNIX timestamp with optional nanosecond precision,
using the java.sql.Timestamp format "yyyy-mm-dd hh:mm:ss.fffffffff".
Dates
DATE values are described in year/month/day format in the form YYYY-MM-DD.
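As a small illustration of both temporal types, string literals can be cast to
DATE and TIMESTAMP (a sketch; on recent Hive versions SELECT works without a
FROM clause, and the literal values are arbitrary):

hive> SELECT CAST('2024-01-15' AS DATE),
CAST('2024-01-15 10:30:00' AS TIMESTAMP);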
Decimals
The DECIMAL type in Hive is the same as Java's BigDecimal format. It is used
for representing immutable arbitrary-precision decimal values. The syntax and an
example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
Union Types
Union is a collection of heterogeneous data types. You can create an instance
using the create_union built-in function. The syntax and an example are as follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
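Union values like the ones above are typically produced with the create_union
built-in function, whose first argument is a tag selecting which member type the
value carries. A minimal sketch, assuming a hypothetical table t with an int
column key and a string column value:

hive> SELECT create_union(0, key, value) FROM t;   -- tag 0: the union carries the int key
hive> SELECT create_union(1, key, value) FROM t;   -- tag 1: the union carries the string value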
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>
Structs
Structs in Hive group together named fields, each of which can carry an optional comment.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
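Putting the three complex types together, the following sketch (with illustrative
table and column names) declares one column of each kind and shows how individual
elements are accessed:

hive> CREATE TABLE employee_complex (
name STRING,
skills ARRAY<STRING>,                     -- e.g. ['java','sql']
phones MAP<STRING, STRING>,               -- e.g. {'home':'555-1234'}
address STRUCT<city:STRING, zip:STRING>   -- named fields
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':';

hive> SELECT skills[0], phones['home'], address.city FROM employee_complex;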
Decimal Type
Decimal type data is nothing but a floating-point value with a higher range than the
DOUBLE data type. The range of the decimal type is approximately -10^-308 to 10^308.
Null Value
Missing values are represented by the special value NULL.
Hive Architecture
User Interface: Hive is data warehouse infrastructure software that can create
interaction between the user and HDFS. The user interfaces that Hive supports are
the Hive Web UI, the Hive command line, and Hive HDInsight (on Windows Server).
Meta Store: Hive chooses respective database servers to store the schema or
metadata of tables, databases, columns in a table, their data types, and the
HDFS mapping.
HiveQL Process Engine: HiveQL is similar to SQL for querying the schema information
in the metastore. It is one of the replacements for the traditional approach of
writing MapReduce programs: instead of writing a MapReduce program in Java, we can
write a HiveQL query and have it processed as a MapReduce job.
Execution Engine: The conjunction of the HiveQL process engine and MapReduce is
the Hive execution engine. The execution engine processes the query and generates
results that are the same as MapReduce results; it uses the flavor of MapReduce.
HDFS or HBase: The Hadoop Distributed File System or HBase is the data storage
technique used to store data in the file system.
Working of Hive
Execute Query
The Hive interface, such as the command line or Web UI, sends the query to the
driver (any database driver such as JDBC, ODBC, etc.) to execute it.
Get Plan
The driver takes the help of the query compiler, which parses the query to check
the syntax and the query plan or the requirement of the query.
Get Metadata
The compiler sends metadata request to Meta store (any database).
Send Meta data
Meta store sends metadata as a response to the compiler.
Send plan
The compiler checks the requirement and resends the plan to the driver. Up
to here, the parsing and compiling of a query is complete.
Execute Plan
The driver sends the execute plan to the execution engine.
Execute Job
Internally, the process of executing the job is a MapReduce job. The execution
engine sends the job to the JobTracker, which resides in the Name node, and the
JobTracker assigns this job to the TaskTracker, which resides in the Data node.
Here, the query executes the MapReduce job.
Metadata Ops
Meanwhile in execution, the execution engine can execute metadata
operations with Meta store.
Fetch Result
The execution engine receives the results from Data nodes.
Send Results
The execution engine sends those resultant values to the driver.
The driver sends the results to the Hive interfaces.
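The plan produced in the "Get Plan" step can be inspected directly with Hive's
EXPLAIN statement. A minimal sketch, assuming the employee table created later in
this section:

hive> EXPLAIN SELECT name, salary FROM employee WHERE salary > 50000;

The output lists the stages (typically MapReduce jobs) that the compiler has
generated for the query.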
What is the Hive Metastore?
The metastore is the central repository of Apache Hive metadata. It stores metadata
for Hive tables (such as their schema and location) and partitions in a relational
database. It provides client access to this information via the metastore service
API.
The Hive Metastore consists of two fundamental units:
1. A service that provides metastore access to other Apache Hive services.
2. Disk storage for the Hive metadata, which is separate from HDFS storage.
Hive Metastore Modes
There are three modes for Hive Metastore deployment: embedded, local, and remote.
i. Embedded Metastore
By default, Hive uses an embedded Derby database as its metastore, which runs in
the same JVM as the Hive service and supports only a single open session at a time.
ii. Local Metastore
Hive is a data-warehousing framework, so Hive is not meant for single sessions. To
overcome this limitation of the Embedded Metastore, the Local Metastore was
introduced. This mode allows us to have many Hive sessions, i.e., many users can
use the metastore at the same time.
We can achieve this by using any JDBC-compliant database, such as MySQL, which
runs in a separate JVM or on a different machine than the Hive service and
metastore service, which still run in the same JVM.
This configuration is called a local metastore because the metastore service still
runs in the same process as Hive, but it connects to a database running in a
separate process, either on the same machine or on a remote machine.
Before starting the Apache Hive client, add the JDBC/ODBC driver libraries to the
Hive lib folder.
MySQL is a popular choice for the standalone metastore. In this case,
the javax.jdo.option.ConnectionURL property is set
to jdbc:mysql://host/dbname?createDatabaseIfNotExist=true,
and javax.jdo.option.ConnectionDriverName is set to com.mysql.jdbc.Driver. The
JDBC driver JAR file for MySQL (Connector/J) must be on Hive's classpath,
which is achieved by placing it in Hive's lib directory.
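Putting these properties together, a minimal hive-site.xml sketch for a
MySQL-backed local metastore might look as follows (the host and database name
are placeholders):

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://host/dbname?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
</configuration>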
iii. Remote Metastore
Moving further, another metastore configuration is the Remote Metastore. In this
mode, the metastore runs in its own separate JVM, not in the Hive service JVM. If
other processes want to communicate with the metastore server, they can
communicate using Thrift network APIs.
We can also have more than one metastore server in this case to provide higher
availability. This also brings better manageability and security, because the
database tier can be completely firewalled off, and the clients no longer need to
share database credentials with each Hive user to access the metastore database.
To use this remote metastore, you should configure the Hive service by
setting hive.metastore.uris to the metastore server URI(s). Metastore server URIs
are of the form thrift://host:port, where the port corresponds to the one set by
METASTORE_PORT when starting the metastore server.
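For example, a client-side hive-site.xml might point at the metastore server as
follows (the host is a placeholder; 9083 is the conventional default metastore
port):

<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastorehost:9083</value>
  </property>
</configuration>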
Databases Supported by Hive
Hive supports five backend databases, which are as follows:
Derby
MySQL
MS SQL Server
Oracle
Postgres
Conclusion – Hive Metastore
In conclusion, we can say that the Hive Metastore is a central repository for
storing all of the Hive metadata. This metadata includes various types of
information, such as the structure of tables, their relations, etc. Above, we have
also discussed all three metastore modes in detail.
HIVE Tables
Create Table Statement
Create Table is a statement used to create a table in Hive. The syntax and example
are as follows:
Syntax
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]
Example
Let us assume you need to create a table named employee. The following table lists
the fields of the employee table and their data types:
Sr.No  Field Name   Data Type
1      Eid          int
2      Name         String
3      Salary       Float
4      Designation  String
The following data define a comment and row-format fields, such as the field
terminator, the line terminator, and the stored file type.
COMMENT 'Employee details'
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
The following query creates a table named employee using the above data.
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String,
salary Float, designation String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
If you add the option IF NOT EXISTS, Hive ignores the statement in case the table
already exists.
On successful creation of table, you get to see the following response:
OK
Time taken: 5.905 seconds
hive>
Alter Table Statement
It is used to alter a table in Hive.
Syntax
The statement takes any of the following syntaxes based on what attributes we
wish to modify in a table.
ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])
Rename To… Statement
The following query renames the table from employee to emp.
hive> ALTER TABLE employee RENAME TO emp;
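The other ALTER TABLE forms listed above follow the same pattern. For instance,
on the renamed emp table (the added column and the new column name are
illustrative):

hive> ALTER TABLE emp ADD COLUMNS (dept STRING COMMENT 'Department name');
hive> ALTER TABLE emp CHANGE name ename String;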
Drop Table Statement
The syntax is as follows:
DROP TABLE [IF EXISTS] table_name;
The following query drops a table named employee:
hive> DROP TABLE IF EXISTS employee;
On successful execution of the query, you get to see the following response:
OK
Time taken: 5.3 seconds
hive>
Text File Format: Text files are simple plain text files where each line represents
a record. Hive can handle various text file formats, such as CSV (Comma-
Separated Values), TSV (Tab-Separated Values), and custom delimited formats.
-- Create a table using Text file format
CREATE TABLE my_table_text (
id INT,
name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Sequence File Format: Sequence files are binary files that contain a sequence of
key-value pairs. They are suitable for storing large amounts of structured or
unstructured data efficiently.
-- Create a table using Sequence file format
CREATE TABLE my_table_sequence (
id INT,
name STRING
)
STORED AS SEQUENCEFILE;
Parquet File Format: Parquet is a columnar storage file format commonly used in
Hive. It offers efficient compression, predicate pushdown, and column-level
pruning, making it highly suitable for analytical workloads.
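Following the pattern of the previous two formats, a Parquet-backed table can be
declared with STORED AS PARQUET (supported natively since Hive 0.13):

-- Create a table using Parquet file format
CREATE TABLE my_table_parquet (
id INT,
name STRING
)
STORED AS PARQUET;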