
HIVE

Hive is a data warehouse system used to analyze structured data. It is built on top of Hadoop.
Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage. It runs SQL-like queries, called HQL (Hive Query Language), which get internally converted into MapReduce jobs.
Using Hive, we can skip the traditional approach of writing complex MapReduce programs. Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and User-Defined Functions (UDFs).
Initially, Hive was developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive. It is used by many companies; for example, Amazon uses it in Amazon Elastic MapReduce.
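
As a minimal illustrative sketch (the sales table and its columns are hypothetical), a query like the following is compiled by Hive into a MapReduce job instead of hand-written Java:
-- Aggregation that Hive translates into a map (group) and reduce (sum) phase
SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region;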

Hive is not:
o A relational database
o A design for OnLine Transaction Processing (OLTP)
o A language for real-time queries and row-level updates

Features of Hive
The following are the features of Hive:
o Hive is fast and scalable.
o It provides SQL-like queries (i.e., HQL) that are implicitly transformed into MapReduce or Spark jobs.
o It is capable of analyzing large datasets stored in HDFS.
o It allows different storage types such as plain text, RCFile, and HBase.
o It uses indexing to accelerate queries.
o It can operate on compressed data stored in the Hadoop ecosystem.
o It supports user-defined functions (UDFs), with which users can plug in their own functionality (see the sketch below).
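
As a minimal sketch of how a UDF is wired in, the following HQL registers and invokes a custom function; the JAR path and class name (com.example.hive.Sanitize) are hypothetical placeholders, and the Java class itself must be written separately:
ADD JAR /path/to/custom-udfs.jar;   -- hypothetical JAR containing the UDF class
CREATE TEMPORARY FUNCTION sanitize AS 'com.example.hive.Sanitize';
SELECT sanitize(name) FROM employee;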

Limitations of Hive
o Hive is not capable of handling real-time data.
o It is not designed for online transaction processing.
o Hive queries have high latency.

Hive Datatypes
Primitive types
1. Integers: TINYINT, SMALLINT, INT, BIGINT
2. Boolean: BOOLEAN
3. Floating-point types: FLOAT, DOUBLE
4. String: STRING
Complex types
1. Structs: {a:int, b:int}
2. Maps: key-value pairs, e.g. {'group': 1}
3. Arrays: ['a', 'b', 'c']
String Types
String-type data can be specified using single quotes (' ') or double quotes (" "). Hive provides two further string data types: VARCHAR and CHAR. Hive follows C-style escape characters.
The following table depicts the VARCHAR and CHAR data types:
Data Type    Length
VARCHAR      1 to 65535
CHAR         255
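
As an illustrative sketch (the table and column names are made up for this example), fixed- and variable-length string columns are declared like this:
CREATE TABLE contacts (
  country_code CHAR(2),     -- fixed-length, blank-padded to 2 characters
  email VARCHAR(254)        -- variable-length, at most 254 characters
);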

Timestamp
It supports the traditional UNIX timestamp with optional nanosecond precision, using the java.sql.Timestamp format "yyyy-mm-dd hh:mm:ss.fffffffff" with up to nine decimal places of fractional seconds.
Dates
DATE values are described in year/month/day format in the form YYYY-MM-DD.
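As a small illustration, DATE and TIMESTAMP values can be produced by casting string literals in these formats (the literal values are arbitrary):
SELECT CAST('2023-01-15' AS DATE);
SELECT CAST('2023-01-15 10:30:00.123' AS TIMESTAMP);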
Decimals
The DECIMAL type in Hive is the same as Java's BigDecimal format. It is used for representing immutable arbitrary-precision numbers. The syntax and an example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
Union Types
A union is a collection of heterogeneous data types. You can create an instance using the create_union UDF. The syntax and an example are as follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>

{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>
Structs
Structs in Hive are records of named fields, where each field can carry an optional comment.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
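Putting the complex types together, here is a hedged sketch of a table declaration using ARRAY, MAP, and STRUCT columns (the table and field names are hypothetical):
CREATE TABLE user_profiles (
  id INT,
  hobbies ARRAY<STRING>,                      -- e.g. ['chess','hiking']
  scores MAP<STRING, INT>,                    -- e.g. {'math': 90}
  address STRUCT<city:STRING, zip:STRING>     -- nested named fields
);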
Decimal Type
Decimal type data is a floating-point value with a higher range than the DOUBLE data type. The range of the decimal type is approximately -10^308 to 10^308.

Null Value
Missing values are represented by the special value NULL.
Hive Architecture

User Interface: Hive is data warehouse infrastructure software that creates interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HDInsight (on Windows Server).

Metastore: Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.

HiveQL Process Engine: HiveQL is similar to SQL for querying the schema information in the Metastore. It is one of the replacements for the traditional approach of writing a MapReduce program: instead of writing MapReduce code in Java, we write an HQL query and let Hive translate it.

Execution Engine: The conjunction of the HiveQL process engine and MapReduce is the Hive execution engine. The execution engine processes the query and generates results the same as MapReduce results. It uses the flavor of MapReduce.

HDFS or HBase: The Hadoop Distributed File System or HBase are the data storage techniques used to store data in the file system.
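
As a hedged illustration of the HDFS mapping recorded in the Metastore, an external table can be pointed at an existing HDFS directory (the path /data/logs and the table below are hypothetical):
CREATE EXTERNAL TABLE logs (
  ts STRING,
  message STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/logs';   -- hypothetical HDFS directory holding the raw files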

Working of Hive

Execute Query
 The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
Get Plan
 The driver takes the help of the query compiler, which parses the query to check the syntax and build the query plan, i.e. the requirement of the query.
Get Metadata
 The compiler sends a metadata request to the Metastore (any database).
Send Metadata
 The Metastore sends the metadata as a response to the compiler.
Send Plan
 The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
Execute Plan
 The driver sends the execute plan to the execution engine.
Execute Job
 Internally, the process of executing the job is a MapReduce job. The execution engine sends the job to the JobTracker, which resides on the Name node, and the JobTracker assigns this job to the TaskTracker, which resides on the Data node. Here, the query executes the MapReduce job.
Metadata Ops
 Meanwhile, during execution, the execution engine can perform metadata operations with the Metastore.
Fetch Result
 The execution engine receives the results from the Data nodes.
Send Results
 The execution engine sends those resultant values to the driver.
 The driver sends the results to the Hive interfaces.
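To see the plan the compiler produces (the "Get Plan" and "Send Plan" steps above), you can prefix any query with EXPLAIN; this hedged example assumes the employee table created later in this document:
hive> EXPLAIN SELECT name, salary FROM employee WHERE salary > 40000;
The output lists the stages (for example, a map-reduce stage followed by a fetch stage) that the execution engine will run.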
What is the Hive Metastore?
The Metastore is the central repository of Apache Hive metadata. It stores metadata for Hive tables (like their schema and location) and partitions in a relational database. It provides client access to this information through the metastore service API.
The Hive Metastore consists of two fundamental units:

1. A service that provides metastore access to other Apache Hive services.
2. Disk storage for the Hive metadata, which is separate from HDFS storage.
Hive Metastore Modes
There are three modes for Hive Metastore deployment:

 Embedded Metastore
 Local Metastore
 Remote Metastore
Let's now discuss the above three Hive Metastore deployment modes one by one.

i. Embedded Metastore

By default in Hive, the metastore service runs in the same JVM as the Hive service. In this mode it uses an embedded Derby database stored on the local file system. Thus both the metastore service and the Hive service run in the same JVM using the embedded Derby database.
But this mode also has a limitation: since only one embedded Derby database can access the database files on disk at any one time, only one Hive session can be open at a time.

Embedded Deployment mode for Hive Metastore

If we try to start a second session, it produces an error when it attempts to open a connection to the metastore. So, to allow many services to connect to the metastore, Derby must be configured as a network server. This mode is good for unit testing, but it is not good for practical solutions.
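
For reference, a hedged sketch of the hive-site.xml settings that correspond to this embedded mode (these match the commonly shipped Derby defaults, but verify against your installation):
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>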

ii. Local Metastore
Hive is a data-warehousing framework, so Hive does not prefer a single session. To overcome this limitation of the Embedded Metastore, the Local Metastore was introduced. This mode allows us to have many Hive sessions, i.e., many users can use the metastore at the same time.
We can achieve this by using any JDBC-compliant database, like MySQL, which runs in a separate JVM or on a different machine than the Hive service and metastore service, which run together in the same JVM.

Local Metastore

This configuration is called a local metastore because the metastore service still runs in the same process as Hive, but it connects to a database running in a separate process, either on the same machine or on a remote machine.

Before starting the Apache Hive client, add the JDBC/ODBC driver libraries to the Hive lib folder.
MySQL is a popular choice for the standalone metastore. In this case, the javax.jdo.option.ConnectionURL property is set to jdbc:mysql://host/dbname?createDatabaseIfNotExist=true, and javax.jdo.option.ConnectionDriverName is set to com.mysql.jdbc.Driver. The JDBC driver JAR file for MySQL (Connector/J) must be on Hive's classpath, which is achieved by placing it in Hive's lib directory.
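A hedged hive-site.xml sketch of the MySQL settings just described (the host metastorehost and database metastore_db are hypothetical placeholders):
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastorehost/metastore_db?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>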
iii. Remote Metastore
Moving further, another metastore configuration is called the Remote Metastore. In this mode, the metastore runs in its own separate JVM, not in the Hive service JVM. Other processes that want to communicate with the metastore server can do so using the Thrift network APIs.
We can also have one or more additional metastore servers in this case to provide higher availability. This also brings better manageability and security, because the database tier can be completely firewalled off, and clients no longer need to share database credentials with each Hive user to access the metastore database.
Remote Metastore
To use the remote metastore, you should configure the Hive service by setting hive.metastore.uris to the metastore server URI(s). Metastore server URIs are of the form thrift://host:port, where the port corresponds to the one set by METASTORE_PORT when starting the metastore server.
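As a hedged sketch, the corresponding hive-site.xml entry might look like the following (the host and port are hypothetical; 9083 is the conventional default metastore port):
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastorehost:9083</value>
</property>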
Databases Supported by Hive
Hive supports five backend databases for the metastore, which are as follows:

 Derby
 MySQL
 MS SQL Server
 Oracle
 Postgres
So, this was all about the Hive Metastore. We hope you liked our explanation.

Conclusion – Hive Metastore


In conclusion, we can say that the Hive Metastore is a central repository for storing all the Hive metadata. Metadata includes various types of information, like the structure of tables, relations, etc. Above, we have also discussed all three metastore modes in detail.

HIVE Tables
Create Table Statement
Create Table is a statement used to create a table in Hive. The syntax and example are as follows:
Syntax
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]
Example
Let us assume you need to create a table named employee using the CREATE TABLE statement. The following table lists the fields and their data types in the employee table:

Sr.No    Field Name     Data Type
1        Eid            int
2        Name           String
3        Salary         Float
4        Designation    String

The following data comprises a comment and row-format fields such as the field terminator, lines terminator, and stored file type.
COMMENT 'Employee details'
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
The following query creates a table named employee using the above data.
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String,
salary Float, designation String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
If you add the option IF NOT EXISTS, Hive ignores the statement in case the table
already exists.
On successful creation of the table, you get to see the following response:
OK
Time taken: 5.905 seconds
hive>
Alter Table Statement
It is used to alter a table in Hive.
Syntax
The statement takes any of the following forms, depending on which attributes we wish to modify in a table (illustrative examples follow the list).
ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])
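As hedged examples of the ADD COLUMNS and CHANGE forms above (the dept and ename names are illustrative):
hive> ALTER TABLE employee ADD COLUMNS (dept String COMMENT 'Department name');
hive> ALTER TABLE employee CHANGE name ename String;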
Rename To… Statement
The following query renames the table from employee to emp.
hive> ALTER TABLE employee RENAME TO emp;
Drop Table Statement
The syntax is as follows:
DROP TABLE [IF EXISTS] table_name;
The following query drops a table named employee:
hive> DROP TABLE IF EXISTS employee;
On successful execution of the query, you get to see the following response:
OK
Time taken: 5.3 seconds
hive>

HIVE FILE FORMATS

Text File Format: Text files are simple plain-text files where each line represents a record. Hive can handle various text file formats, such as CSV (Comma-Separated Values), TSV (Tab-Separated Values), and custom delimited formats.
-- Create a table using Text file format
CREATE TABLE my_table_text (
id INT,
name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Sequence File Format: Sequence files are binary files that contain a sequence of
key-value pairs. They are suitable for storing large amounts of structured or
unstructured data efficiently.
-- Create a table using Sequence file format
CREATE TABLE my_table_sequence (
id INT,
name STRING
)
STORED AS SEQUENCEFILE;
Parquet File Format: Parquet is a columnar storage file format commonly used in
Hive. It offers efficient compression, predicate pushdown, and column-level
pruning, making it highly suitable for analytical workloads.

-- Create a table using Parquet file format


CREATE TABLE my_table_parquet (
id INT,
name STRING
)
STORED AS PARQUET;
Avro File Format: Avro is a binary file format that provides a compact and
efficient way to serialize structured data. It supports schema evolution and is often
used for data serialization in Hive.
-- Create a table using Avro file format
CREATE TABLE my_table_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS AVRO
AS
SELECT id, name FROM my_table_text;
