Introduction to HIVE

Hive is a data warehousing and SQL-like query language software that facilitates
querying and managing large datasets in distributed storage.

It is built on top of Hadoop and provides a high-level interface for working with data
stored in Hadoop Distributed File System (HDFS).

2007: Inception at Facebook
• Hive originated at Facebook in 2007 as an open-source project.
• It was developed to provide a SQL-like interface for querying and analyzing data stored in Hadoop.
• Facebook engineers, including Jeff Hammerbacher and Joydeep Sen Sarma, played a significant role in its creation.

2008: Contribution to Apache Hadoop
• In September 2008, Facebook open-sourced Hive and contributed it to the Apache Software Foundation.
• This move allowed the broader community to participate in its development and improvement.

2009: Graduation as a Top-Level Project
• Hive became a top-level Apache project in July 2009, indicating its maturity and stability within the Apache Software Foundation.

2010-2012: Continued Development and Improvements
• During this period, several releases of Hive brought enhancements, bug fixes, and performance improvements.
• The community around Hive grew, and it became widely adopted in the industry.

2013-2015: Stinger Initiative and Performance Improvements
• The Stinger Initiative, announced in 2013, aimed to improve the speed, scale, and SQL compatibility of Hive.
• Various releases under this initiative introduced optimizations, including the introduction of Apache Tez as an execution engine.

2016-2017: Hive 2.0 and ACID Transactions
• In 2016, Hive 2.0 was released, bringing mature ACID (Atomicity, Consistency, Isolation, Durability) transaction support for Hive tables.
• This allowed for more reliable and consistent operations on Hive tables.

2018-2019: Further Enhancements and Apache ORC Improvements
• Releases during this period focused on improving the performance of the Apache ORC (Optimized Row Columnar) file format, which is commonly used with Hive for storing and processing data efficiently.

2020-2022: Hive 3.x and Continued Evolution
• Hive 3.x releases introduced significant improvements, including broader support for ACID operations, enhancements to vectorization, and improvements to the Hive Metastore.
HIVE

• A data store and data warehouse infrastructure.
• Provides data summarization.
• Creates tables, files, and databases.
• A processing tool on top of Hadoop.
Characteristics of Hive
1. Capability to translate queries into MapReduce jobs, making Hive scalable.
2. Handles data warehouse applications, and is therefore suitable for the analysis of
static data of extremely large size, where fast response time is not a criterion.
3. Supports a web interface as well as an application API, so both programs and web
browser clients can access the Hive DB server.
4. Provides an SQL dialect, the Hive Query Language (abbreviated HiveQL or HQL).
5. The results of HiveQL queries and the data loaded into tables are stored on the
Hadoop cluster.
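As an illustration of characteristic 1, a simple HiveQL aggregation such as the following (the table and column names here are hypothetical) is compiled by Hive into one or more MapReduce jobs behind the scenes:

```sql
-- Hypothetical web-server log table; Hive translates this
-- GROUP BY aggregation into a MapReduce job at execution time.
SELECT status_code, COUNT(*) AS hits
FROM access_logs
GROUP BY status_code;
```

The user writes only declarative SQL; the parallelization across the cluster is handled by the generated jobs.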

Hive Limitations

• Not a full database.
• Does not provide row-level update, alter, and delete operations on records.
• Not designed for unstructured data.
• Not designed for real-time queries.
Hive architecture

Hive Server (Thrift)

• An optional service.
• Remote client submits requests to Hive.
• Retrieves results.
• Thrift Server exposes a very simple client API to execute HiveQL statements.

Client Requests to Hive Server (Thrift)

• Requests can be in a variety of programming languages.

Hive CLI (Command Line Interface)

• A popular interface to interact with Hive.
• Hive can also run in local mode, in which case the CLI uses local storage
rather than HDFS on a Hadoop cluster.

Hive Web Interface (HWI)

• Hive can be accessed using a web browser as well.
• An HWI server runs on a designated port.

Metastore

• The system catalog.
• Stores the schema or metadata of tables, databases, columns in a table, their
data types, and the HDFS mapping.

Hive Driver

• Manages the lifecycle of a HiveQL statement during compilation, optimization,
and execution.
Comparison of Hive with RDBMS

Characteristic          Hive                      RDBMS
Record-level queries    No update and delete      Insert, update and delete
Transaction support     No                        Yes
Latency                 Minutes or more           Fractions of a second
Data size               Petabytes                 Terabytes
Data per query          Petabytes                 Gigabytes
Query language          HiveQL                    SQL
JDBC/ODBC support       Limited                   Full

Hive Data Types and File Formats

Hive defines various primitive, string, date/time, and collection data types, as well
as file formats, for handling and storing different kinds of data.

Data Type Name  Description

TINYINT    1-byte signed integer. Postfix letter is Y.
SMALLINT   2-byte signed integer. Postfix letter is S.
INT        4-byte signed integer.
BIGINT     8-byte signed integer. Postfix letter is L.
FLOAT      4-byte single-precision floating-point number.
DOUBLE     8-byte double-precision floating-point number.
BOOLEAN    True or False.
TIMESTAMP  UNIX timestamp with optional nanosecond precision. It supports the
           java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff".
DATE       YYYY-MM-DD format.
VARCHAR    1 to 65535 characters. Use single quotes ('') or double quotes ("").
CHAR       Fixed length, up to 255 characters.
DECIMAL    Used for representing immutable arbitrary-precision numbers.
           DECIMAL(precision, scale) format.

Hive also provides three collection data types: ARRAY, MAP, and STRUCT.
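A sketch of how the three collection types can appear in a table definition (the table and column names here are hypothetical):

```sql
-- Hypothetical employees table using all three collection types.
CREATE TABLE employees (
  name    STRING,
  skills  ARRAY<STRING>,                     -- e.g. a list of skill names
  phone   MAP<STRING, STRING>,               -- e.g. label -> phone number
  address STRUCT<city:STRING, zip:STRING>    -- named, typed fields
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':';
```

The ROW FORMAT clauses tell Hive which delimiters separate fields, collection items, and map keys when the data is stored as text.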

File Format Description

Text file     • The default file format; a line represents a record, and
                delimiter characters separate the fields within a record.
              • Text file examples are CSV, TSV, JSON, and XML.
Sequence file   A flat file that stores binary key-value pairs and supports
                compression.
RCFile          Record Columnar file.
ORCFILE         ORC stands for Optimized Row Columnar, which stores data in a
                more optimized way than the other file formats.
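The storage format is chosen per table with the STORED AS clause; a minimal sketch (the table names here are hypothetical):

```sql
-- Hypothetical tables showing per-table storage formats.
CREATE TABLE logs_text (line STRING) STORED AS TEXTFILE;  -- the default format
CREATE TABLE logs_orc  (line STRING) STORED AS ORC;       -- columnar, compressed
```

Columnar formats such as ORC typically give much better compression and scan performance for analytical queries than plain text files.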
HIVE Data Model
The Hive data model is designed to handle large-scale data processing and analysis
on top of distributed storage systems, typically Hadoop Distributed File System
(HDFS).
Key components of the Hive data model:

Name Description
Database • In Hive, a database is a logical container for tables.
• It helps organize and manage tables.
• Users can switch between databases to isolate their tables and queries.
Table • Tables in Hive are similar to tables in a relational database.
• They consist of rows and columns, and each column has a specified
data type.
• Hive supports both managed tables (where Hive manages the data)
and external tables (where data is stored outside Hive, and Hive
simply provides a schema).
Partition • Partitions are a way to organize data in Hive tables based on specific
columns.
• Partitioning is beneficial for improving query performance, as it
allows for the elimination of irrelevant data during query processing.
• For example, a table of log data might be partitioned by date.
Bucketing • Bucketing is a technique in Hive to distribute data within partitions
further.
• It involves dividing data into a fixed number of buckets based on the
hash of a column.
• Bucketing can be useful for optimizing certain types of queries, such
as join operations.
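The table, partition, and bucketing concepts above can be sketched in DDL as follows (the table and column names here are hypothetical):

```sql
-- Hypothetical log table: partitioned by date, bucketed by user_id.
CREATE TABLE logs (
  user_id BIGINT,
  message STRING
)
PARTITIONED BY (log_date STRING)           -- one directory per date value
CLUSTERED BY (user_id) INTO 32 BUCKETS     -- hash of user_id picks the bucket
STORED AS ORC;
```

A query that filters on log_date can then skip every partition directory that does not match, and the fixed bucket count lets Hive optimize joins on user_id.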
Hive Integration and Workflow Steps

The workflow steps

Execute Query   The Hive interface (CLI or web interface) sends a query to the
                driver to execute the query.
Get Plan        The driver sends the query to the query compiler, which parses
                the query to check the syntax and build the query plan.
Get Metadata    The compiler sends a metadata request to the Metastore (backed
                by a database such as MySQL).
Send Metadata   The Metastore sends metadata as a response to the compiler.
Send Plan       The compiler checks the requirements and resends the plan to the
                driver. The parsing and compiling of the query are complete at
                this point.
Execute Plan    The driver sends the execution plan to the execution engine.
Execute Job     • Internally, the execution of the job is a MapReduce job.
                • The execution engine sends the job to the JobTracker, which is
                  in the Name node, and it assigns this job to the TaskTracker,
                  which is in the Data node. Then the query executes the job.
Metadata        Meanwhile, the execution engine can execute metadata operations
Operations      with the Metastore.
Fetch Result    The execution engine receives the results from the Data nodes.
Send Results    The execution engine sends the results to the driver.
Send Results    The driver sends the results to the Hive interfaces.
Hive Built-in Functions

Return Type  Syntax                          Description

BIGINT   round(double a)                     Returns the rounded BIGINT (8-byte
                                             integer) value of the 8-byte double-
                                             precision floating-point number a.
BIGINT   floor(double a)                     Returns the maximum BIGINT value that
                                             is equal to or less than the double.
BIGINT   ceil(double a)                      Returns the minimum BIGINT value that
                                             is equal to or greater than the double.
double   rand(), rand(int seed)              Returns a random number (double)
                                             distributed uniformly from 0 to 1 that
                                             changes in each row. An integer seed
                                             ensures that the random number
                                             sequence is deterministic.
string   concat(string str1, string str2, ...)  Returns the string resulting from
                                             concatenating str1 with str2.
string   substr(string str, int start)       Returns the substring of str starting
                                             from the start position until the end
                                             of str.
string   substr(string str, int start, int length)  Returns the substring of str
                                             starting from the start position with
                                             the given length.
string   upper(string str), ucase(string str)  Returns the string resulting from
                                             converting all characters of str to
                                             upper case.
string   lower(string str), lcase(string str)  Returns the string resulting from
                                             converting all characters of str to
                                             lower case.
string   trim(string str)                    Returns the string resulting from
                                             trimming spaces from both ends.
                                             trim(' 12A34 ') returns '12A34'.
string   ltrim(string str)                   Returns the string resulting from
                                             trimming spaces from the beginning
                                             (left-hand side) of str.
                                             ltrim(' 12A34 ') returns '12A34 '.
string   rtrim(string str)                   Returns the string resulting from
                                             trimming spaces from the end
                                             (right-hand side) of str.
                                             rtrim(' 12A34 ') returns ' 12A34'.
int      year(string date)                   Returns the year part of a date or a
                                             timestamp string.
int      month(string date)                  Returns the month part of a date or a
                                             timestamp string.
int      day(string date)                    Returns the day part of a date or a
                                             timestamp string.
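Several of these functions combined in a single query (the table and column names here are hypothetical):

```sql
-- Hypothetical accounts table; demonstrates round, concat, upper, and year.
SELECT round(balance)                            AS rounded_balance,
       concat(upper(first_name), ' ', last_name) AS display_name,
       year(opened_date)                         AS opened_year
FROM accounts;
```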
Example: using the return data types in domain tables

In projects in domains such as banking or insurance, built with Oracle SQL or
HiveQL, these return data types are used to define the structure of database
tables, where they correspond to attributes or properties of entities in the
system.
BIGINT:
Commonly used for representing large integer values such as account numbers, policy
numbers, or unique identifiers.
DOUBLE:
Used for floating-point numbers, suitable for storing financial values that require
decimal precision, such as amounts or percentages.
STRING:
Typically used for storing textual data like names, addresses, or descriptions.

In the context of banking, it could be used for customer names or branch locations.

In insurance, it might store policyholder names or descriptions of coverage.


INT:
Used for storing integer values, which could represent various attributes like age,
duration, or any other numeric data that doesn't require a large range.
Banking Domain:

• BIGINT: Account numbers, transaction IDs.


• DOUBLE: Account balances, transaction amounts.
• STRING: Customer names, branch names, transaction descriptions.
• INT: Customer ages, transaction types.
Insurance Domain:

• BIGINT: Policy numbers, claim IDs.


• DOUBLE: Coverage amounts, claim amounts.
• STRING: Policyholder names, insurance types, claim descriptions.
• INT: Policy durations, insured item quantities.
CREATE TABLE BankAccounts (
  account_number BIGINT,
  balance        DOUBLE,
  customer_name  STRING,
  age            INT
);

CREATE TABLE InsurancePolicies (
  policy_number     BIGINT,
  coverage_amount   DOUBLE,
  policyholder_name STRING,
  duration          INT
);
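A sketch of querying the BankAccounts table above, combining the types discussed (the filter values are illustrative only):

```sql
-- Hypothetical query over the BankAccounts table defined above.
SELECT customer_name,
       round(balance) AS approx_balance
FROM BankAccounts
WHERE age >= 18
ORDER BY approx_balance DESC
LIMIT 10;
```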
