Introduction to HIVE

Hive is a data warehousing and SQL-like query language software that facilitates
querying and managing large datasets in distributed storage.

It is built on top of Hadoop and provides a high-level interface for working with data
stored in Hadoop Distributed File System (HDFS).

2007: Inception at Facebook
• Hive originated at Facebook in 2007 as an open-source project.
• It was developed to provide a SQL-like interface for querying and analyzing data stored in Hadoop.
• Facebook engineers, including Jeff Hammerbacher and Joydeep Sen Sarma, played a significant role in its creation.

2008: Contribution to Apache Hadoop
• In September 2008, Facebook open-sourced Hive and contributed it to the Apache Software Foundation.
• This move allowed the broader community to participate in its development and improvement.

2009: Graduation as a Top-Level Project
• Hive became a top-level Apache project in July 2009, indicating its maturity and stability within the Apache Software Foundation.

2010-2012: Continued Development and Improvements
• During this period, several releases of Hive brought enhancements, bug fixes, and performance improvements.
• The community around Hive grew, and it became widely adopted in the industry.

2013-2015: Stinger Initiative and Performance Improvements
• The Stinger Initiative, announced in 2013, aimed to improve the speed, scale, and SQL compatibility of Hive.
• Various releases under this initiative introduced optimizations, including the introduction of Apache Tez as an execution engine.

2016-2017: Hive 2.0 and ACID Transactions
• In 2016, Hive 2.0 was released, bringing mature ACID (Atomicity, Consistency, Isolation, Durability) transaction support for Hive tables.
• This allowed for more reliable and consistent operations on Hive tables.

2018-2019: Further Enhancements and Apache ORC Improvements
• Releases during this period focused on improving the performance of the Apache ORC (Optimized Row Columnar) file format, which is commonly used with Hive for storing and processing data efficiently.

2020-2022: Hive 3.x and Continued Evolution
• Hive 3.x releases introduced significant improvements, including broader support for ACID operations, enhancements to vectorization, and improvements to the Hive Metastore.
HIVE

• A data store and data warehouse infrastructure.
• Provides data summarization.
• Creates tables, files, and databases.
• A processing tool on top of Hadoop.
Characteristics of Hive
1. Capability to translate queries into MapReduce jobs, making Hive scalable.
2. Handles data warehouse applications, and is therefore suitable for the analysis of
static data of extremely large size, where fast response time is not a criterion.
3. Supports a web interface as well as an application API, so both programs and web
browser clients can access the Hive DB server.
4. Provides an SQL dialect, the Hive Query Language (abbreviated HiveQL or HQL).
5. The results of HiveQL queries and the data loaded into tables are stored on the
Hadoop cluster.
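As an illustration of characteristic 1, a simple HiveQL aggregation such as the following (the table and column names here are hypothetical) is compiled by Hive into one or more MapReduce jobs behind the scenes:

```sql
-- Hypothetical web-server log table; Hive translates this
-- GROUP BY aggregation into a MapReduce job at execution time.
SELECT status_code, COUNT(*) AS hits
FROM access_logs
GROUP BY status_code;
```

The user writes only declarative SQL; the parallelization across the cluster is handled by the generated jobs.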

Hive Limitations

• Not a full database.
• Does not provide row-level update, alter, and delete operations on records.
• Not designed for unstructured data.
• Not designed for real-time queries.
Hive architecture

Hive Server (Thrift)

• An optional service.
• Remote client submits requests to Hive.
• Retrieves results.
• Thrift Server exposes a very simple client API to execute HiveQL statements.

Client Requests to Hive Server (Thrift)

• Requests can be in a variety of programming languages.

Hive CLI (Command Line Interface)

• A popular interface to interact with Hive.
• Hive can also run in local mode, in which case the CLI uses local storage
rather than HDFS on a Hadoop cluster.

Hive Web Interface (HWI)

• Hive can be accessed using a web browser as well.
• An HWI server runs on a designated port.

Metastore

• The system catalog.
• Stores the schema or metadata of tables, databases, columns in a table, their
data types, and the HDFS mapping.

Hive Driver

• Manages the lifecycle of a HiveQL statement during compilation, optimization,
and execution.
Comparison of Hive with RDBMS

Characteristic          Hive                      RDBMS
Record-level queries    No update and delete      Insert, update and delete
Transaction support     No                        Yes
Latency                 Minutes or more           Fractions of a second
Data size               Petabytes                 Terabytes
Data per query          Petabytes                 Gigabytes
Query language          HiveQL                    SQL
JDBC/ODBC support       Limited                   Full

Hive Data Types and File Formats

Hive defines various primitive, string, date/time, and collection data types, as well
as file formats, for handling and storing different kinds of data.

Data Type Name  Description

TINYINT    1-byte signed integer. Postfix letter is Y.
SMALLINT   2-byte signed integer. Postfix letter is S.
INT        4-byte signed integer.
BIGINT     8-byte signed integer. Postfix letter is L.
FLOAT      4-byte single-precision floating-point number.
DOUBLE     8-byte double-precision floating-point number.
BOOLEAN    True or False.
TIMESTAMP  UNIX timestamp with optional nanosecond precision. It supports the
           java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff".
DATE       YYYY-MM-DD format.
VARCHAR    1 to 65535 characters. Use single quotes ('') or double quotes ("").
CHAR       Fixed length, up to 255 characters.
DECIMAL    Used for representing immutable arbitrary-precision numbers.
           DECIMAL(precision, scale) format.

Hive also provides three collection data types: ARRAY, MAP, and STRUCT.
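A sketch of how the three collection types can appear in a table definition (the table and column names here are hypothetical):

```sql
-- Hypothetical employees table using all three collection types.
CREATE TABLE employees (
  name    STRING,
  skills  ARRAY<STRING>,                     -- e.g. a list of skill names
  phone   MAP<STRING, STRING>,               -- e.g. label -> phone number
  address STRUCT<city:STRING, zip:STRING>    -- named, typed fields
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':';
```

The ROW FORMAT clauses tell Hive which delimiters separate fields, collection items, and map keys when the data is stored as text.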

File Format Description

Text file     • The default file format; a line represents a record, and
                delimiter characters separate the fields within a record.
              • Text file examples are CSV, TSV, JSON, and XML.
Sequence file   A flat file that stores binary key-value pairs and supports
                compression.
RCFile          Record Columnar file.
ORCFILE         ORC stands for Optimized Row Columnar, which stores data in a
                more optimized way than the other file formats.
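The storage format is chosen per table with the STORED AS clause; a minimal sketch (the table names here are hypothetical):

```sql
-- Hypothetical tables showing per-table storage formats.
CREATE TABLE logs_text (line STRING) STORED AS TEXTFILE;  -- the default format
CREATE TABLE logs_orc  (line STRING) STORED AS ORC;       -- columnar, compressed
```

Columnar formats such as ORC typically give much better compression and scan performance for analytical queries than plain text files.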
HIVE Data Model
The Hive data model is designed to handle large-scale data processing and analysis
on top of distributed storage systems, typically Hadoop Distributed File System
(HDFS).
Key components of the Hive data model:

Name Description
Database • In Hive, a database is a logical container for tables.
• It helps organize and manage tables.
• Users can switch between databases to isolate their tables and queries.
Table • Tables in Hive are similar to tables in a relational database.
• They consist of rows and columns, and each column has a specified
data type.
• Hive supports both managed tables (where Hive manages the data)
and external tables (where data is stored outside Hive, and Hive
simply provides a schema).
Partition • Partitions are a way to organize data in Hive tables based on specific
columns.
• Partitioning is beneficial for improving query performance, as it
allows for the elimination of irrelevant data during query processing.
• For example, a table of log data might be partitioned by date.
Bucketing • Bucketing is a technique in Hive to distribute data within partitions
further.
• It involves dividing data into a fixed number of buckets based on the
hash of a column.
• Bucketing can be useful for optimizing certain types of queries, such
as join operations.
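The table, partition, and bucketing concepts above can be sketched in DDL as follows (the table and column names here are hypothetical):

```sql
-- Hypothetical log table: partitioned by date, bucketed by user_id.
CREATE TABLE logs (
  user_id BIGINT,
  message STRING
)
PARTITIONED BY (log_date STRING)           -- one directory per date value
CLUSTERED BY (user_id) INTO 32 BUCKETS     -- hash of user_id picks the bucket
STORED AS ORC;
```

A query that filters on log_date can then skip every partition directory that does not match, and the fixed bucket count lets Hive optimize joins on user_id.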
Hive Integration and Workflow Steps

The workflow steps

Execute Query   The Hive interface (CLI or web interface) sends a query to the
                driver to execute the query.
Get Plan        The driver sends the query to the query compiler, which parses
                the query to check the syntax and build the query plan.
Get Metadata    The compiler sends a metadata request to the Metastore (backed
                by a database such as MySQL).
Send Metadata   The Metastore sends metadata as a response to the compiler.
Send Plan       The compiler checks the requirements and resends the plan to the
                driver. The parsing and compiling of the query are complete at
                this point.
Execute Plan    The driver sends the execution plan to the execution engine.
Execute Job     • Internally, the execution of the job is a MapReduce job.
                • The execution engine sends the job to the JobTracker, which is
                  in the Name node, and it assigns this job to the TaskTracker,
                  which is in the Data node. Then the query executes the job.
Metadata        Meanwhile, the execution engine can execute metadata operations
Operations      with the Metastore.
Fetch Result    The execution engine receives the results from the Data nodes.
Send Results    The execution engine sends the results to the driver.
Send Results    The driver sends the results to the Hive interfaces.
Hive Built-in Functions

Return Type  Syntax                          Description

BIGINT   round(double a)                     Returns the rounded BIGINT (8-byte
                                             integer) value of the 8-byte double-
                                             precision floating-point number a.
BIGINT   floor(double a)                     Returns the maximum BIGINT value that
                                             is equal to or less than the double.
BIGINT   ceil(double a)                      Returns the minimum BIGINT value that
                                             is equal to or greater than the double.
double   rand(), rand(int seed)              Returns a random number (double)
                                             distributed uniformly from 0 to 1 that
                                             changes in each row. An integer seed
                                             ensures that the random number
                                             sequence is deterministic.
string   concat(string str1, string str2, ...)  Returns the string resulting from
                                             concatenating str1 with str2.
string   substr(string str, int start)       Returns the substring of str starting
                                             from the start position until the end
                                             of str.
string   substr(string str, int start, int length)  Returns the substring of str
                                             starting from the start position with
                                             the given length.
string   upper(string str), ucase(string str)  Returns the string resulting from
                                             converting all characters of str to
                                             upper case.
string   lower(string str), lcase(string str)  Returns the string resulting from
                                             converting all characters of str to
                                             lower case.
string   trim(string str)                    Returns the string resulting from
                                             trimming spaces from both ends.
                                             trim(' 12A34 ') returns '12A34'.
string   ltrim(string str)                   Returns the string resulting from
                                             trimming spaces from the beginning
                                             (left-hand side) of str.
                                             ltrim(' 12A34 ') returns '12A34 '.
string   rtrim(string str)                   Returns the string resulting from
                                             trimming spaces from the end
                                             (right-hand side) of str.
                                             rtrim(' 12A34 ') returns ' 12A34'.
int      year(string date)                   Returns the year part of a date or a
                                             timestamp string.
int      month(string date)                  Returns the month part of a date or a
                                             timestamp string.
int      day(string date)                    Returns the day part of a date or a
                                             timestamp string.
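Several of these functions combined in a single query (the table and column names here are hypothetical):

```sql
-- Hypothetical accounts table; demonstrates round, concat, upper, and year.
SELECT round(balance)                            AS rounded_balance,
       concat(upper(first_name), ' ', last_name) AS display_name,
       year(opened_date)                         AS opened_year
FROM accounts;
```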
Example: using the return data types in domain tables

In projects in domains such as banking or insurance, built with Oracle SQL or
HiveQL, these return data types are used to define the structure of database
tables, where they correspond to attributes or properties of entities in the
system.
BIGINT:
Commonly used for representing large integer values such as account numbers, policy
numbers, or unique identifiers.
DOUBLE:
Used for floating-point numbers, suitable for storing financial values that require
decimal precision, such as amounts or percentages.
STRING:
Typically used for storing textual data like names, addresses, or descriptions.

In the context of banking, it could be used for customer names or branch locations.

In insurance, it might store policyholder names or descriptions of coverage.


INT:
Used for storing integer values, which could represent various attributes like age,
duration, or any other numeric data that doesn't require a large range.
Banking Domain:

• BIGINT: Account numbers, transaction IDs.


• DOUBLE: Account balances, transaction amounts.
• STRING: Customer names, branch names, transaction descriptions.
• INT: Customer ages, transaction types.
Insurance Domain:

• BIGINT: Policy numbers, claim IDs.


• DOUBLE: Coverage amounts, claim amounts.
• STRING: Policyholder names, insurance types, claim descriptions.
• INT: Policy durations, insured item quantities.
CREATE TABLE BankAccounts (
  account_number BIGINT,
  balance        DOUBLE,
  customer_name  STRING,
  age            INT
);

CREATE TABLE InsurancePolicies (
  policy_number     BIGINT,
  coverage_amount   DOUBLE,
  policyholder_name STRING,
  duration          INT
);
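A sketch of querying the BankAccounts table above, combining the types discussed (the filter values are illustrative only):

```sql
-- Hypothetical query over the BankAccounts table defined above.
SELECT customer_name,
       round(balance) AS approx_balance
FROM BankAccounts
WHERE age >= 18
ORDER BY approx_balance DESC
LIMIT 10;
```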
