Hive - A Warehousing Solution Over A Map-Reduce Framework

This document provides an overview of Hive, a data warehousing solution built on Hadoop. It describes how Hive addresses the challenges of working with large datasets by letting data analysts query data in a SQL-like language called HiveQL. Hive organizes data into tables, partitions, and buckets stored in HDFS, and uses a metastore to hold metadata. It translates HiveQL queries into MapReduce jobs, which are executed to analyze the data in parallel. The document also discusses some pros and cons of Hive and compares it to the Pig framework.

Uploaded by

Ashwin Ajmera

Overview

• Why Hive?
• What is Hive?
• Hive Data Model
• Hive Architecture
• HiveQL
• Hive SerDes
• Pros and Cons
• Hive vs. Pig
• Graphs

Challenges that Data Analysts Faced

• Data explosion
- TBs of data generated every day

Solution: HDFS to store the data and the Hadoop Map-Reduce framework to parallelize processing of the data

What is the catch?
- Hadoop Map-Reduce is Java intensive
- Thinking in the Map-Reduce paradigm is not trivial

... Enter Hive!
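
Hive Key Principles

HiveQL to MapReduce

[Figure: the Hive framework. A data analyst submits SELECT COUNT(1) FROM Sales; against the Hive table Sales. Hive runs the query as an MR job instance in which the map tasks each emit (rowcount, 1) and the reduce step outputs (rowcount, N).]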
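
The SELECT COUNT(1) FROM Sales; query above can be pictured as a very small map-reduce job. The following is a minimal Python sketch of that idea, not the plan Hive actually generates: the names map_phase and reduce_phase and the sample input splits are hypothetical. Each mapper emits one (rowcount, 1) pair per row, and a single reducer sums them into (rowcount, N).

```python
from collections import defaultdict


def map_phase(rows):
    """Emit one ("rowcount", 1) pair per input row, as a mapper would."""
    for _ in rows:
        yield ("rowcount", 1)


def reduce_phase(pairs):
    """Sum the values seen for each key, as a reducer would."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical input splits of the Sales table, one list per mapper.
splits = [
    [(1, 9.99), (2, 5.00)],
    [(3, 12.50)],
]

# Run every mapper, then feed all emitted pairs to a single reducer.
emitted = [pair for split in splits for pair in map_phase(split)]
result = reduce_phase(emitted)
print(result)
```

The point of the sketch is only that the analyst writes one line of HiveQL while the counting logic above is what has to run underneath.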

Hive Data Model

Data in Hive is organized into:
• Tables
• Partitions
• Buckets

Hive Data Model (contd.)

• Tables
- Analogous to relational tables
- Each table has a corresponding directory in HDFS
- Data is serialized and stored as files within that directory
- Hive has a built-in default serialization that supports compression and lazy deserialization
- Users can specify custom serialization/deserialization schemes (SerDes)

Hive Data Model (contd.)

• Partitions
- Each table can be broken into partitions
- Partitions determine the distribution of data within subdirectories

Example:

CREATE TABLE Sales (sale_id INT, amount FLOAT)
PARTITIONED BY (country STRING, year INT, month INT);

Each partition is then stored in its own folder, for example:

Sales/country=US/year=2012/month=12
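
The mapping from partition values to a folder can be sketched in a few lines of Python. This is an illustration of the key=value directory layout shown above, not Hive code; the function name partition_path and its arguments are hypothetical.

```python
def partition_path(table_dir, partition_columns, values):
    """Build the directory of one partition as nested key=value folders.

    partition_columns gives the order from the PARTITIONED BY clause;
    values maps each partition column to its value for this partition.
    """
    parts = [f"{col}={values[col]}" for col in partition_columns]
    return "/".join([table_dir] + parts)


path = partition_path(
    "Sales",
    ["country", "year", "month"],
    {"country": "US", "year": 2012, "month": 12},
)
print(path)
```

Because the partition values are encoded in the path, a query that filters on country, year, or month only needs to read the matching subdirectories.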
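
Hierarchy of Hive Partitions

[Figure: directory tree rooted at /hivebase/Sales. The first level holds country directories (/country=US, /country=CANADA), the next level year directories (/year=2012, /year=2014, /year=2015), then month directories (/month=11, /month=12), with the data files at the leaves.]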

Hive Data Model (contd.)

• Buckets
- Data in each partition is divided into buckets
- Bucketing is based on a hash function of a column
- H(column) mod NumBuckets = bucket number
- Each bucket is stored as a file in the partition directory
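
The bucket formula above can be tried out with a short Python sketch. The hash used here is a stand-in chosen for illustration (integers hash to themselves, strings to a simple polynomial hash); it is an assumption and is not claimed to match the hash Hive applies to every column type. The names stand_in_hash and bucket_number are hypothetical.

```python
def stand_in_hash(value):
    """Deterministic stand-in for H(column); an assumption, not Hive's hash."""
    if isinstance(value, int):
        return value
    h = 0
    for ch in str(value):
        h = (31 * h + ord(ch)) & 0x7FFFFFFF
    return h


def bucket_number(value, num_buckets):
    """H(column) mod NumBuckets, with the hash masked to stay non-negative."""
    return (stand_in_hash(value) & 0x7FFFFFFF) % num_buckets


# Assign five hypothetical sale_id values to 4 buckets.
sale_ids = [10, 11, 12, 13, 14]
assignments = {sale_id: bucket_number(sale_id, 4) for sale_id in sale_ids}
print(assignments)
```

Rows with the same column value always land in the same bucket file, which is what makes bucketing useful for sampling and for joins on the bucketed column.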

Architecture

• External interfaces: CLI, web UI, and the JDBC and ODBC programming interfaces
• Thrift Server: cross-language service framework
• Metastore: metadata about the Hive tables and partitions
• Driver: the brain of Hive! Compiler, optimizer, and execution engine

Hive Thrift Server

• Framework for cross-language services
• Server written in Java
• Support for clients written in different languages
- JDBC (Java), ODBC (C++), and PHP, Perl, and Python scripts

Metastore

• System catalog that contains metadata about the Hive tables
• Stored in an RDBMS or the local file system; HDFS is too slow because it is not optimized for random access
• Objects in the Metastore
- Database: namespace of tables
- Table: list of columns, types, owner, storage, and SerDes
- Partition: partition-specific columns, SerDes, and storage

Hive Driver

• Driver: maintains the lifecycle of a HiveQL statement
• Query Compiler: compiles HiveQL into a DAG of map-reduce tasks
• Executor: executes the task plan generated by the compiler in proper dependency order and interacts with the underlying Hadoop instance
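
Executing a DAG of tasks "in proper dependency order" amounts to a topological traversal. The sketch below shows the idea with Python's standard graphlib module (Python 3.9+). The task names and the execute helper are hypothetical, and the real executor submits map-reduce jobs to Hadoop rather than calling a local function.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each task maps to the set of tasks it depends on.
plan = {
    "scan_sales": set(),
    "scan_customers": set(),
    "join": {"scan_sales", "scan_customers"},
    "group_by": {"join"},
}


def execute(plan, run_task):
    """Run every task only after all of its dependencies have run."""
    order = []
    for task in TopologicalSorter(plan).static_order():
        run_task(task)
        order.append(task)
    return order


# run_task is a no-op here; it stands in for launching a map-reduce job.
executed = execute(plan, run_task=lambda task: None)
print(executed)
```

The two scan tasks do not depend on each other, so their relative order is free; only the join and the group-by are constrained.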

Compiler

• Converts HiveQL into a plan for execution
• Plans can consist of
- Metadata operations for DDL statements, e.g. CREATE
- HDFS operations, e.g. LOAD
• Semantic Analyzer: checks schema information and performs type checking, implicit type conversion, and column verification
• Optimizer: finds the best logical plan, e.g. combines multiple joins in a way that reduces the number of map-reduce jobs and prunes columns early to minimize data transfer
• Physical plan generator: creates the DAG of map-reduce jobs

HiveQL

DDL:
CREATE DATABASE
CREATE TABLE
ALTER TABLE
SHOW TABLES
DESCRIBE

DML:
LOAD DATA
INSERT

QUERY:
SELECT
GROUP BY
JOIN
MULTI TABLE INSERT

Hive SerDe

• Hive built-in SerDes: Avro, ORC, Regex, etc.
• Can use custom SerDes (e.g. for unstructured data like audio/video data, or semi-structured XML data)

[Figure: read path of a SELECT query. Hive Table -> Record Reader -> Deserialize -> Hive Row Object -> Object Inspector (Map Fields) -> End User]
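
To make the deserialize and map-fields steps concrete, here is a small Python analogue of a regex-style deserializer. It is only an illustration of the idea and does not use Hive's SerDe interfaces; the record format, LINE_PATTERN, COLUMNS, deserialize, and get_field are all hypothetical.

```python
import re

# Hypothetical record format: "<ip> <status> <bytes>".
LINE_PATTERN = re.compile(r"(\S+) (\d{3}) (\d+)")
COLUMNS = ["ip", "status", "bytes"]


def deserialize(record):
    """Turn one raw record into a row, or None if it does not match."""
    match = LINE_PATTERN.fullmatch(record)
    if match is None:
        return None
    return list(match.groups())


def get_field(row, column):
    """Play the object inspector's role: look a field up by column name."""
    return row[COLUMNS.index(column)]


row = deserialize("10.0.0.1 200 512")
print(get_field(row, "status"))
```

Keeping the field lookup separate from deserialization mirrors the split in the figure: the row object stays opaque, and a separate inspector knows how to pull individual fields out of it.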

Good Things

• Boon for data analysts
• Easy learning curve
• Hides the underlying Map-Reduce completely from the user
• Partitions (speed!)
• Flexibility to load data from the local FS or HDFS into Hive tables

Cons and Possible Improvements

• Extend SQL query support (updates, deletes)
• Parallelize the firing of independent jobs from the work DAG
• Table statistics in the Metastore
• Explore methods for multi-query optimization
• Perform N-way generic joins in a single map-reduce job
• Better debug support in the shell

Hive vs. Pig

Similarities:
- Both are high-level languages that work on top of the map-reduce framework
- They can coexist, since both use the underlying HDFS and map-reduce

Differences:
• Language
- Pig is procedural (A = load 'mydata'; dump A;)
- Hive is declarative (select * from A)
• Work type
- Pig is more suited to ad hoc analysis (e.g. on-demand analysis of click-stream search logs)
- Hive is a reporting tool (e.g. weekly BI reporting)

Hive vs. Pig (contd.)

Differences:
• Users
- Pig: researchers, programmers (build complex data pipelines, machine learning)
- Hive: business analysts
• Integration
- Pig: doesn't have a Thrift server (i.e. no or limited cross-language support)
- Hive: Thrift server
• Users' needs
- Pig: better dev environments and debuggers expected
- Hive: better integration with technologies expected (e.g. JDBC, ODBC)

Head-to-Head
(the bee, the pig, the elephant)

Version: Hadoop 0.18x, Pig: 786346, Hive: 786346

REFERENCES

• https://hive.apache.org/
• https://cwiki.apache.org/confluence/display/Hive/Presentations
• https://developer.yahoo.com/blogs/hadoop/comparing-pig-latin-sql-constructing-data-processing-pipelines-444.html
• http://www.qubole.com/blog/big-data/hive-best-practices/
• Hortonworks tutorials (YouTube)
• Graph: https://issues.apache.org/jira/secure/attachment/12411185/hive_benchmark_2009-06-18.pdf

