
Introduction to Hive

Big data and Hadoop

• The term 'Big Data' refers to collections of large datasets characterized by huge volume, high velocity, and a variety of data, growing day by day.
• Big Data is difficult to process using traditional data management systems. Therefore, the Apache Software Foundation introduced a framework called Hadoop to solve Big Data management and processing challenges.
Hadoop

• Hadoop is an open-source framework to store and process Big Data in a distributed environment. It contains two modules: MapReduce and the Hadoop Distributed File System (HDFS).
– MapReduce: a parallel programming model for processing large amounts of structured, semi-structured, and unstructured data on large clusters of commodity hardware.
– HDFS: the Hadoop Distributed File System is the part of the Hadoop framework used to store the datasets. It provides a fault-tolerant file system that runs on commodity hardware.
Hadoop Tools

• The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and Hive that complement the Hadoop modules.
– Sqoop: used to import and export data between HDFS and an RDBMS.
– Pig: a procedural language platform used to develop scripts for MapReduce operations.
– Hive: a platform used to develop SQL-type scripts for MapReduce operations.
Ways to execute MapReduce

• The traditional approach: a Java MapReduce program, for structured, semi-structured, and unstructured data.
• The scripting approach: Pig scripts that compile to MapReduce, for structured and semi-structured data.
• The Hive Query Language (HiveQL or HQL): SQL-like queries that compile to MapReduce, for structured data.
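As a sketch of the contrast: a row count that would take dozens of lines of Java MapReduce code is a one-line HiveQL query (the `sales` table name here is illustrative):

```sql
-- Count all rows; Hive compiles this into a MapReduce job automatically,
-- replacing a hand-written Java Mapper/Reducer pair.
SELECT COUNT(*) FROM sales;
```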
Challenges that Data Analysts faced

Data Explosion
- TBs of data generated every day
Solution: HDFS to store the data, and the Hadoop MapReduce framework to parallelize its processing
What is the catch?
- Hadoop MapReduce is Java intensive
- Thinking in the MapReduce paradigm can get tricky
... Enter Hive!
Hive Key Principles

Data Warehouse: DWs are central repositories of integrated data from one or more disparate sources - a system used for reporting and data analysis.
ETL: the process of Extracting data from a source, Transforming it, and Loading it into the data warehouse.
What is hive?

• Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
• It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
• Hive was initially developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive.
• It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
HiveQL to MapReduce

[Diagram: a data analyst submits SELECT COUNT(1) FROM Sales; to the Hive framework, which compiles it into a MapReduce job instance over the Sales Hive table - each mapper emits (rowcount, 1) pairs, which are summed to (rowcount, N).]
Hive Data Model

Data in Hive is organized into:
- Tables
- Partitions
- Buckets
Hive Data Model Contd.

Tables
- Analogous to relational tables
- Each table has a corresponding directory in HDFS
- Data is serialized and stored as files within that directory
- Hive has default serialization built in, which supports compression and lazy deserialization
- Users can specify custom serialization/deserialization schemes (SerDes)
Hive Data Model Contd.

Partitions
- Each table can be broken into partitions
- Partitions determine the distribution of data within subdirectories
Example -
CREATE TABLE Sales (sale_id INT, amount FLOAT)
PARTITIONED BY (country STRING, year INT, month INT);
So each partition will be split out into different folders like
Sales/country=US/year=2012/month=12
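Continuing the Sales example above, a query that filters on the partition columns only has to read the matching subdirectories (partition pruning):

```sql
-- Only the Sales/country=US/year=2012/month=12 directory is scanned,
-- not the whole table.
SELECT sale_id, amount
FROM Sales
WHERE country = 'US' AND year = 2012 AND month = 12;
```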
Hierarchy of Hive Partitions

[Diagram: example partition hierarchy]
/hivebase/Sales
    /country=US
        /year=2012
            /month=11  -> data files
            /month=12  -> data files
    /country=CANADA
        /year=2012
        /year=2014
        /year=2015
            /month=11  -> data files
Hive Data Model Contd.

Buckets
- Data in each partition is divided into buckets
- Based on a hash function of a column:
  H(column) mod NumBuckets = bucket number
- Each bucket is stored as a file in the partition directory
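A minimal sketch of declaring buckets at table-creation time (the table and column names are illustrative):

```sql
-- Within each dt partition, rows are assigned to one of 32 files by
-- hash(user_id) mod 32.
CREATE TABLE user_actions (user_id BIGINT, action STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;
```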
Hive is not:

• A relational database
• A design for OnLine Transaction
Processing (OLTP)
• A language for real-time queries and
row-level updates
Features of Hive

• It stores the schema in a database and the processed data in HDFS.
• It is designed for OLAP.
• It provides an SQL-type language for querying called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
Hive Architecture

• User Interface
– Hive is data warehouse infrastructure software that mediates interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HDInsight (on Windows Server).
• Meta Store
– Hive uses a database server to store the schema, or metadata, of tables and databases: the columns in a table, their data types, and the HDFS mapping.
Hive Architecture

• HiveQL Process Engine
– HiveQL is an SQL-like language for querying against the schema information in the Metastore. It replaces the traditional approach of writing a MapReduce program in Java: instead, we write a query for the MapReduce job and Hive processes it.
• Execution Engine
– The Hive Execution Engine is the conjunction of the HiveQL Process Engine and MapReduce. It executes the query and generates the same results as a MapReduce job would.
• HDFS or HBase
– The Hadoop Distributed File System or HBase is the underlying storage technique used to store the data itself.
Working of Hive

[Diagram: the execution flow between the Hive interfaces, Driver, Compiler, Metastore, and Execution Engine - Execute Query, Get Plan, Send Plan, Execute Plan, Execute Job, and 7.1 execute metadata operations - detailed step by step on the next slides.]
Execution of Hive

1. Execute Query
The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2. Get Plan
The driver takes the help of the query compiler, which parses the query to check the syntax and build the query plan.
3. Get Metadata
The compiler sends a metadata request to the Metastore (any database).
4. Send Metadata
The Metastore sends the metadata as a response to the compiler.
Execution of Hive

5. Send Plan
The compiler checks the requirements and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
6. Execute Plan
The driver sends the execution plan to the execution engine.
7. Execute Job
Internally, the job executes as a MapReduce job. The execution engine sends the job to the JobTracker (on the Name node), which assigns it to TaskTrackers (on the Data nodes). There, the query runs as a MapReduce job.
Execution of Hive

7.1 Metadata Ops
Meanwhile, during execution, the execution engine can perform metadata operations against the Metastore.
8. Fetch Result
The execution engine receives the results from the Data nodes.
9. Send Results
The execution engine sends those result values to the driver.
10. Send Results
The driver sends the results to the Hive interfaces.
HiveQL

DDL:
CREATE DATABASE
CREATE TABLE
ALTER TABLE
SHOW TABLES
DESCRIBE

DML:
LOAD DATA
INSERT

QUERY:
SELECT
GROUP BY
JOIN
MULTI TABLE INSERT
Hive SerDe

- Hive's built-in SerDes include Avro, ORC, Regex, etc.
- Users can plug in custom SerDes (e.g. for unstructured data like audio/video data, or semi-structured XML data).
[Diagram: for a SELECT query, the Record Reader reads records from the Hive table, the SerDe deserializes each record into a Hive Row Object, and the Object Inspector maps its fields for the end user.]
Data Hierarchy

Hive is organised hierarchically into:
- Databases: namespaces that separate tables and other objects
- Tables: homogeneous units of data with the same schema; analogous to tables in an RDBMS
- Partitions: determine how the data is stored; allow efficient access to subsets of the data
- Buckets/clusters: for subsampling within a partition, and for join optimization
HiveQL

HiveQL / HQL provides the basic SQL-like operations:
- Select columns using SELECT
- Filter rows using WHERE
- JOIN between tables
- Evaluate aggregates using GROUP BY
- Store query results into another table
- Download results to a local directory (i.e., export from HDFS)
- Manage tables and queries with CREATE, DROP, and ALTER
Primitive Data Types

Type                              Comments
TINYINT, SMALLINT, INT, BIGINT    1-, 2-, 4- and 8-byte integers
BOOLEAN                           TRUE/FALSE
FLOAT, DOUBLE                     Single and double precision real numbers
STRING                            Character string
TIMESTAMP                         Unix-epoch offset or datetime string
DECIMAL                           Arbitrary-precision decimal
BINARY                            Opaque; Hive ignores these bytes
Complex Data Types

Type     Comments
STRUCT   A collection of elements. If S is of type STRUCT {a INT, b INT}, then S.a returns element a.
MAP      Key-value tuples. If M is a map from 'group' to GID, then M['group'] returns the value of GID.
ARRAY    Indexed list. If A is an array of elements ['a','b','c'], then A[0] returns 'a'.
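The complex types above can be combined in a single table definition; a sketch with illustrative table and column names:

```sql
CREATE TABLE employees (
  name    STRING,
  skills  ARRAY<STRING>,                    -- e.g. ['sql', 'java']
  attrs   MAP<STRING, INT>,                 -- e.g. 'group' -> GID
  address STRUCT<city:STRING, zip:INT>
);

-- Accessing the nested values:
SELECT skills[0], attrs['group'], address.city FROM employees;
```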
Hive Warehouse

Hive tables are stored in the Hive "warehouse"
- Default HDFS location: /user/hive/warehouse
- Tables are stored as sub-directories in the warehouse directory
- Partitions are subdirectories of tables
- External tables are also supported in Hive
- The actual data is stored in flat files
Table types

Hive deals with two types of tables, internal and external, depending on how the data is loaded and how the schema is designed in Hive.
Create Table Syntax

CREATE TABLE table_name
  (col1 data_type,
   col2 data_type,
   col3 data_type,
   col4 data_type)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS format_type;
Loading And Inserting Data: Summary

Use this                  For this purpose
LOAD                      Load data from a file or directory
INSERT                    Load data from a query
                          - One partition at a time
                          - Use multiple INSERTs to insert into multiple partitions in the one query
CREATE TABLE AS (CTAS)    Insert data while creating a table
Add/modify external file  Load new data into an external table
Sample Select Clauses

Select from a single table:
SELECT *
FROM sales
WHERE amount > 10
  AND region = 'US';

Select from a partitioned table:
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31';
Relational Operators

ALL and DISTINCT
- Specify whether duplicate rows should be returned
- ALL is the default (all matching rows are returned)
- DISTINCT removes duplicate rows from the result set
WHERE
- Filters by expression
- Does not support IN, EXISTS or sub-queries in the WHERE clause
LIMIT
- Indicates the number of rows to be returned
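Putting these clauses together, a sketch against the sales table used elsewhere in these slides:

```sql
-- DISTINCT removes duplicate regions; LIMIT caps the result size.
-- (Note: IN/EXISTS sub-queries are not supported in WHERE here.)
SELECT DISTINCT region
FROM sales
WHERE amount > 10
LIMIT 100;
```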
Relational Operators

GROUP BY
- Groups data by column values
- The select list can only include columns that appear in the GROUP BY clause (plus aggregates)
ORDER BY / SORT BY
- ORDER BY performs a total ordering; slow, poor performance
- SORT BY performs a partial ordering; sorts the output of each reducer
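A sketch of the ORDER BY / SORT BY distinction, on the same illustrative sales table:

```sql
-- Total ordering: a single reducer sorts everything (slow on big data).
SELECT * FROM sales ORDER BY amount DESC;

-- Partial ordering: each reducer sorts only its own output (faster,
-- but the overall result is not globally sorted).
SELECT * FROM sales SORT BY amount DESC;
```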
Simple Table

CREATE TABLE page_view
  (viewTime INT,
   userid BIGINT,
   page_url STRING,
   referrer_url STRING,
   ip STRING COMMENT 'IP Address of the User')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
External Table

CREATE EXTERNAL TABLE page_view_stg
  (viewTime INT,
   userid BIGINT,
   page_url STRING,
   referrer_url STRING,
   ip STRING COMMENT 'IP Address of the User')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/staging/page_view';
More About Tables

CREATE TABLE
- LOAD: files are moved into Hive's data warehouse directory
- DROP: both metadata and data are deleted
CREATE EXTERNAL TABLE
- LOAD: no files are moved
- DROP: only the metadata is deleted
- Use this when sharing data with other Hadoop applications, or when you want to use multiple schemas on the same data
Partitioning

Can make some queries faster
- Data is divided based on the partition column(s)
- Use the PARTITIONED BY clause when creating the table
- Use the PARTITION clause when loading data
- SHOW PARTITIONS will show a table's partitions
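The three clauses above fit together as in this sketch (the table name and partition value are illustrative):

```sql
-- Declare the partition column at creation time...
CREATE TABLE logs (msg STRING)
PARTITIONED BY (dt STRING);

-- ...name the target partition when loading...
LOAD DATA INPATH '/data/logs_20120101.txt'
INTO TABLE logs PARTITION (dt='2012-01-01');

-- ...and list the partitions that now exist.
SHOW PARTITIONS logs;
```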
Bucketing

Can speed up queries that involve sampling the data
- Sampling works without bucketing, but then Hive has to scan the entire dataset
- Use CLUSTERED BY when creating the table
- For sorted buckets, add SORTED BY
- To query a sample of your data, use TABLESAMPLE
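A sketch of bucketed sampling (the table and column names are illustrative):

```sql
-- 16 sorted buckets, hashed on id.
CREATE TABLE users (id BIGINT, name STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 16 BUCKETS;

-- Read roughly 1/16 of the data: only bucket 1 of 16 is scanned.
SELECT * FROM users TABLESAMPLE (BUCKET 1 OUT OF 16 ON id);
```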
Browsing Tables And Partitions

Command                          Comments
SHOW TABLES;                     Show all the tables in the database
SHOW TABLES 'page.*';            Show tables matching the specification (uses regex syntax)
SHOW PARTITIONS page_view;       Show the partitions of the page_view table
DESCRIBE page_view;              List the columns of the table
DESCRIBE EXTENDED page_view;     More information on columns (useful only for debugging)
DESCRIBE page_view PARTITION (ds='2008-10-31');
                                 List information about a partition
Loading Data

Use LOAD DATA to load data from a file or directory
- Will read from HDFS unless the LOCAL keyword is specified
- Will append data unless OVERWRITE is specified
- PARTITION is required if the destination table is partitioned

LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-8_us.txt'
OVERWRITE INTO TABLE page_view
PARTITION (date='2008-06-08', country='US');
Inserting Data During Table Creation

Use AS SELECT in the CREATE TABLE statement to populate a table as it is created

CREATE TABLE page_view AS
SELECT pvs.viewTime, pvs.userid, pvs.page_url,
       pvs.referrer_url
FROM page_view_stg pvs
WHERE pvs.country = 'US';
Syntax of the SELECT Query

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list
  | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];

hive> SELECT * FROM employee WHERE salary > 30000;
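A slightly fuller sketch exercising the optional clauses, using the employee table from the example above (the department column is an assumption for illustration):

```sql
-- Average salary per department, keeping only groups whose average
-- exceeds 30000, returning at most 10 rows.
SELECT department, AVG(salary) AS avg_salary
FROM employee
GROUP BY department
HAVING AVG(salary) > 30000
ORDER BY avg_salary
LIMIT 10;
```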
Thank you
This presentation was created using LibreOffice Impress 4.2.8.2 and can be used freely under the GNU General Public License.

Web Resources:
http://mitu.co.in
http://tusharkute.com

Blogs:
http://digitallocha.blogspot.in
http://kyamputar.blogspot.in

[email protected]
