Hive
Data Explosion
- TBs of data generated every day
Solution: HDFS to store the data, and the Hadoop MapReduce framework to parallelize processing of that data
What is the catch?
- Hadoop MapReduce is Java intensive
- Thinking in the MapReduce paradigm can get tricky
… Enter Hive!
Hive Key Principles
[Diagram: a data analyst issues SQL-like queries to the Hive framework, which compiles them into MapReduce jobs; for example, mappers emit (rowcount, 1) pairs that reducers sum into (rowcount, N).]
Tables
- Analogous to relational tables
- Each table has a corresponding directory in HDFS
- Data is serialized and stored as files within that directory
- Hive has a default serialization built in which supports compression and lazy deserialization
- Users can specify custom serialization/deserialization schemes (SerDes)
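As a sketch, a SerDe can be named explicitly at table-creation time; the table and column names below are hypothetical, and the class is Hive's built-in LazySimpleSerDe:

```sql
-- Hypothetical table with an explicitly declared built-in SerDe
CREATE TABLE page_views (user_id BIGINT, url STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim' = '\t')
STORED AS TEXTFILE;
```

A custom SerDe is declared the same way, by substituting the name of a user-supplied class.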
Hive Data Model Contd.
Partitions
- Each table can be broken into partitions
- Partitions determine the distribution of data within subdirectories
Example:
CREATE TABLE Sales (sale_id INT, amount FLOAT)
PARTITIONED BY (country STRING, year INT, month INT)
Each partition is then split out into its own folder, e.g.
Sales/country=US/year=2012/month=12
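Building on the Sales table above, data can be loaded into one specific partition; the input path in this sketch is hypothetical:

```sql
-- Load a local file into a single partition of Sales (path is hypothetical)
LOAD DATA LOCAL INPATH '/tmp/sales_us_dec.csv'
INTO TABLE Sales
PARTITION (country = 'US', year = 2012, month = 12);
```

The partition column values are taken from the PARTITION clause, not from the file, and the file lands in the matching country=/year=/month= subdirectory.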
Hierarchy of Hive Partitions
/hivebase/Sales
    /country=US
        /year=2012
            /month=11
            /month=12
    /country=CANADA
        /year=2012
        /year=2014
        /year=2015
            /month=11
Data files sit at the leaf level of this directory tree.
Hive Data Model Contd.
Buckets
- Data in each partition is divided into buckets
- Based on a hash function of a chosen column:
  H(column) mod NumBuckets = bucket number
- Each bucket is stored as a file in the partition directory
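The bucketing scheme above is declared at table-creation time. This sketch reuses the columns of the earlier Sales example; the table name is hypothetical:

```sql
-- sale_id is hashed into 32 buckets within each partition:
-- H(sale_id) mod 32 picks the bucket file
CREATE TABLE Sales_bucketed (sale_id INT, amount FLOAT)
PARTITIONED BY (country STRING)
CLUSTERED BY (sale_id) INTO 32 BUCKETS;
```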
Hive is not:
• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates
Features of Hive
• User Interface
– Hive is data warehouse infrastructure software that enables interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HDInsight (on Windows Server).
• Meta Store
– Hive stores the schema, or metadata, of tables, databases, columns in a table, their data types, and the HDFS mapping in a database server of its choosing (the metastore).
Hive Architecture
[Diagram: Hive architecture. The UI sends the query to the driver; the driver gets a plan from the compiler; the compiler consults the metastore and sends the plan back; the driver then hands the plan to the execution engine, which executes the job.]
Execution of Hive
1. Execute Query
The Hive interface, such as the Command Line or Web UI, sends the query to the driver (any database driver such as JDBC, ODBC, etc.) to execute.
2. Get Plan
The driver takes the help of the query compiler, which parses the query to check the syntax and build the query plan.
3. Get Metadata
The compiler sends a metadata request to the metastore (any database).
4. Send Metadata
The metastore sends the metadata as a response to the compiler.
Execution of Hive Contd.
5. Send Plan
The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
6. Execute Plan
The driver sends the execution plan to the execution engine.
7. Execute Job
Internally, the execution of the job is a MapReduce job. The execution engine sends the job to the JobTracker, which resides on the name node, and the JobTracker assigns the job to TaskTrackers, which reside on the data nodes. Here, the query runs as a MapReduce job.
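The plan produced in steps 2–5 can be inspected directly with EXPLAIN; the table in this sketch is the Sales table from the earlier example:

```sql
-- Print the compiled execution plan (stages and map/reduce operators)
-- instead of running the query
EXPLAIN
SELECT country, SUM(amount)
FROM Sales
GROUP BY country;
```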
Hive Query Language (HiveQL)
DDL:
CREATE DATABASE
CREATE TABLE
ALTER TABLE
SHOW TABLES
DESCRIBE
DML:
LOAD DATA
INSERT
QUERY:
SELECT
GROUP BY
JOIN
MULTI TABLE INSERT
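The MULTI TABLE INSERT above scans the source once and writes to several targets in a single pass. The two summary tables in this sketch are hypothetical; the source is the Sales table from earlier:

```sql
-- One scan of Sales feeds two summary tables
FROM Sales
INSERT OVERWRITE TABLE sales_by_country
  SELECT country, SUM(amount)
  GROUP BY country
INSERT OVERWRITE TABLE sales_by_month
  SELECT month, SUM(amount)
  GROUP BY month;
```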
Primitive Data Types
Type - Comments
TINYINT, SMALLINT, INT, BIGINT - 1-, 2-, 4-, and 8-byte integers
BOOLEAN - TRUE/FALSE
FLOAT, DOUBLE - Single- and double-precision real numbers
STRING - Character string
TIMESTAMP - Unix-epoch offset or datetime string
DECIMAL - Arbitrary-precision decimal
BINARY - Opaque; Hive does not interpret these bytes
Complex Data Types
Type - Comments
STRUCT - A collection of named elements. If S is of type STRUCT {a INT, b INT}, then S.a returns element a.
MAP - Key-value tuple. If M is a map from 'group' to GID, then M['group'] returns the value of GID.
ARRAY - Indexed list. If A is an array of elements ['a','b','c'], then A[0] returns 'a'.
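The three complex types above can be declared and accessed together; the table and column names in this sketch are hypothetical:

```sql
CREATE TABLE users (
  name   STRUCT<first: STRING, last: STRING>,
  props  MAP<STRING, INT>,
  emails ARRAY<STRING>
);

-- Field access, key lookup, and index access
SELECT name.first, props['group'], emails[0]
FROM users;
```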
Hive Warehouse
GROUP BY
Groups data by column values
The select list can only include columns named in the GROUP BY clause (plus aggregate functions)
ORDER BY / SORT BY
ORDER BY performs a total ordering
Slow, poor performance (all rows flow through a single reducer)
SORT BY performs a partial ordering
Sorts the output from each reducer
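The two clauses above combine as in this sketch, which reuses the Sales table from the partitioning example:

```sql
-- Per-country totals; each reducer's output is sorted locally by country
SELECT country, SUM(amount) AS total
FROM Sales
GROUP BY country
SORT BY country;
```

Replacing SORT BY with ORDER BY would give one globally sorted result at the cost of funnelling everything through a single reducer.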
Simple Table
CREATE TABLE
LOAD: the file is moved into Hive's data warehouse directory
DROP: both metadata and data are deleted
CREATE EXTERNAL TABLE
LOAD: no files are moved
DROP: only the metadata is deleted
Use an external table when sharing data with other Hadoop applications, or when you want to use multiple schemas on the same data
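A minimal sketch of an external table; the column names and HDFS location are hypothetical:

```sql
-- Data stays at the given location; DROP TABLE removes only the metadata
CREATE EXTERNAL TABLE logs (ts STRING, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/shared/logs';
```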
Partitioning
Command - Comments
SHOW TABLES; - Show all the tables in the database
SHOW TABLES 'page.*'; - Show tables matching the specification (uses regex syntax)
SHOW PARTITIONS page_view; - Show the partitions of the page_view table
DESCRIBE page_view; - List columns of the table
DESCRIBE EXTENDED page_view; - More information on columns (useful only for debugging)
DESCRIBE page_view PARTITION (ds='2008-10-31'); - List information about a partition
Loading Data