
Introduction to Hive

Big data and Hadoop

• The term 'Big Data' refers to collections of large datasets characterized by huge volume, high velocity, and a variety of data, growing day by day.
• Big Data is difficult to process using traditional data management systems. Therefore, the Apache Software Foundation introduced a framework called Hadoop to solve Big Data management and processing challenges.
Hadoop

• Hadoop is an open-source framework to store and process Big Data in a distributed environment. It contains two modules: MapReduce and the Hadoop Distributed File System (HDFS).
– MapReduce: a parallel programming model for processing large amounts of structured, semi-structured, and unstructured data on large clusters of commodity hardware.
– HDFS: the Hadoop Distributed File System is the part of the Hadoop framework used to store the datasets. It provides a fault-tolerant file system that runs on commodity hardware.
Hadoop Tools

• The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and Hive that complement the Hadoop modules.
– Sqoop: used to import and export data between HDFS and an RDBMS.
– Pig: a procedural language platform used to develop scripts for MapReduce operations.
– Hive: a platform used to develop SQL-type scripts for MapReduce operations.
Ways to execute MapReduce

• The traditional approach: a Java MapReduce program, for structured, semi-structured, and unstructured data.
• The scripting approach: Pig scripts that compile to MapReduce, for structured and semi-structured data.
• The Hive Query Language (HiveQL or HQL): SQL-like queries that compile to MapReduce, for structured data.
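As a sketch of the contrast: a row count that would take dozens of lines of Java MapReduce code is a one-line HiveQL query (the `sales` table name here is illustrative):

```sql
-- Count all rows; Hive compiles this into a MapReduce job automatically,
-- replacing a hand-written Java Mapper/Reducer pair.
SELECT COUNT(*) FROM sales;
```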
Challenges that Data Analysts faced

Data Explosion
- TBs of data generated every day
Solution: HDFS to store the data, and the Hadoop MapReduce framework to parallelize its processing
What is the catch?
- Hadoop MapReduce is Java intensive
- Thinking in the MapReduce paradigm can get tricky
... Enter Hive!
Hive Key Principles

Data Warehouse: DWs are central repositories of integrated data from one or more disparate sources - a system used for reporting and data analysis.
ETL: the process of Extracting data from a source, Transforming it, and Loading it into the data warehouse.
What is hive?

• Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
• It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
• Hive was initially developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive.
• It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
HiveQL to MapReduce

[Diagram: a data analyst submits SELECT COUNT(1) FROM Sales; to the Hive framework, which compiles it into a MapReduce job instance over the Sales Hive table - each mapper emits (rowcount, 1) pairs, which are summed to (rowcount, N).]
Hive Data Model

Data in Hive is organized into:
- Tables
- Partitions
- Buckets
Hive Data Model Contd.

Tables
- Analogous to relational tables
- Each table has a corresponding directory in HDFS
- Data is serialized and stored as files within that directory
- Hive has default serialization built in, which supports compression and lazy deserialization
- Users can specify custom serialization/deserialization schemes (SerDes)
Hive Data Model Contd.

Partitions
- Each table can be broken into partitions
- Partitions determine the distribution of data within subdirectories
Example -
CREATE TABLE Sales (sale_id INT, amount FLOAT)
PARTITIONED BY (country STRING, year INT, month INT);
So each partition will be split out into different folders like
Sales/country=US/year=2012/month=12
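Continuing the Sales example above, a query that filters on the partition columns only has to read the matching subdirectories (partition pruning):

```sql
-- Only the Sales/country=US/year=2012/month=12 directory is scanned,
-- not the whole table.
SELECT sale_id, amount
FROM Sales
WHERE country = 'US' AND year = 2012 AND month = 12;
```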
Hierarchy of Hive Partitions

[Diagram: example partition hierarchy]
/hivebase/Sales
    /country=US
        /year=2012
            /month=11  -> data files
            /month=12  -> data files
    /country=CANADA
        /year=2012
        /year=2014
        /year=2015
            /month=11  -> data files
Hive Data Model Contd.

Buckets
- Data in each partition is divided into buckets
- Based on a hash function of a column:
  H(column) mod NumBuckets = bucket number
- Each bucket is stored as a file in the partition directory
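A minimal sketch of declaring buckets at table-creation time (the table and column names are illustrative):

```sql
-- Within each dt partition, rows are assigned to one of 32 files by
-- hash(user_id) mod 32.
CREATE TABLE user_actions (user_id BIGINT, action STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;
```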
Hive is not:

• A relational database
• A design for OnLine Transaction
Processing (OLTP)
• A language for real-time queries and
row-level updates
Features of Hive

• It stores the schema in a database and the processed data in HDFS.
• It is designed for OLAP.
• It provides an SQL-type language for querying called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
Hive Architecture

• User Interface
– Hive is data warehouse infrastructure software that mediates interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HDInsight (on Windows Server).
• Meta Store
– Hive uses a database server to store the schema, or metadata, of tables and databases: the columns in a table, their data types, and the HDFS mapping.
Hive Architecture

• HiveQL Process Engine
– HiveQL is an SQL-like language for querying against the schema information in the Metastore. It replaces the traditional approach of writing a MapReduce program in Java: instead, we write a query for the MapReduce job and Hive processes it.
• Execution Engine
– The Hive Execution Engine is the conjunction of the HiveQL Process Engine and MapReduce. It executes the query and generates the same results as a MapReduce job would.
• HDFS or HBase
– The Hadoop Distributed File System or HBase is the underlying storage technique used to store the data itself.
Working of Hive

[Diagram: the execution flow between the Hive interfaces, Driver, Compiler, Metastore, and Execution Engine - Execute Query, Get Plan, Send Plan, Execute Plan, Execute Job, and 7.1 execute metadata operations - detailed step by step on the next slides.]
Execution of Hive

1. Execute Query
The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2. Get Plan
The driver takes the help of the query compiler, which parses the query to check the syntax and build the query plan.
3. Get Metadata
The compiler sends a metadata request to the Metastore (any database).
4. Send Metadata
The Metastore sends the metadata as a response to the compiler.
Execution of Hive

5. Send Plan
The compiler checks the requirements and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
6. Execute Plan
The driver sends the execution plan to the execution engine.
7. Execute Job
Internally, the job executes as a MapReduce job. The execution engine sends the job to the JobTracker (on the Name node), which assigns it to TaskTrackers (on the Data nodes). There, the query runs as a MapReduce job.
Execution of Hive

7.1 Metadata Ops
Meanwhile, during execution, the execution engine can perform metadata operations against the Metastore.
8. Fetch Result
The execution engine receives the results from the Data nodes.
9. Send Results
The execution engine sends those result values to the driver.
10. Send Results
The driver sends the results to the Hive interfaces.
HiveQL

DDL:
CREATE DATABASE
CREATE TABLE
ALTER TABLE
SHOW TABLES
DESCRIBE

DML:
LOAD DATA
INSERT

QUERY:
SELECT
GROUP BY
JOIN
MULTI TABLE INSERT
Hive SerDe

- Hive's built-in SerDes include Avro, ORC, Regex, etc.
- Users can plug in custom SerDes (e.g. for unstructured data like audio/video data, or semi-structured XML data).
[Diagram: for a SELECT query, the Record Reader reads records from the Hive table, the SerDe deserializes each record into a Hive Row Object, and the Object Inspector maps its fields for the end user.]
Data Hierarchy

Hive is organised hierarchically into:
- Databases: namespaces that separate tables and other objects
- Tables: homogeneous units of data with the same schema; analogous to tables in an RDBMS
- Partitions: determine how the data is stored; allow efficient access to subsets of the data
- Buckets/clusters: for subsampling within a partition, and for join optimization
HiveQL

HiveQL / HQL provides the basic SQL-like operations:
- Select columns using SELECT
- Filter rows using WHERE
- JOIN between tables
- Evaluate aggregates using GROUP BY
- Store query results into another table
- Download results to a local directory (i.e., export from HDFS)
- Manage tables and queries with CREATE, DROP, and ALTER
Primitive Data Types

Type                              Comments
TINYINT, SMALLINT, INT, BIGINT    1-, 2-, 4- and 8-byte integers
BOOLEAN                           TRUE/FALSE
FLOAT, DOUBLE                     Single and double precision real numbers
STRING                            Character string
TIMESTAMP                         Unix-epoch offset or datetime string
DECIMAL                           Arbitrary-precision decimal
BINARY                            Opaque; Hive ignores these bytes
Complex Data Types

Type     Comments
STRUCT   A collection of elements. If S is of type STRUCT {a INT, b INT}, then S.a returns element a.
MAP      Key-value tuples. If M is a map from 'group' to GID, then M['group'] returns the value of GID.
ARRAY    Indexed list. If A is an array of elements ['a','b','c'], then A[0] returns 'a'.
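The complex types above can be combined in a single table definition; a sketch with illustrative table and column names:

```sql
CREATE TABLE employees (
  name    STRING,
  skills  ARRAY<STRING>,                    -- e.g. ['sql', 'java']
  attrs   MAP<STRING, INT>,                 -- e.g. 'group' -> GID
  address STRUCT<city:STRING, zip:INT>
);

-- Accessing the nested values:
SELECT skills[0], attrs['group'], address.city FROM employees;
```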
Hive Warehouse

Hive tables are stored in the Hive "warehouse"
- Default HDFS location: /user/hive/warehouse
- Tables are stored as sub-directories in the warehouse directory
- Partitions are subdirectories of tables
- External tables are also supported in Hive
- The actual data is stored in flat files
Table types

Hive deals with two types of tables, internal and external, depending on how the data is loaded and how the schema is designed in Hive.
Create Table Syntax

CREATE TABLE table_name
  (col1 data_type,
   col2 data_type,
   col3 data_type,
   col4 data_type)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS format_type;
Loading And Inserting Data: Summary

Use this                  For this purpose
LOAD                      Load data from a file or directory
INSERT                    Load data from a query
                          - One partition at a time
                          - Use multiple INSERTs to insert into multiple partitions in the one query
CREATE TABLE AS (CTAS)    Insert data while creating a table
Add/modify external file  Load new data into an external table
Sample Select Clauses

Select from a single table:
SELECT *
FROM sales
WHERE amount > 10
  AND region = 'US';

Select from a partitioned table:
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31';
Relational Operators

ALL and DISTINCT
- Specify whether duplicate rows should be returned
- ALL is the default (all matching rows are returned)
- DISTINCT removes duplicate rows from the result set
WHERE
- Filters by expression
- Does not support IN, EXISTS or sub-queries in the WHERE clause
LIMIT
- Indicates the number of rows to be returned
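Putting these clauses together, a sketch against the sales table used elsewhere in these slides:

```sql
-- DISTINCT removes duplicate regions; LIMIT caps the result size.
-- (Note: IN/EXISTS sub-queries are not supported in WHERE here.)
SELECT DISTINCT region
FROM sales
WHERE amount > 10
LIMIT 100;
```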
Relational Operators

GROUP BY
- Groups data by column values
- The select list can only include columns that appear in the GROUP BY clause (plus aggregates)
ORDER BY / SORT BY
- ORDER BY performs a total ordering; slow, poor performance
- SORT BY performs a partial ordering; sorts the output of each reducer
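A sketch of the ORDER BY / SORT BY distinction, on the same illustrative sales table:

```sql
-- Total ordering: a single reducer sorts everything (slow on big data).
SELECT * FROM sales ORDER BY amount DESC;

-- Partial ordering: each reducer sorts only its own output (faster,
-- but the overall result is not globally sorted).
SELECT * FROM sales SORT BY amount DESC;
```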
Simple Table

CREATE TABLE page_view
  (viewTime INT,
   userid BIGINT,
   page_url STRING,
   referrer_url STRING,
   ip STRING COMMENT 'IP Address of the User')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
External Table

CREATE EXTERNAL TABLE page_view_stg
  (viewTime INT,
   userid BIGINT,
   page_url STRING,
   referrer_url STRING,
   ip STRING COMMENT 'IP Address of the User')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/staging/page_view';
More About Tables

CREATE TABLE
- LOAD: files are moved into Hive's data warehouse directory
- DROP: both metadata and data are deleted
CREATE EXTERNAL TABLE
- LOAD: no files are moved
- DROP: only the metadata is deleted
- Use this when sharing data with other Hadoop applications, or when you want to use multiple schemas on the same data
Partitioning

Can make some queries faster
- Data is divided based on the partition column(s)
- Use the PARTITIONED BY clause when creating the table
- Use the PARTITION clause when loading data
- SHOW PARTITIONS will show a table's partitions
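The three clauses above fit together as in this sketch (the table name and partition value are illustrative):

```sql
-- Declare the partition column at creation time...
CREATE TABLE logs (msg STRING)
PARTITIONED BY (dt STRING);

-- ...name the target partition when loading...
LOAD DATA INPATH '/data/logs_20120101.txt'
INTO TABLE logs PARTITION (dt='2012-01-01');

-- ...and list the partitions that now exist.
SHOW PARTITIONS logs;
```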
Bucketing

Can speed up queries that involve sampling the data
- Sampling works without bucketing, but then Hive has to scan the entire dataset
- Use CLUSTERED BY when creating the table
- For sorted buckets, add SORTED BY
- To query a sample of your data, use TABLESAMPLE
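A sketch of bucketed sampling (the table and column names are illustrative):

```sql
-- 16 sorted buckets, hashed on id.
CREATE TABLE users (id BIGINT, name STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 16 BUCKETS;

-- Read roughly 1/16 of the data: only bucket 1 of 16 is scanned.
SELECT * FROM users TABLESAMPLE (BUCKET 1 OUT OF 16 ON id);
```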
Browsing Tables And Partitions

Command                          Comments
SHOW TABLES;                     Show all the tables in the database
SHOW TABLES 'page.*';            Show tables matching the specification (uses regex syntax)
SHOW PARTITIONS page_view;       Show the partitions of the page_view table
DESCRIBE page_view;              List the columns of the table
DESCRIBE EXTENDED page_view;     More information on columns (useful only for debugging)
DESCRIBE page_view PARTITION (ds='2008-10-31');
                                 List information about a partition
Loading Data

Use LOAD DATA to load data from a file or directory
- Will read from HDFS unless the LOCAL keyword is specified
- Will append data unless OVERWRITE is specified
- PARTITION is required if the destination table is partitioned

LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-8_us.txt'
OVERWRITE INTO TABLE page_view
PARTITION (date='2008-06-08', country='US');
Inserting Data During Table Creation

Use AS SELECT in the CREATE TABLE statement to populate a table as it is created

CREATE TABLE page_view AS
SELECT pvs.viewTime, pvs.userid, pvs.page_url,
       pvs.referrer_url
FROM page_view_stg pvs
WHERE pvs.country = 'US';
Syntax of the SELECT Query

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list
  | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];

hive> SELECT * FROM employee WHERE salary > 30000;
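A slightly fuller sketch exercising the optional clauses, using the employee table from the example above (the department column is an assumption for illustration):

```sql
-- Average salary per department, keeping only groups whose average
-- exceeds 30000, returning at most 10 rows.
SELECT department, AVG(salary) AS avg_salary
FROM employee
GROUP BY department
HAVING AVG(salary) > 30000
ORDER BY avg_salary
LIMIT 10;
```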
Thank you
This presentation was created using LibreOffice Impress 4.2.8.2 and can be used freely under the GNU General Public License.

Web Resources:
http://mitu.co.in
http://tusharkute.com

Blogs:
http://digitallocha.blogspot.in
http://kyamputar.blogspot.in

[email protected]
