07 Hive 01
8CAI4-01
Unit 6 (Hive)
Agenda
• Hive Overview and Concepts
• Installation
• Table Creation and Deletion
• Loading Data into Hive
• Partitioning
• Bucketing
• Joins
Hive
• Data Warehousing Solution built on top of
Hadoop
• Provides SQL-like query language named
HiveQL
– Minimal learning curve for people with SQL expertise
– Data analysts are target audience
• Early Hive development work started at
Facebook in 2007
• Today Hive is an Apache project under
Hadoop
– http://hive.apache.org
Hive Provides
• Ability to bring structure to various data
formats
• Simple interface for ad hoc querying,
analyzing and summarizing large amounts
of data
• Access to files on various data stores such
as HDFS and HBase
Hive
• Hive does NOT provide low latency or real-
time queries
• Even querying small amounts of data may
take minutes
• Designed for scalability and ease-of-use
rather than low latency responses
Hive
• Translates HiveQL statements into a set of
MapReduce Jobs which are then executed on a
Hadoop Cluster
[Diagram: HiveQL statements are submitted from a client machine to Hive, which executes them as MapReduce jobs on the Hadoop cluster and monitors/reports on their progress]

CREATE TABLE posts (user STRING, post STRING, time BIGINT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH 'data/user-posts.txt'
OVERWRITE INTO TABLE posts;
Hive Metastore
• To support features like schema(s) and data
partitioning Hive keeps its metadata in a
Relational Database
– Packaged with Derby, a lightweight embedded SQL DB
• The default Derby-based metastore is good for evaluation and testing
• The schema is not shared between users, since each user has their own instance of embedded Derby
• It is stored in a metastore_db directory, which resides in the directory that Hive was started from
– Can easily switch to another SQL installation such as MySQL (a configuration sketch follows)
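A sketch of the hive-site.xml properties for pointing the metastore at MySQL instead of embedded Derby; the property names are Hive's standard JDO settings, while the host, database name and credentials below are placeholders, not from the slides:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <!-- placeholder host and database name -->
  <value>jdbc:mysql://localhost:3306/hive_metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>  <!-- placeholder -->
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepass</value>  <!-- placeholder -->
</property>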
Hive Architecture
[Diagram: Hive architecture. Command-line and JDBC/other clients talk to Hive, which consists of a Metastore, a Query Parser and an Executor, and runs on top of Hadoop (HDFS and MapReduce).]
Hive Concepts
• Re-used from Relational Databases
– Database: Set of Tables, used for name conflict resolution
– Table: Set of Rows that have the same schema (same columns)
– Row: A single record; a set of columns
– Column: provides value and type for a single value
[Diagram: nested boxes showing a Database containing Tables, a Table containing Rows, and a Row containing Columns]
Installation Prerequisites
• Java 6
– Just Like Hadoop
• Hadoop 0.20.x+
– No surprise here
Hive Installation
• Set $HADOOP_HOME environment variable
– Was done as a part of HDFS installation
• Set $HIVE_HOME and add hive to the PATH
export HIVE_HOME=$CDH_HOME/hive-0.8.1-cdh4.0.0
export PATH=$PATH:$HIVE_HOME/bin
Hive Installation
• Similar to other Hadoop projects, Hive's configuration is in $HIVE_HOME/conf/hive-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:10040</value>
  </property>
</configuration>
Simple Example
1. Create a Table
2. Load Data into a Table
3. Query Data
4. Drop the Table
1: Create a Table
• Let’s create a table to store data from
$PLAY_AREA/data/user-posts.txt
Launch the Hive Command Line Interface (CLI):

$ cd $PLAY_AREA
$ hive
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201208022144_2014345460.txt
hive> !cat data/user-posts.txt;
user1,Funny Story,1343182026191
user2,Cool Deal,1343182133839
user4,Interesting Post,1343182154633
user5,Yet Another Blog,13431839394
hive>

• The Hive history file line shows the location of the session's log file
• Local commands can be executed within the CLI by placing a command between ! and ;
• Values are separated by ',' and each row represents a record; the first value is the user name, the second is the post content and the third is the timestamp
1: Create a Table
hive> CREATE TABLE posts (user STRING, post STRING, time BIGINT)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;
OK
Time taken: 10.606 seconds

• 1st line: creates a table with 3 columns
• 2nd and 3rd lines: how the underlying file should be parsed
• 4th line: how to store data
3: Query Data
• Select records whose timestamp is less than or equal to the provided value

OK
user1	Funny Story	1343182026191
user2	Cool Deal	1343182133839
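The query statement itself did not survive extraction; a minimal sketch that would produce output like the one above, assuming the provided cutoff is 1343182133839 (the second record's timestamp):

hive> SELECT * FROM posts WHERE time <= 1343182133839;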
4: Drop the Table
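The drop statement is not in the captured transcript; a minimal sketch, assuming the posts table created above:

hive> DROP TABLE posts;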
hive> exit;
Loading Data
• Several options to start using data in Hive
– Load data from HDFS location
hive> LOAD DATA INPATH '/training/hive/user-posts.txt'
> OVERWRITE INTO TABLE posts;
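Only the HDFS-path variant survived extraction; two other common options, sketched here as assumptions rather than the slide's own list (the posts_staging table is hypothetical):

hive> LOAD DATA LOCAL INPATH 'data/user-posts.txt'
    > OVERWRITE INTO TABLE posts;

hive> INSERT OVERWRITE TABLE posts
    > SELECT user, post, time FROM posts_staging;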
Schema Violations
• What would happen if we try to insert data that
does not comply with the pre-defined schema?
Schema Violations
hive> LOAD DATA LOCAL INPATH
> 'data/user-posts-inconsistentFormat.txt'
> OVERWRITE INTO TABLE posts;
OK
Time taken: 0.612 seconds
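The load reports OK because Hive does not validate data against the schema at load time; mismatches only surface at query time, where fields that cannot be parsed as the declared type come back as NULL. A query sketch for spotting such rows (an assumption, not from the slides):

hive> SELECT * FROM posts WHERE time IS NULL;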
Partitions
• To increase performance Hive has the
capability to partition data
– The values of partitioned column divide a table into
segments
– Entire partitions can be ignored at query time
– Similar to relational databases’ indexes but not as
granular
• Partitions have to be properly created by users
– When inserting data, users must specify a partition
• At query time, whenever appropriate,
Hive will automatically filter out partitions
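The slide defining the partitioned table is missing from this extract; a minimal sketch, assuming a posts table partitioned by a country column (consistent with the user-posts-US.txt example below):

hive> CREATE TABLE posts (user STRING, post STRING, time BIGINT)
    > PARTITIONED BY (country STRING)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;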
Load Data Into Partitioned Table
hive> LOAD DATA LOCAL INPATH 'data/user-posts-US.txt'
> OVERWRITE INTO TABLE posts;
FAILED: Error in semantic analysis: Need to specify partition
columns because the destination table is partitioned
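The corrected statement is not shown here; a sketch, assuming the country partition column from the sketch above:

hive> LOAD DATA LOCAL INPATH 'data/user-posts-US.txt'
    > OVERWRITE INTO TABLE posts PARTITION (country='US');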
Partitioned Table
• Partitions are physically stored under
separate directories
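An abbreviated listing sketch, assuming the default warehouse location and the country partition used above (paths and partition values are placeholders, not from the slides):

$ hdfs dfs -ls /user/hive/warehouse/posts
.../posts/country=AUSTRALIA
.../posts/country=US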
Bucketing
• Mechanism to query and examine random
samples of data
• Break data into a set of buckets based on a hash function of a "bucket column"
– Capability to execute queries on a sub-set of random data
• Hive doesn't automatically enforce bucketing
– The user is required to specify the number of buckets by setting the number of reducers (see the sketch below)
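A bucketing sketch reusing the posts schema; the table name, bucket count and bucketing column are assumptions, not from the slides:

hive> CREATE TABLE posts_bucketed (user STRING, post STRING, time BIGINT)
    > CLUSTERED BY (user) INTO 4 BUCKETS
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;

hive> set mapred.reduce.tasks = 4;
hive> INSERT OVERWRITE TABLE posts_bucketed
    > SELECT user, post, time FROM posts;

hive> SELECT * FROM posts_bucketed TABLESAMPLE(BUCKET 1 OUT OF 4 ON user) s;

The set statement matches the number of reducers to the number of buckets; TABLESAMPLE then reads a single bucket, roughly a quarter of the data.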
Joins
• Joins in Hive are trivial
• Supports outer joins
– left, right and full joins
• Can join multiple tables
• Default Join is Inner Join
– Rows are joined where the keys match
– Rows that do not have matches are not included in the
result
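A sketch of the default (inner) join, reusing the posts and likes tables from the outer-join example below:

SELECT p.*, l.*
FROM posts p JOIN likes l ON (p.user = l.user)
limit 10;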
Outer Join
• Rows which will not join with the ‘other’ table are still
included in the result
Left Outer
– Rows from the first table are included whether they have a match or not. Columns from the unmatched (second) table are set to null.
Right Outer
– The opposite of Left Outer Join: rows from the second table are included no matter what. Columns from the unmatched (first) table are set to null.
Full Outer
– Rows from both sides are included. For unmatched rows the columns from the 'other' table are set to null.
Outer Join Examples
SELECT p.*, l.*
FROM posts p LEFT OUTER JOIN likes l ON (p.user = l.user)
limit 10;
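Only the left outer variant is shown; sketches of the other two, using the same tables:

SELECT p.*, l.*
FROM posts p RIGHT OUTER JOIN likes l ON (p.user = l.user)
limit 10;

SELECT p.*, l.*
FROM posts p FULL OUTER JOIN likes l ON (p.user = l.user)
limit 10;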
Resources
• http://hive.apache.org/
• Hive Wiki
– https://cwiki.apache.org/confluence/display/Hive/Home
Programming Hive
Edward Capriolo, Dean Wampler, Jason Rutherglen
O'Reilly Media; 1st edition (October 3, 2012)