Hive Introduction and Tables
CMSC 491
Hadoop-Based Distributed Computing
Spring 2016
Adam Shook
What Is Hive?
• Originally developed by Facebook, now a top-level Apache project
• A data warehousing infrastructure based on Hadoop
• Immediately makes data on a cluster available to
non-Java programmers via SQL-like queries
• Built on HiveQL (HQL), a SQL-like query language
• Interprets HiveQL and generates MapReduce jobs that
run on the cluster
• Enables easy data summarization, ad-hoc reporting
and querying, and analysis of large volumes of data
What Hive Is Not
• Hive, like Hadoop, is designed for batch
processing of large datasets
• Not an OLTP or real-time system
• Latency and throughput are both high
compared to a traditional RDBMS
– Even when dealing with relatively small data
(<100 MB)
Data Hierarchy
• Hive is organised hierarchically into:
– Databases: namespaces that separate tables and
other objects
– Tables: homogeneous units of data with the same
schema
• Analogous to tables in an RDBMS
– Partitions: determine how the data is stored
• Allow efficient access to subsets of the data
– Buckets/clusters
• For subsampling within a partition
• Join optimization
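As a minimal sketch of the top two levels of this hierarchy, the statements below create a database and a table inside it. The names web_logs and raw_hits are made up for illustration.

```sql
-- Hypothetical names: web_logs (database) and raw_hits (table).
CREATE DATABASE IF NOT EXISTS web_logs;
USE web_logs;

CREATE TABLE raw_hits (hit_time INT, url STRING);

-- The table can also be addressed by its fully qualified name.
SELECT COUNT(*) FROM web_logs.raw_hits;
```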
HiveQL
• HiveQL / HQL provides the basic SQL-like
operations:
– Select columns using SELECT
– Filter rows using WHERE
– JOIN between tables
– Evaluate aggregates using GROUP BY
– Store query results into another table
– Download results to a local directory (i.e., export
from HDFS)
– Manage tables and queries with CREATE, DROP, and
ALTER
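The operations above can be sketched against the page_view table defined later in these slides. The page_counts table and the /tmp/page_counts directory are assumptions made for the example.

```sql
-- Select columns, filter rows, and aggregate with GROUP BY.
SELECT page_url, COUNT(*) AS views
FROM page_view
WHERE referrer_url IS NOT NULL
GROUP BY page_url;

-- Store query results into another table
-- (page_counts is hypothetical and must already exist).
INSERT OVERWRITE TABLE page_counts
SELECT page_url, COUNT(*) FROM page_view GROUP BY page_url;

-- Write query results to a local directory (hypothetical path).
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/page_counts'
SELECT page_url, COUNT(*) FROM page_view GROUP BY page_url;
```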
Primitive Data Types
Type                             Comments
TINYINT, SMALLINT, INT, BIGINT   1, 2, 4 and 8-byte integers
BOOLEAN                          TRUE/FALSE
FLOAT, DOUBLE                    Single and double precision real numbers
STRING                           Character string
TIMESTAMP                        Unix-epoch offset or datetime string
DECIMAL                          Arbitrary-precision decimal
BINARY                           Opaque; ignore these bytes
Complex Data Types
Type     Comments
STRUCT   A collection of named fields
         If S is of type STRUCT {a INT, b INT}:
         S.a returns field a
MAP      A collection of key-value pairs
         If M is a map from 'group' to GID:
         M['group'] returns value of GID
ARRAY    Indexed list
         If A is an array of elements ['a','b','c']:
         A[0] returns 'a'
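The three access forms can be combined in one query. This sketch uses the employees table defined later in these slides; the map key 'Federal Taxes' is an assumption about what the data contains.

```sql
SELECT name,
       subordinates[0]             AS first_subordinate,  -- ARRAY index
       deductions['Federal Taxes'] AS federal_tax,        -- MAP key (hypothetical)
       address.city                AS city                -- STRUCT field
FROM employees;
```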
HiveQL Limitations
• HQL only supports equi-joins, outer joins, and left
semi-joins
• Because Hive is only a front end for MapReduce, complex
queries can be hard to optimise
• Early releases were missing large parts of the full SQL
specification (later releases added several of these):
– HAVING clause in SELECT
– Correlated sub-queries
– Sub-queries outside FROM clauses
– Updatable or materialized views
– Stored procedures
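On a Hive version without HAVING, a filter on an aggregate can usually be expressed with a sub-query in the FROM clause. A sketch against the page_view table defined later in these slides; the threshold of 100 is arbitrary.

```sql
-- Equivalent of: ... GROUP BY userid HAVING COUNT(*) > 100
SELECT t.userid, t.views
FROM (
  SELECT userid, COUNT(*) AS views
  FROM page_view
  GROUP BY userid
) t
WHERE t.views > 100;
```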
Hive Metastore
• Stores Hive metadata
• Default metastore database uses Apache Derby
• Various configurations:
– Embedded (in-process metastore, in-process
database)
• Mainly for unit tests
– Local (in-process metastore, out-of-process database)
• Each Hive client connects to the metastore directly
– Remote (out-of-process metastore, out-of-process
database)
• Each Hive client connects to a metastore server, which
connects to the metadata database itself
Hive Warehouse
• Hive tables are stored in the Hive
“warehouse”
– Default HDFS location: /user/hive/warehouse
• Tables are stored as sub-directories in the
warehouse directory
• Partitions are subdirectories of tables
• External tables are supported in Hive
• The actual data is stored in flat files
Hive Schemas
• Hive is schema-on-read
– Schema is only enforced when the data is read (at
query time)
– Allows greater flexibility: same data can be read
using multiple schemas
• Contrast with an RDBMS, which is schema-on-
write
– Schema is enforced when the data is loaded
– Speeds up queries at the expense of load times
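One way to see schema-on-read is to declare a second external table over files that another table already describes. The sketch below reuses the /user/staging/page_view location from the External Table slide; the table name page_view_raw and its all-STRING schema are assumptions for illustration.

```sql
-- Hypothetical second schema over the same files: only the first
-- three fields are declared, and each is read as a STRING.
CREATE EXTERNAL TABLE page_view_raw
  (viewTime STRING,
   userid STRING,
   page_url STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/staging/page_view';
```

Creating the table does not validate the files. With delimited text, a field that does not fit its declared type typically shows up as NULL at query time.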
Create Table Syntax
CREATE TABLE table_name
(col1 data_type,
col2 data_type,
col3 data_type,
col4 data_type)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS format_type;
Simple Table
CREATE TABLE page_view
(viewTime INT,
userid BIGINT,
page_url STRING,
referrer_url STRING,
ip STRING COMMENT 'IP Address of the User' )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
More Complex Table
CREATE TABLE employees
(name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING,
city:STRING,
state:STRING,
zip:INT>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
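The statement above sets only the field delimiter, so array items and map keys fall back to Hive's default control characters. A variant sketch that spells the delimiters out; the table name employees_delim and the choice of ',' and ':' are assumptions about the data file.

```sql
CREATE TABLE employees_delim
  (name STRING,
   subordinates ARRAY<STRING>,
   deductions MAP<STRING, FLOAT>)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'            -- between columns
  COLLECTION ITEMS TERMINATED BY ','   -- between array items and map entries
  MAP KEYS TERMINATED BY ':'           -- between a map key and its value
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
```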
External Table
CREATE EXTERNAL TABLE page_view_stg
(viewTime INT,
userid BIGINT,
page_url STRING,
referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/staging/page_view';
More About Tables
• CREATE TABLE
– LOAD: file moved into Hive’s data warehouse
directory
– DROP: both metadata and data deleted
• CREATE EXTERNAL TABLE
– LOAD: no files moved
– DROP: only metadata deleted
– Use this when sharing with other Hadoop
applications, or when you want to use multiple
schemas on the same data
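The difference in DROP behaviour can be sketched with the two tables from the previous slides:

```sql
-- Managed table: removes the metadata and the files
-- under the warehouse directory.
DROP TABLE page_view;

-- External table: removes only the metadata; the files in
-- /user/staging/page_view stay where they are.
DROP TABLE page_view_stg;
```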
Partitioning
• Can make some queries faster
• Divide data based on partition column
• Use PARTITIONED BY clause when creating table
• Use PARTITION clause when loading data
• SHOW PARTITIONS will show a table’s
partitions
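A minimal sketch of these steps, assuming a hypothetical table page_view_part partitioned by a date string ds and a hypothetical HDFS input path. The DDL keyword in Hive is PARTITIONED BY.

```sql
-- The partition column ds is declared separately from the data columns.
CREATE TABLE page_view_part
  (viewTime INT,
   userid BIGINT,
   page_url STRING)
PARTITIONED BY (ds STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- The PARTITION clause names the partition the files go into.
LOAD DATA INPATH '/data/incoming/page_view/2008-10-31'
INTO TABLE page_view_part PARTITION (ds='2008-10-31');

SHOW PARTITIONS page_view_part;

-- A filter on the partition column lets Hive read only that subdirectory.
SELECT COUNT(*) FROM page_view_part WHERE ds = '2008-10-31';
```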
Bucketing
• Can speed up queries that involve sampling
the data
– Sampling works without bucketing, but Hive has
to scan the entire dataset
• Use CLUSTERED BY when creating table
– For sorted buckets, add SORTED BY
• To query a sample of your data, use
TABLESAMPLE
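A sketch of a bucketed table and a sampled query. The table name page_view_bucketed and the bucket count of 32 are assumptions, and the hive.enforce.bucketing setting applies to older Hive releases.

```sql
CREATE TABLE page_view_bucketed
  (viewTime INT,
   userid BIGINT,
   page_url STRING)
CLUSTERED BY (userid) SORTED BY (viewTime) INTO 32 BUCKETS;

-- Older Hive releases need this so inserted rows are actually bucketed.
SET hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE page_view_bucketed
SELECT viewTime, userid, page_url FROM page_view;

-- Read one of the 32 buckets instead of scanning the whole table.
SELECT * FROM page_view_bucketed
TABLESAMPLE (BUCKET 1 OUT OF 32 ON userid);
```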
Browsing Tables And Partitions
Command                                          Comments
SHOW TABLES;                                     Show all the tables in the database
SHOW TABLES 'page.*';                            Show tables matching the specification (uses regex syntax)
SHOW PARTITIONS page_view;                       Show the partitions of the page_view table
DESCRIBE page_view;                              List columns of the table
DESCRIBE EXTENDED page_view;                     More information on columns (useful only for debugging)
DESCRIBE page_view PARTITION (ds='2008-10-31');  List information about a partition
Loading Data
• Use LOAD DATA to load data from a file or
directory
– Will read from HDFS unless LOCAL keyword is
specified
– Will append data unless OVERWRITE specified
– PARTITION required if destination table is partitioned
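A sketch of the two common forms of LOAD DATA, using the page_view table from the earlier slides. Both file paths are hypothetical.

```sql
-- LOCAL: copy a file from the local filesystem.
-- OVERWRITE: replace the table's current contents.
LOAD DATA LOCAL INPATH '/tmp/page_view.tsv'
OVERWRITE INTO TABLE page_view;

-- No LOCAL: the file is already in HDFS and is moved into the table.
-- No OVERWRITE: the data is appended.
LOAD DATA INPATH '/data/incoming/page_view.tsv'
INTO TABLE page_view;
```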
Advanced Hive Operations
• JOIN
– Do not specify join conditions in the WHERE clause
• Hive does not know how to optimise such queries
• Will compute a full Cartesian product before filtering it
– If multiple tables are joined, put the biggest table last:
the reducer streams the last table and buffers the others
– Use left semi-joins to take the place of IN/EXISTS
SELECT a.key, a.val FROM a LEFT SEMI JOIN b ON a.key = b.key;
• Join Example
SELECT
a.ymd, a.price_close, b.price_close
FROM stocks a
JOIN stocks b ON a.ymd = b.ymd
WHERE a.symbol = 'AAPL' AND
b.symbol = 'IBM' AND
a.ymd > '2010-01-01';
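The advice to put the biggest table last can be sketched as follows. The dividends table and its columns are hypothetical, and it is assumed to be much smaller than stocks.

```sql
-- stocks is assumed to be the larger table, so it goes last and is
-- streamed through the reducer while dividends is buffered.
SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM dividends d
JOIN stocks s ON d.ymd = s.ymd AND d.symbol = s.symbol
WHERE s.symbol = 'AAPL';

-- Alternatively, keep any table order and name the table to stream
-- with a hint.
SELECT /*+ STREAMTABLE(s) */ s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s
JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';
```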
Hive Stinger
• MPP-style execution of Hive queries
• Available since Hive 0.13
• No MapReduce
• We will talk about this more when we get to
SQL on Hadoop
References
• http://hive.apache.org