Chapter 3 Hive - Distributed Data Warehouse
Foreword
The Apache Hive data warehouse software helps read, write, and manage
large data sets that reside in distributed storage by using SQL. Structures
can be projected onto stored data. The command line tool and JDBC driver
are provided to connect users to Hive.
1 Huawei Confidential
Objectives
Contents
1. Hive Overview
Introduction to Hive
Hive is a data warehouse tool that runs on Hadoop and supports petabyte-scale
distributed data query and management.
Hive features:
Supports flexible extraction, transformation, and loading (ETL)
Supports multiple computing engines, such as Tez and Spark
Supports direct access to HDFS files and HBase
Easy to use and easy to program
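As a minimal sketch of how a computing engine might be selected for one session: the property hive.execution.engine is a standard Hive setting, but which values actually work depends on what is installed and configured on the cluster, so treat the lines below as an assumption about the deployment.

```sql
-- Show the engine currently in effect for this session.
SET hive.execution.engine;

-- Switch this session to Tez (assumes Tez is installed on the cluster).
SET hive.execution.engine=tez;

-- Alternatively, run on Spark (assumes Hive on Spark is configured).
-- SET hive.execution.engine=spark;
```

The setting applies only to the current session; a cluster-wide default would normally be set in the Hive configuration instead.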
Application Scenarios of Hive
Data warehouse:
Data extraction
Data loading
Data transformation
Comparison Between Hive and Traditional Data Warehouses (1)
Storage
Hive: HDFS is used to store data. Theoretically, infinite expansion is possible.
Conventional data warehouse: Data is stored in clusters that have an upper capacity limit. As capacity grows, computing speed drops sharply, so conventional data warehouses suit only commercial applications with small data volumes.
Usage Method
Hive: HQL (SQL-like)
Conventional data warehouse: SQL
Flexibility
Hive: Metadata storage is independent of data storage, decoupling metadata and data.
Conventional data warehouse: Low flexibility. Data can be used for limited purposes.
Comparison Between Hive and Traditional Data Warehouses (2)
Advantages of Hive
1. High Reliability and Fault Tolerance
Cluster deployment of HiveServer
Double MetaStores
Timeout retry mechanism
2. SQL-like
SQL-like syntax
Large number of built-in functions
3. Scalability
User-defined storage format
User-defined function
4. Multiple APIs
Beeline
JDBC
Thrift
ODBC
Hive Architecture
Hive components:
Client interfaces: JDBC, ODBC, Web Interface
Thrift Server
Driver (Compiler, Optimizer, Executor)
MetaStore
Hive Running Process
1. The client submits the HQL command.
2. Tez (the default engine) executes the query.
3. YARN allocates resources to Hive applications in the cluster and enables authorization for Hive jobs in the YARN queue.
4. Hive updates data in HDFS or Hive
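To inspect what Hive plans to run before a query reaches the execution engine, the EXPLAIN statement can be used. The sketch below assumes the invites table from the DDL examples later in this chapter already exists; the exact plan output depends on the Hive version and the engine in use, so none is shown here.

```sql
-- Print the execution plan without running the query.
EXPLAIN
SELECT a.bar, count(*)
FROM invites a
WHERE a.ds = '2008-08-15'
GROUP BY a.bar;
```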
Data Storage Model of Hive
Database > Table > Partition
Partition and Bucket
Partition: Data tables can be partitioned based on the value of a certain field.
Each partition is a directory.
The number of partitions is not fixed.
Partitions or buckets can be created in a partition.
Bucket: Data can be stored in different buckets.
Each bucket is a file.
The number of buckets is specified when creating a table. The buckets can be sorted.
Data is hashed based on the value of a field and then stored in a bucket.
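The partition and bucket rules above can be sketched in a single table definition. The table name page_views, its columns, the bucket count of 8, and the ORC storage format are all hypothetical choices made for illustration.

```sql
-- Hypothetical table: partitioned by date, bucketed and sorted by user_id.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (ds STRING)                 -- each ds value becomes a directory
CLUSTERED BY (user_id)                     -- rows are hashed on user_id
SORTED BY (user_id ASC) INTO 8 BUCKETS     -- bucket count is fixed at creation
STORED AS ORC;

-- Partitions are added as data arrives, so their number is not fixed.
ALTER TABLE page_views ADD PARTITION (ds='2008-08-15');
SHOW PARTITIONS page_views;
```

Note that the partition column ds is declared only in PARTITIONED BY, not in the regular column list.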
Managed Table and External Table
Hive can create managed tables and external tables.
By default, a managed table is created, and Hive moves data to the data warehouse directory.
When an external table is created, Hive accesses data outside the warehouse directory.
If all processing is performed by Hive, you are advised to use managed tables.
If you want to use Hive and other tools to process the same data set, you are advised to use
external tables.
CREATE/LOAD
Managed table: Data is moved to the warehouse directory.
External table: The data location is not moved.
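A minimal sketch of the two table types follows. The table names and the HDFS path /data/shared/logs are hypothetical, and the drop behavior described in the comments is the default one.

```sql
-- Managed table (the default): Hive keeps the data under its warehouse directory.
CREATE TABLE logs_managed (id INT, msg STRING);

-- External table: Hive records only metadata; the files stay where they are.
-- '/data/shared/logs' is a hypothetical HDFS path.
CREATE EXTERNAL TABLE logs_external (id INT, msg STRING)
LOCATION '/data/shared/logs';

-- Dropping a managed table deletes both its metadata and its data.
DROP TABLE logs_managed;

-- Dropping an external table removes only the metadata; the files remain.
DROP TABLE logs_external;
```

This difference on DROP is why external tables suit data sets that other tools also read.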
Functions Supported by Hive
Built-in Hive Functions
Mathematical functions, such as round(), floor(), abs(), and rand().
Date functions, such as to_date(), month(), and day().
String functions, such as trim(), length(), and substr().
User-Defined Function (UDF)
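The built-in functions listed above can be tried directly from the Hive prompt, as in the sketch below. It assumes a Hive version that allows SELECT without a FROM clause. The UDF part is purely illustrative: the jar path and class name are hypothetical, and pokes is the table created in the DDL examples later in this chapter.

```sql
-- Mathematical functions
SELECT round(3.14159, 2), floor(3.7), abs(-5), rand();

-- Date functions
SELECT to_date('2008-08-15 10:30:00'), month('2008-08-15'), day('2008-08-15');

-- String functions
SELECT trim('  hive  '), length('hive'), substr('warehouse', 1, 4);

-- Registering and calling a UDF (jar path and class name are hypothetical).
ADD JAR /tmp/my_udfs.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.Lower';
SELECT my_lower(bar) FROM pokes;
```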
Hive SQL Overview
DDL-Data Definition Language:
Creates, modifies, and deletes tables, partitions, and data types.
DML-Data Manipulation Language:
Imports and exports data.
DQL-Data Query Language:
Performs simple queries.
Performs complex queries such as GROUP BY, ORDER BY, and JOIN.
DDL Operations
-- Create a table:
hive> CREATE TABLE pokes (foo INT, bar STRING);
hive> CREATE EXTERNAL TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
-- Describe a table:
hive> DESCRIBE invites;
-- Modify a table:
hive> ALTER TABLE events RENAME TO 3koobecaf;
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
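A few further DDL statements round out the operations above: listing tables, managing partitions, and deleting a table. This is a sketch that reuses the invites table; scratch_table is a hypothetical name used so that the tables needed by later examples are left in place.

```sql
-- List all tables, or only those whose names match a pattern.
SHOW TABLES;
SHOW TABLES '.*s';

-- Add and then drop a partition on the partitioned table.
ALTER TABLE invites ADD PARTITION (ds='2008-08-16');
ALTER TABLE invites DROP PARTITION (ds='2008-08-16');

-- Delete a table (hypothetical name; IF EXISTS avoids an error if it is absent).
DROP TABLE IF EXISTS scratch_table;
```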
DML Operations
-- Load data to a table:
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
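The statements above load from the local file system. As a sketch of the other two directions, the following loads a file that is already in HDFS and exports query results to a local directory. Both paths are hypothetical.

```sql
-- Load from HDFS (no LOCAL keyword): the file is moved into the table's directory.
-- '/user/myname/kv2.txt' is a hypothetical HDFS path.
LOAD DATA INPATH '/user/myname/kv2.txt'
OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');

-- Export query results to a directory on the local file system (hypothetical path).
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out'
SELECT a.* FROM pokes a;
```

Because loading from HDFS moves the source file rather than copying it, the file no longer exists at its original path afterwards.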
DQL Operations (1)
--SELECTS and FILTERS:
hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';
--GROUP BY:
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
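The examples above write their results to a directory or a table. For comparison, here is a sketch of queries whose results are simply returned to the client, including an ORDER BY. It assumes the invites table has been created and loaded as shown earlier; the alias cnt and the limit of 10 are arbitrary.

```sql
-- Simple filter on the partition column; results are printed to the console.
SELECT a.foo FROM invites a WHERE a.ds = '2008-08-15';

-- Aggregate, then sort the groups by their row count.
SELECT a.bar, count(*) AS cnt
FROM invites a
WHERE a.foo > 0
GROUP BY a.bar
ORDER BY cnt DESC
LIMIT 10;
```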
DQL Operations (2)
--MULTITABLE INSERT:
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200;
--JOIN:
hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;
--STREAMING:
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING
'/bin/cat' WHERE a.ds > '2008-08-09';
Summary
Quiz
Recommendations