Module 06 Hive - Distributed Data Warehouse
Hive
www.huawei.com
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 2
Objectives
Upon completion of this course, you will understand:
Hive application scenarios and basic principles
Enhanced features of FusionInsight Hive
Common Hive SQL statements
Contents
1. Introduction to Hive
2. Hive Functions and Architecture
Hive Overview
Hive is a data warehouse tool that runs on Hadoop and supports
petabyte-scale distributed data query and management.
Hive provides the following functions:
Flexible ETL (extract/transform/load)
Application Scenarios of Hive
Data mining: user behavior analysis, interest analysis, partition demonstration
Data warehouse: data extraction, data loading, data transformation
Position of Hive in FusionInsight
[Figure: FusionInsight architecture; the application service layer exposes OpenAPI/SDK and REST/SNMP/Syslog interfaces]
Hive is a data warehouse tool that uses HiveQL, an SQL-like language, to query data.
All Hive data is stored in HDFS.
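As a minimal sketch of both points, the statements below run a HiveQL query and then display where the table's data resides. The table demo_employee and its columns are hypothetical and exist only for illustration.

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE IF NOT EXISTS demo_employee (
  id INT,
  name STRING,
  department STRING
);

-- A HiveQL query reads like standard SQL.
SELECT department, count(*) AS headcount
FROM demo_employee
GROUP BY department;

-- The Location field in the output shows the table's HDFS directory.
DESCRIBE FORMATTED demo_employee;
```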
Comparison Between Hive and Traditional Data Warehouses (1)
Storage
  Hive: HDFS. Theoretically, it is infinitely scalable.
  Traditional data warehouse: A cluster with limited storage capacity. The
  cluster's calculation speed decreases dramatically when the storage capacity
  increases. It is applicable only to commercial applications that involve a
  small amount of data, and cannot handle an extra-large amount of data.
Comparison Between Hive and Traditional Data Warehouses (2)
Index
  Hive: Low efficiency. It has not met expectations currently.
  Traditional data warehouse: Efficient.
Reliability
  Hive: Data is stored in HDFS, which features high reliability and high fault
  tolerance.
  Traditional data warehouse: It has relatively low reliability. When a query
  attempt fails, the query must be restarted. Data fault tolerance relies on
  hardware RAID.
Advantages of Hive
Disadvantages of Hive
Hive Architecture
User interfaces: Command Line Interface, Web Interface, Thrift Server (JDBC, ODBC)
Driver (Compiler, Optimizer, Executor)
Metastore
Hive Architecture in FusionInsight HD
Hive contains HiveServer, MetaStore, and WebHCat.
HiveServer: receives requests from
[Figure: HiveServer(s) and WebHCat(s)]
Architecture of WebHCat
WebHCat provides a REST interface that lets users perform the following
operations over the secure HTTPS protocol:
Hive DDL operations
Data Storage Model of Hive
Hierarchy: a database contains tables; a table can be divided into partitions;
a table or partition can be divided into buckets.
Data Storage Model of Hive - Partition and Bucket
Partition: A table can be divided into partitions based on the value of a
field.
Each partition is a directory.
Bucket: The number of buckets is set when a table is created, and data can be
sorted within each bucket.
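The following sketch shows how partitions and buckets are declared. The table page_views_demo, its columns, and the bucket count of 8 are hypothetical choices for illustration.

```sql
-- Hypothetical table: one partition per view_date value,
-- rows hashed on user_id into 8 buckets and sorted inside each bucket.
CREATE TABLE IF NOT EXISTS page_views_demo (
  user_id BIGINT,
  url STRING
)
PARTITIONED BY (view_date STRING)
CLUSTERED BY (user_id) SORTED BY (user_id ASC) INTO 8 BUCKETS
STORED AS TEXTFILE;

-- Each partition maps to its own directory, for example
-- .../page_views_demo/view_date=2018-01-01/
ALTER TABLE page_views_demo ADD IF NOT EXISTS PARTITION (view_date = '2018-01-01');

-- Filtering on the partition column lets Hive read only the matching directory.
SELECT user_id, url
FROM page_views_demo
WHERE view_date = '2018-01-01';
```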
Data Storage Model of Hive - Managed Table and External Table
Hive supports two types of tables, managed tables and external tables:
Managed tables are created by default and are managed by Hive. In this case,
Hive migrates data to its data warehouse directories.
When an external table is created, Hive accesses data at a location outside
its data warehouse directories.
Use external tables when Hive and other tools share the same data set for different
processing.
CREATE/LOAD
  Managed table: Data is migrated to data warehouse directories.
  External table: The location of external data is specified when a table is created.
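The difference can be sketched as follows. The table names and HDFS paths are hypothetical; substitute paths that exist in your cluster.

```sql
-- Managed table: Hive owns the data under its warehouse directory.
CREATE TABLE IF NOT EXISTS logs_managed_demo (id INT, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Without LOCAL, LOAD DATA moves the HDFS file into the warehouse directory.
-- The source path is a placeholder.
LOAD DATA INPATH '/tmp/hypothetical/logs.csv' INTO TABLE logs_managed_demo;

-- External table: the data stays at the location given at creation time.
CREATE EXTERNAL TABLE IF NOT EXISTS logs_external_demo (id INT, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/hypothetical/shared_logs';

-- Dropping a managed table deletes its data together with its metadata;
-- dropping an external table deletes only the metadata.
DROP TABLE IF EXISTS logs_external_demo;
```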
Functions of Hive
Built-in functions in Hive:
Mathematical functions, such as round(), floor(), abs(), and rand()
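A short illustration of these functions; the literal arguments are arbitrary.

```sql
SELECT
  round(3.14159, 2),  -- rounds to two decimal places: 3.14
  floor(3.7),         -- largest integer not greater than 3.7: 3
  abs(-5),            -- absolute value: 5
  rand();             -- pseudo-random value between 0 and 1, differs per call
```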
Enhanced Features of Hive - Colocation Overview
Colocation: storing associated data or data on which associated operations
are performed on the same storage node.
[Figure: one NameNode (NN #1) and six DataNodes (DN #1 to DN #6) storing blocks A, B, C, and D]
Enhanced Features of Hive - Using Colocation
Step 1: Use an HDFS interface to create groupid and locatorid.
Enhanced Features of Hive - Encrypting Columns
Step 1: When creating a table, specify the columns to be encrypted
and the encryption algorithm.
create table encode_test (id INT, name STRING, phone STRING, address STRING)
row format serde "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
WITH SERDEPROPERTIES (
  "column.encode.columns"="phone,address",
  "column.encode.classname"="org.apache.hadoop.hive.serde2.AESRewriter"
);
Step 2: Use an INSERT statement to import data into the table whose columns
are encrypted.
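Step 2 might look like the sketch below. It assumes the encode_test table from Step 1 already exists; staging_contacts is a hypothetical source table with matching columns.

```sql
-- staging_contacts is a hypothetical, unencrypted source table.
-- encode_test is the table created in Step 1.
INSERT INTO TABLE encode_test
SELECT id, name, phone, address
FROM staging_contacts;
```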
Enhanced Features of Hive - Deleting HBase Records in Batches
Overview:
In FusionInsight HD, Hive allows deletion of a single record from an HBase
table. Hive can use specific syntax to delete one or more data records that
meet criteria from its HBase tables.
Usage:
To delete some data from an HBase table, run the following HQL
statement:
Enhanced Features of Hive - Controlling Traffic
By using the traffic control feature, you can control:
Total number of established connections
Number of established connections of each user
Number of connections established within a unit period
Enhanced Features of Hive - Specifying Row Delimiters
Step 1: Set inputFormat and outputFormat when creating a table.
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS]
[db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[ROW FORMAT row_format]
STORED AS
inputformat
"org.apache.hadoop.hive.contrib.fileformat.SpecifiedDelimiterInputFormat"
outputformat
"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
Hive SQL Overview
DDL (data definition language)
Table creation, modification, and deletion; partitions; and data types
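The DDL operations listed above can be sketched on a throwaway table. The table sales_demo and its columns are hypothetical.

```sql
-- Create a partitioned table (hypothetical name and columns).
CREATE TABLE IF NOT EXISTS sales_demo (
  order_id BIGINT,
  amount DOUBLE
)
PARTITIONED BY (sale_date STRING);

-- Modify the table by adding a column.
ALTER TABLE sales_demo ADD COLUMNS (region STRING);

-- Add, list, and drop a partition.
ALTER TABLE sales_demo ADD IF NOT EXISTS PARTITION (sale_date = '2018-01-01');
SHOW PARTITIONS sales_demo;
ALTER TABLE sales_demo DROP IF EXISTS PARTITION (sale_date = '2018-01-01');

-- Delete the table.
DROP TABLE IF EXISTS sales_demo;
```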
Hive Basic Operations (1)
Data format example: 1,huawei,1000.0
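A table that matches this comma-separated format could be declared as sketched below. The column names and comments are assumptions inferred from the sample row; only the table name example.employee appears elsewhere in this module.

```sql
CREATE DATABASE IF NOT EXISTS example;

-- Column names are hypothetical; the types follow the sample row 1,huawei,1000.0
CREATE TABLE IF NOT EXISTS example.employee (
  id INT COMMENT 'employee id',
  company STRING COMMENT 'company name',
  money FLOAT COMMENT 'amount'
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```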
Hive Basic Operations (2)
--Modify a column (CHANGE takes the old name, the new name, and then the type)
ALTER TABLE employee1 CHANGE money money string COMMENT 'changed by alter' AFTER dateincompany;
--Add a column
ALTER TABLE employee1 ADD columns(column1 string);
--Describe table
DESC table_a;
Hive Basic Operations (3)
--Load data from the local file system
LOAD DATA LOCAL INPATH 'employee.txt' OVERWRITE INTO TABLE example.employee;
--Insert data
INSERT INTO TABLE company.person
SELECT id, name, age, birthday FROM company.person_tmp
WHERE century = '23' AND year = '2010';
Hive Basic Operations (4)
--WHERE
SELECT id, name FROM employee WHERE salary >= 10000;
--GROUP BY
SELECT department, avg(salary) FROM employee GROUP BY department;
--UNION ALL (date is a reserved keyword in newer Hive versions, so it is quoted with backticks)
SELECT id, salary, `date` FROM employee_a UNION ALL
SELECT id, salary, `date` FROM employee_b;
--JOIN
SELECT a.salary, b.address FROM employee a
JOIN employee_info b ON a.name = b.name;
--Subquery (the subquery must also select name, because the join condition uses b.name)
SELECT a.salary, b.address FROM employee a
JOIN (SELECT name, address FROM employee_info WHERE province = 'zhejiang') b
ON a.name = b.name;
Summary
This module describes the following information about Hive:
basic principles, application scenarios, enhanced features in
FusionInsight and common Hive SQL statements.
Quiz
1. Which of the following scenarios does Hive apply to?
Quiz
1. Which of the following statements about Hive SQL operations are correct?
A. The keyword external is used to create an external table and the keyword
internal is used to create a common table.
C. When data is uploaded to Hive, the data source must be one HDFS path.
More Information
Training materials:
http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796
Exam outline:
http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797
Mock exam:
http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798
Authentication process:
http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40