0% found this document useful (0 votes)
2 views

Module 06 Hive - Distributed Data Warehouse

The document outlines the technical principles of Hive, a data warehouse tool that operates on Hadoop and supports large-scale data management and querying. It highlights the enhanced features of FusionInsight HD's Hive, including customization options, improved reliability, and performance compared to traditional data warehouses. Additionally, it provides an overview of Hive's architecture, SQL operations, and various functionalities such as data partitioning, encryption, and traffic control.

Uploaded by

Lucas Oliveira
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
2 views

Module 06 Hive - Distributed Data Warehouse

The document outlines the technical principles of Hive, a data warehouse tool that operates on Hadoop and supports large-scale data management and querying. It highlights the enhanced features of FusionInsight HD's Hive, including customization options, improved reliability, and performance compared to traditional data warehouses. Additionally, it provides an overview of Hive's architecture, SQL operations, and various functionalities such as data partitioning, encryption, and traffic control.

Uploaded by

Lucas Oliveira
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 36

Technical Principles of

Hive

www.huawei.com

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved.


Foreword
 Based on Hive provided by the Hive Open Source community,
Hive in FusionInsight HD has various enterprise-level
customization features, such as Colocation table creation,
column encryption, and syntax enhancement. With these
features, FusionInsight HD outperforms the community version
in terms of reliability, tolerance, scalability, and performance.

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 2
Objectives
 Upon completion of this course, you will be able to know:
 Hive application scenarios and basic principles
 Enhanced features of FusionInsight Hive
 Common Hive SQL statements

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 3
Contents
1. Introduction to Hive
2. Hive Functions and Architecture

3. Basic Hive Operations

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 4
Hive Overview
 Hive is a data warehouse tool running on Hadoop and supports
PB-level distributed data query and management.
 Hive provides the following functions:
 Flexible ETL (extract/transform/load)

 Supporting computing engines, such as MapReduce, Tez, and


Spark

 Direct access to HDFS files and HBase

 Easy to use and program

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 5
Application Scenarios of Hive
 User behavior analysis
Data mining  Interest analysis
 Partition demonstration

Non-real-  Log analysis


time analysis  Text analysis

Data  Daily/Weekly click count


aggregation  Traffic statistics

 Data extraction
Data  Data loading
warehouse  Data transformation

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 6
Position of Hive in FusionInsight
Application service layer
OpenAPI/SDK REST/SNMP/Syslog

Data Information Knowledge Wisdom


DataFarm Porter Miner Farmer Manager
System
management
Hadoop API Plugin API
Service
governance
HIVE M/R Spark Storm Flink
Hadoop LibrA
YARN/ Zookeeper Security
management
HDFS/HBase

Hive is a data warehouse tool, which employs HiveQL (SQL-like) to query data.
All Hive data is stored in HDFS.

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 7
Comparison Between Hive and
Traditional Data Warehouses (1)
Hive Traditional Warehouse
Cluster, which is of limited storage capacity. The
cluster calculation speed decreases dramatically
HDFS. Theoretically, it is infinitely when the storage capacity increases. It is
Storage
scalable. applicable only to commercial applications that
involve a small amount of data, and cannot
handle an extra-large amount of data.

An algorithm with higher efficiency can be used


Execution
MapReduce/Tez/Spark to query data. More optimization measures can
engine
be taken to improve the efficiency.

Usage HQL (similar to SQL) SQL


Metadata and data are stored
Flexibility Low. Data is used for limited purposes.
separately for decoupling.

The calculation speed depends


on cluster size. Hive is scalable. It It is fast when there is a small amount of data.
Analysis
is more efficient than traditional Nevertheless, the speed decreases dramatically
speed
data warehouses when there is a when the amount of data becomes large.
large amount of data.

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 8
Comparison Between Hive and
Traditional Data Warehouses (2)
Hive Traditional Data Warehouses
Low efficiency. It has not met
Index expectations currently. Efficient.

An application model must be


It provides a set of well-developed report
Usability developed. This results in high
solutions to facilitate data analysis.
flexibility but low usability.

Data is stored in HDFS, which It has relatively low reliability. When a query
Reliability features high reliability and attempt fails, the query must be restarted.
high fault tolerance. Data fault tolerance relies on hardware RAID.

Environment It can be deployed using It requires high-performance commercial


dependence common computers. servers.

The data warehouses used for commercial


Price Open-source product.
purposes are expensive.

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 9
Advantages of Hive
Advantages of Hive

High reliability Multiple


SQL-like query Scalability
and tolerance interfaces
 HiveServer in  SQL-like syntax  User-defined  Beeline
cluster mode  Built-in functions storage format  JDBC
 Dual-MetaStore in large quantity  User defined  Thrift
 Query retry functions  Python
after timeout (UDF/UDAF/UDTF)  ODBC

1 2 3 4

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 10
Disadvantages of Hive

Disadvantages of Hive

High latency Not support Inapplicable to Not support


materialized OLTP storage process
views
Does not support  Does not  Does not
 MapReduce 

materialized views. support support storage


execution Data updating,

column-level process, but
engine by insertion, and
deletion cannot be data adding, supports logic
default
performed on views. updating, and processing
 High latency of
deletion. using UDF.
MapReduce

1 2 3 4

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 11
Contents
1. Introduction to Hive

2. Hive Functions and Architecture

3. Basic Hive Operations

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 12
Hive Architecture

Hive

JDBC ODBC

Web
Command Line Interface Thrift Server
Interface

Driver
Metastore
(Compiler,Optimizer,Executor)

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 13
Hive Architecture in FusionInsight HD
 Hive contains HiveServer, MetaStore,
and WebHcat.
 HiveServer: receives requests from Hiveserver (s) WebHcat (s)

clients, parses and executes HQL


commands, and returns query results. Metastore (s)
 MetaStore: provides metadata
services.
DBService/HDFS/YARN
 WebHcat: provides HTTP services,
such as metadata, Data Defined
Language (DDL) for external users.

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 14
Architecture of WebHCat
 WebHCat provides Rest interface for users to make the following operations
through safe HTTPS protocol:
 Hive DDL operations

 Running Hive HQL task

 Runing MapReduce task

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 15
Data Storage Model of Hive

Database

Table Table

Partition
Bucket

Bucket

Partition Skewed data Normal data


Bucket
Bucket

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 16
Data Storage Model of Hive - Partition
and Bucket
 Partition: A data table can be divided into partitions by using a field
value.
 Each partition is a directory.

 The number of partitions is configurable.

 A partition can be partitioned or bucketed.

 Bucket: Data can be stored in different buckets.


 Each bucket is a file.

 The bucket quantity is set when a table is created and data can be sorted
in the bucket.

 Data is stored in a bucket by the hash value of a field.

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 17
Data Storage Model of Hive - Managed
Table and External Table
 Hive can create managed table and external table:
 Managed tables are created by default and managed by Hive. In this case,
Hive migrates data to data warehouse directories.
 When external tables are created, Hive access data from locations outside data
warehouse directories.

 Use managed tables when Hive performs all operations.

 Use external tables when Hive and other tools share the same data set for different
processing.
Managed Table External Table
Data is migrated to data warehouse The location of external data is
CREATE/LOAD directories. specified when a table is created.

DROP Metadata and data are deleted. Only metadata is deleted.

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 18
Functions of Hive
 Built-in functions in Hive:
 Mathematical Function, such as round(), floor(), abs(), rand(), etc.

 Date Function, such as to_date(), month(), day(), etc.

 String Function, such as trim(), length(), substr(), etc.

 UDF (User- Defined Funcation)

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 19
Enhanced Features of Hive - Colocation
Overview
 Colocation: storing associated data or data on which associated operations
are performed on the same storage node.

 File-level Colocation allows quick file access. This avoids network


consumption caused by data migration.

NN #1

A C D A B D B C B C A D
DN #1 DN #2 DN #3 DN #4 DN #5 DN #6

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 20
Enhanced Features of Hive - Using
Colocation
 Step 1: Use an HDFS interface to create groupid and locatorid.

hdfs colocationadmin -createGroup -groupId groupid


-locatorIds locatorid1,locatorid2,locatorid3

 Step 2: Use the Hive Colocation function.


CREATE TABLE tbl_1 (id INT, name STRING) stored as
RCFILE
TBLPROPERTIES("groupId"="group1","locatorId"="loca
tor1");

CREATE TABLE tbl_2 (id INT, name STRING) row


format delimited fields terminated by '\t' stored
as TEXTFILE
TBLPROPERTIES("groupId"="group1","locatorId"="loca
tor1");

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 21
Enhanced Features of Hive - Encrypting
Columns
 Step 1: When creating a table, specify the columns to be encrypted
and the encryption algorithm.
create table encode_test (id INT, name STRING, phone
STRING, address STRING) row format serde
"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
WITH SERDEPROPERTIES(
"column.encode.columns"="phone,address","column.encode.
classname"="org.apache.hadoop.hive.serde2.AESRewriter"
);
 Step 2: Use an insert syntax to import data to tables whose columns
are encrypted.

insert into table encode_test select id, name,


phone, address from test;

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 22
Enhanced Features of Hive - Deleting
HBase Records in Batches
 Overview:
 In FusionInsight HD, Hive allows deletion of a single record from an HBase
table. Hive can use specific syntax to delete one or more data records that
meet criteria from its HBase tables.

 Usage:
 To delete some data from an HBase table, run the following HQL
statement:

remove table HBase_table where expression;

here, expression indicates the criteria for selecting the


data to be deleted.

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 23
Enhanced Features of Hive - Controlling
Traffic
By using the traffic control feature, you can control:
 Total number of established connections
 Number of established connections of each use
 Number of connections established within a unit period

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 24
Enhanced Features of Hive - Specifying
Row Delimiters
 Step 1: Set inputFormat and outputFormat when creating a table.
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS]
[db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[ROW FORMAT row_format]
STORED AS
inputformat
"org.apache.hadoop.hive.contrib.fileformat.SpecifiedD
elimiterInputFormat"
outputformat
"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutput
Format";

 Step 2: Specify the delimiter before a query.


set hive.textinput.record.delimiter="!@!";

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 25
Contents
1. Introduction to Hive

2. Hive Functions and Architecture

3. Basic Hive Operations

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 26
Hive SQL Overview
 DDL-Data definition language
 Table creation, table modification and deletion, partitions, and
data types

 DML-Data manipulation language


 Data import, export

 DQL-Data query language


 General query

 Complicated query, like Group by,Order by,Join, etc.

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 27
Hive Basic Operations (1)
 Data format example:1,huawei,1000.0

--Create managed table


CREATE TABLE IF NOT EXISTS example.employee(
Id INT COMMENT 'employeeid',
Company STRING COMMENT 'your company',
Money FLOAT COMMENT 'work money',)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
--Create external table
CREATE EXTERNAL TABLE IF NOT EXISTS
example.employee(
Id INT COMMENT 'employeeid',
Company STRING COMMENT 'your company',
Money FLOAT COMMENT 'work money',) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/localtest';

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 28
Hive Basic Operations (2)
--Modify a column
ALTER TABLE employee1 CHANGE money string COMMENT
'changed by alter' AFTER dateincompany;

--Add a column
ALTER TABLE employee1 ADD columns(column1 string);

--Modify the file format


ALTER TABLE employee3 SET fileformat TEXTFILE;

--Delete table data


DELETE column_1 from table_1 WHERE column_1=??;
DROP table_a;

--Describe table
DESC table_a;

--Show the statements for creating a table


SHOW CREATE table_a;

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 29
Hive Basic Operations (3)
--Load data from the local
LOAD DATA LOCAL INPATH 'employee.txt' OVERWRITE INTO TABLE
example.employee;

--Load data from another table


INSERT INTO TABLE company.person PARTITION(century= '21',year='2010')
SELECT id, name, age, birthday FROM company.person_tmp WHERE
century= '23' AND year='2010';

--Export data from a Hive table to HDFS


EXPORT TABLE company.person TO '/department';

--Import data from HDFS to a Hive table


IMPROT TABLE company.person FROM '/department';

--Insert data
INSERT INTO TABLE company.person
SELECT id, name, age, birthday FROM company.person_tmp
WHERE century= '23' AND year='2010';

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 30
Hive Basic Operations (4)
--WHERE
SELECT id, name FROM employee WHERE salary >= 10000;

--GROUP BY
SELECT department, avg(salary) FROM employee GROUP BY department;

--UNION ALL
SELECT id, salary, date FROM employee_a UNION ALL
SELECT id, salary, date FROM employee_b;

--JOIN
SELECT a.salary, b.address FROM employee a JOIN employee_info
b ON a.name=b.name;

--Subquery
SELECT a.salary, b.address FROM employee a JOIN (SELECT
address FROM employee_info where province='zhejiang') b ON
a.name=b.name;

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 31
Summary
 This module describes the following information about Hive:
basic principles, application scenarios, enhanced features in
FusionInsight and common Hive SQL statements.

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 32
Quiz
1. Which of the following scenarios does Hive apply to?

A. Real-time online data analysis

B. Data mining (user behavior analysis, interest analysis, and partition


demonstration)

C. Data aggregation (daily/weekly click count and click count rankings)

D. Non-real-time data analysis (log analysis and statistics analysis)

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 33
Quiz
1. Which of the following statements about Hive SQL operations are correct?

A. The keyword external is used to create an external table and the key
word internal is used to create a common table.

B. Specify the location information when creating an external table.

C. When data is uploaded to Hive, the data source must be one HDFS path.

D. When creating a table, column delimiters can be specified.

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 34
More Information
 Training materials:
 http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term100002
5450&id=Node1000011796
 Exam outline:
 http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node10
00011797
 Mock exam:
 http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node10000
11798
 Authentication process:
 http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 35
Thank You
www.huawei.com

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 36

You might also like