Chapter 3 Hive - Distributed Data Warehouse

Foreword

 The Apache Hive data warehouse software helps read, write, and manage
large data sets that reside in distributed storage by using SQL. Structures
can be projected onto stored data. The command line tool and JDBC driver
are provided to connect users to Hive.

1 Huawei Confidential
Objectives

 Upon completion of this course, you will understand:
 Hive application scenarios and basic principles
 Hive architecture and running process
 Hive SQL statements

Contents

1. Hive Overview

2. Hive Functions and Architecture

3. Basic Hive Operations

Introduction to Hive
 Hive is a data warehouse tool that runs on Hadoop and supports querying and
managing distributed data at petabyte (PB) scale.
 Hive features:
 Supporting flexible extraction, transformation, and load (ETL)
 Supporting multiple computing engines, such as Tez and Spark
 Supporting direct access to HDFS files and HBase
 Easy-to-use and easy-to-program

Application Scenarios of Hive

 Data mining
 User behavior analysis
 Interest partition
 Area display
 Non-real-time data analysis
 Log analysis
 Text analysis
 Data summarization
 Daily/Weekly user clicks
 Traffic statistics
 Data warehouse
 Data extraction
 Data loading
 Data transformation

Comparison Between Hive and Traditional Data Warehouses (1)

 Storage
 Hive: HDFS is used to store data. Theoretically, infinite expansion is possible.
 Conventional data warehouse: Clusters are used to store data, and they have an upper capacity limit. As capacity increases, the computing speed decreases sharply. Therefore, such data warehouses are applicable only to commercial applications with small data volumes.
 Execution engine
 Hive: Tez (default).
 Conventional data warehouse: You can select more efficient algorithms to perform queries, or take more optimization measures to speed up the queries.
 Usage method
 Hive: HQL (SQL-like).
 Conventional data warehouse: SQL.
 Flexibility
 Hive: Metadata storage is independent of data storage, decoupling metadata and data.
 Conventional data warehouse: Low flexibility. Data can be used for limited purposes.
 Analysis speed
 Hive: Computing depends on the cluster scale, and the cluster is easy to expand. With a large amount of data, computing is much faster than in a conventional data warehouse.
 Conventional data warehouse: When the data volume is small, the data processing speed is high. When the data volume is large, the speed decreases sharply.

Comparison Between Hive and Traditional Data Warehouses (2)

 Index
 Hive: Low efficiency.
 Conventional data warehouse: High efficiency.
 Ease of use
 Hive: Self-developed application models are needed, featuring high flexibility but delivering low usability.
 Conventional data warehouse: A set of mature report solutions is integrated to facilitate data analysis.
 Reliability
 Hive: Data is stored in HDFS, implementing high data reliability and fault tolerance.
 Conventional data warehouse: The reliability is low. If a query fails, you must start the task again. Data fault tolerance depends on hardware RAID.
 Environment dependency
 Hive: Low dependency on hardware, applicable to common machines.
 Conventional data warehouse: Highly dependent on high-performance business servers.
 Price
 Hive: Open-source product, free of charge.
 Conventional data warehouse: Expensive in commercial use.

Advantages of Hive

 High reliability and fault tolerance
 Cluster deployment of HiveServer
 Double MetaStores
 Timeout retry mechanism
 SQL-like
 SQL-like syntax
 Large number of built-in functions
 Scalability
 User-defined storage format
 User-defined function
 Multiple APIs
 Beeline
 JDBC
 Thrift
 ODBC

Hive Architecture

[Figure: Hive architecture. Access interfaces: JDBC, ODBC, Web Interface, and Thrift Server. Core components: Driver (Compiler, Optimizer, Executor) and MetaStore. Execution engines: Tez, MapReduce, and Spark.]

Hive Running Process
 The client submits the HQL command.
 Tez executes the query.
 YARN allocates resources to Hive applications in the cluster and enables
authorization for Hive jobs in the YARN queue.
 Hive updates data in HDFS or the Hive warehouse based on the table type.
 Hive returns the query result through the JDBC connection.

[Figure: HQL statement, Tez (default), YARN, HDFS]
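
As a rough illustration of the process above, the following sketch shows how a client might choose the execution engine for a session and inspect the plan that Hive generates before the query runs. It assumes the `invites` table used in the DDL examples of this course exists. `hive.execution.engine` is a standard Hive setting, but which engines are actually available depends on the cluster.

```sql
-- Select the execution engine for this session (tez, mr, or spark,
-- depending on what the cluster has installed).
SET hive.execution.engine=tez;

-- Show the plan produced by the compiler and optimizer without running it.
EXPLAIN
SELECT a.bar, count(*)
FROM invites a
WHERE a.ds = '2008-08-15'
GROUP BY a.bar;
```

Running the same SELECT without EXPLAIN submits the job to YARN and returns the rows to the client.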

Data Storage Model of Hive

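[Figure: Hive data storage model. A database contains tables. A table can be divided into partitions or buckets, and a partition can be further divided into buckets. Table data is also shown as skewed data and normal data.]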

Partition and Bucket
 Partition: Data tables can be partitioned based on the value of a certain field.
 Each partition is a directory.
 The number of partitions is not fixed.
 Partitions or buckets can be created in a partition.
 Bucket: Data can be stored in different buckets.
 Each bucket is a file.
 The number of buckets is specified when creating a table. Data within each bucket can be sorted.
 Data is hashed based on the value of a field and then stored in a bucket.
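
The points above can be sketched in HiveQL as follows. The table name `page_views`, its columns, the bucket count, and the ORC storage format are assumptions chosen for illustration and do not come from the course material.

```sql
-- Hypothetical table: one directory per dt value, and within each
-- partition the rows are hashed on user_id into 8 bucket files.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) SORTED BY (user_id ASC) INTO 8 BUCKETS
STORED AS ORC;

-- Partitions are not fixed in number; they can be added as data arrives.
ALTER TABLE page_views ADD PARTITION (dt = '2020-01-01');

-- List the partitions (directories) that currently exist.
SHOW PARTITIONS page_views;
```

The partition column `dt` is declared only in PARTITIONED BY, not in the column list, because its value is stored in the directory name rather than in the data files.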

Managed Table and External Table
 Hive can create managed tables and external tables.
 By default, a managed table is created, and Hive moves data to the data warehouse directory.
 When an external table is created, Hive accesses data outside the warehouse directory.
 If all processing is performed by Hive, you are advised to use managed tables.
 If you want to use Hive and other tools to process the same data set, you are advised to use
external tables.

Managed table compared with external table:

 CREATE/LOAD
 Managed table: Data is moved to the warehouse directory.
 External table: The data is not moved.
 DROP
 Managed table: The metadata and data are deleted together.
 External table: Only the metadata is deleted.
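
A minimal sketch of the difference, assuming a hypothetical HDFS directory `/tmp/shared_logs` that other tools also read. The table names are likewise illustrative.

```sql
-- Managed table: Hive owns the data under its warehouse directory.
CREATE TABLE logs_managed (line STRING);

-- External table: Hive only records metadata that points at the directory.
CREATE EXTERNAL TABLE logs_external (line STRING)
LOCATION '/tmp/shared_logs';

-- The "Table Type" field in the output shows MANAGED_TABLE or EXTERNAL_TABLE.
DESCRIBE FORMATTED logs_external;

-- Dropping the managed table removes both metadata and data.
DROP TABLE logs_managed;

-- Dropping the external table removes only the metadata;
-- the files under /tmp/shared_logs remain.
DROP TABLE logs_external;
```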

Functions Supported by Hive
 Built-in Hive Functions
 Mathematical functions, such as round(), floor(), abs(), and rand().
 Date functions, such as to_date(), month(), and day().
 String functions, such as trim(), length(), and substr().
 User-Defined Function (UDF)
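
As a quick illustration, the built-in functions listed above can be tried directly in a SELECT without a FROM clause (supported since Hive 0.13). The literal arguments are arbitrary sample values.

```sql
-- Mathematical functions
SELECT round(3.14159, 2),   -- 3.14
       floor(2.7),          -- 2
       abs(-5);             -- 5

-- Date functions
SELECT to_date('2020-01-15 10:30:00'),  -- 2020-01-15
       month('2020-01-15'),             -- 1
       day('2020-01-15');               -- 15

-- String functions (substr positions start at 1)
SELECT trim('  hive  '),            -- hive
       length('hive'),              -- 4
       substr('warehouse', 1, 4);   -- ware

-- List all available functions and show the documentation of one of them.
SHOW FUNCTIONS;
DESCRIBE FUNCTION EXTENDED substr;
```

User-defined functions registered in a session appear in the SHOW FUNCTIONS output alongside the built-in ones.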

Hive SQL Overview
 DDL (Data Definition Language):
 Creates, modifies, and deletes tables and partitions, and defines data types.
 DML (Data Manipulation Language):
 Imports and exports data.
 DQL (Data Query Language):
 Performs simple queries.
 Performs complex queries such as GROUP BY, ORDER BY, and JOIN.

DDL Operations
-- Create a table:
hive> CREATE TABLE pokes (foo INT, bar STRING);

hive> CREATE EXTERNAL TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

-- List the tables:
hive> SHOW TABLES;

-- Describe a table:
hive> DESCRIBE invites;

-- Modify a table:
hive> ALTER TABLE events RENAME TO 3koobecaf;
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
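
The CREATE TABLE statements above rely on Hive's default text format. As a sketch of how a column separator can be specified at creation time, the following uses a hypothetical table `pokes_csv` for comma-separated files.

```sql
-- Hypothetical table for comma-separated text files.
CREATE TABLE IF NOT EXISTS pokes_csv (foo INT, bar STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- Inspect the storage details Hive recorded for the table.
DESCRIBE FORMATTED pokes_csv;

-- Remove the table when it is no longer needed.
DROP TABLE IF EXISTS pokes_csv;
```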

DML Operations
-- Load data to a table:

hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');

-- Export the table's data and metadata to a directory in HDFS:

hive> EXPORT TABLE invites TO '/department';
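
For comparison, here is a sketch of two related statements: loading a file that already resides in HDFS, and importing a previously exported table. The input path and the table name `invites_copy` are hypothetical.

```sql
-- Load a file that is already in HDFS (no LOCAL keyword).
-- The file is moved into the table's directory rather than copied.
LOAD DATA INPATH '/user/hive/input/kv1.txt' INTO TABLE pokes;

-- Re-create a table from a directory written by EXPORT TABLE.
IMPORT TABLE invites_copy FROM '/department';
```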

DQL Operations (1)
--SELECTS and FILTERS:

hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';

hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';

--GROUP BY:

hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

DQL Operations (2)
--MULTITABLE INSERT:

FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200;

--JOIN:
hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;

--STREAMING:

hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING
'/bin/cat' WHERE a.ds > '2008-08-09';
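
The queries above cover filtering, grouping, joins, and streaming. As a sketch of ORDER BY, the following query ranks grouped results, assuming the `invites` table and partition from the earlier examples. The alias `cnt` and the limit of 10 are arbitrary choices.

```sql
-- Count rows per value of bar and list the ten most frequent values.
SELECT a.bar, count(*) AS cnt
FROM invites a
WHERE a.ds = '2008-08-15'
GROUP BY a.bar
ORDER BY cnt DESC
LIMIT 10;
```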

Summary

 This course introduces Hive application scenarios, basic principles, Hive
architecture, running process, and common Hive SQL statements.

Quiz

1. Which of the following scenarios are applicable to Hive? ( )
A. Online real-time data analysis
B. Data mining (including user behavior analysis, region of interest, and regional display)
C. Data summary (daily/weekly user clicks and click ranking)
D. Non-real-time analysis (log analysis and statistical analysis)
2. Which of the following statements about basic Hive SQL operations is correct? ( )
A. You need to use the keyword "external" to create an external table and specify the keyword
"internal" to create a normal table.
B. The location information must be specified when an external table is created.
C. When data is loaded to Hive, the source data must be a path in HDFS.
D. Column separators can be specified when a table is created.

Recommendations

 Huawei Cloud Official Web Link:
 https://www.huaweicloud.com/intl/en-us/
 Huawei MRS Documentation:
 https://www.huaweicloud.com/intl/en-us/product/mrs.html
 Huawei TALENT ONLINE:
 https://e.huawei.com/en/talent/#/

Thank you.

Bring digital to every person, home, and organization for a fully connected,
intelligent world.

Copyright © 2020 Huawei Technologies Co., Ltd. All Rights Reserved.

The information in this document may contain predictive statements including,
without limitation, statements regarding the future financial and operating
results, future product portfolio, new technology, etc. There are a number of
factors that could cause actual results and developments to differ materially
from those expressed or implied in the predictive statements. Therefore, such
information is provided for reference purpose only and constitutes neither an
offer nor an acceptance. Huawei may change the information at any time
without notice.
