HCIA-Big Data V3.5 Learning Guide
ISSUE: V3.5
Copyright © Huawei Technologies Co., Ltd. 2022. All rights reserved.
No part of this document may be reproduced or transmitted in any form or by any
means without prior written consent of Huawei Technologies Co., Ltd.
Huawei and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd.
All other trademarks and trade names mentioned in this document are the property of
their respective holders.
Notice
The purchased products, services and features are stipulated by the contract made
between Huawei and the customer. All or part of the products, services and features
described in this document may not be within the purchase scope or the usage scope.
Unless otherwise specified in the contract, all statements, information, and
recommendations in this document are provided "AS IS" without warranties,
guarantees or representations of any kind, either express or implied.
The information in this document is subject to change without notice. Every effort has
been made in the preparation of this document to ensure accuracy of the contents, but
all statements, information, and recommendations in this document do not constitute
a warranty of any kind, express or implied.
HCIA-Big Data V3.5 (For Trainees) Page 2
Contents
1 Big Data Development Trends and the Kunpeng Big Data Solution .................................. 8
1.1 Big Data Era ............................................................................................................................................................................... 8
1.1.1 Background ............................................................................................................................................................................. 8
1.1.2 What Is Big Data ................................................................................................................................................................... 8
1.1.3 Big Data Analysis vs. Traditional Data Analysis ........................................................................................................ 9
1.2 Big Data Application Fields.................................................................................................................................................10
1.2.1 Big Data Computing Tasks ..............................................................................................................................................12
1.2.2 Hadoop Big Data Ecosystem ..........................................................................................................................................12
1.3 Challenges and Opportunities Faced by Enterprises .................................................................................................14
1.3.1 Big Data Challenges ...........................................................................................................................................................14
1.3.2 Big Data Opportunities .....................................................................................................................................................15
1.4 Huawei Kunpeng Solution ..................................................................................................................................................16
1.4.1 Introduction to Kunpeng ..................................................................................................................................................16
1.4.2 Kunpeng Big Data Solution .............................................................................................................................................18
1.4.3 Huawei Cloud Big Data Services ...................................................................................................................................18
1.4.4 Huawei Cloud MRS ............................................................................................................................................................19
1.5 Quiz .............................................................................................................................................................................................21
2 HDFS and ZooKeeper...................................................................................................................... 22
2.1 HDFS: Distributed File System ...........................................................................................................................................22
2.1.1 HDFS Overview ....................................................................................................................................................................22
2.1.2 HDFS Concepts ....................................................................................................................................................................22
2.1.3 HDFS Key Features .............................................................................................................................................................23
2.1.4 HDFS File Read and Write ...............................................................................................................................................28
2.2 ZooKeeper: Distributed Coordination Service ..............................................................................................................30
2.2.1 ZooKeeper Overview ..........................................................................................................................................................30
2.2.2 ZooKeeper Architecture ....................................................................................................................................................30
2.3 Quiz .............................................................................................................................................................................................33
3 HBase and Hive ................................................................................................................................ 34
3.1 HBase: Distributed Database .............................................................................................................................................34
3.1.1 HBase Overview and Data Models ...............................................................................................................................34
3.1.2 HBase Architecture .............................................................................................................................................................38
3.1.3 HBase Performance Tuning ............................................................................................................................................42
⚫ Volume
The most obvious feature of big data is its huge data volume. Data is now commonly measured not just in GB but in TB, and even at PB scale. With the rapid development of networks and information technologies, data continues to grow explosively.
⚫ Variety
The second characteristic of big data is the diversity of data types, mainly
reflected in diverse data sources and data formats. Data generally falls into three
types: 1. Structured data, such as that of financial systems, information management
systems, and medical systems; 2. Unstructured data, such as video, images, and
audio; 3. Semi-structured data, such as HyperText Markup Language (HTML),
documents, emails, and web pages.
⚫ Velocity
The third characteristic is the speed of data processing and analysis. In the big data
era, data is time-sensitive: its value decreases as time goes by. To mine as much
value as possible, algorithms must quickly process and analyze data and return
results to users, meeting their real-time requirements.
⚫ Value
In contrast with traditional small data, the biggest value of big data comes from
mining valuable information, used to predict future trends and patterns, out of large
amounts of diverse and mostly irrelevant data. This is similar to gold miners
extracting gold from massive amounts of sand.
teachers and students, and completion of exams and tests. In these cases, big data
allows the learning analysis system to continuously and accurately analyze the data
of each student's participation in teaching activities. Teachers can quickly diagnose
problems with more detailed information, such as the time and frequency of
readings, give suggestions for improvement, and predict students' academic
performance.
For examination evaluation, big data requires educators to update and transcend
traditional ideas. That is, they need to also focus on how students behave during the
teaching process. How long do students spend on each question in an exam? What
are the maximum, minimum, and average times? For questions that have been asked
before, have the students answered them correctly? Which clues in a problem
benefited the students? By monitoring this information and providing students with
personalized learning schemes and methods through an adaptive learning system,
personal learning data archives can be generated. These archives help educators
understand each student's entire learning process, so that students master the
course content and teaching can be tailored to each student's aptitude.
⚫ Government and public security
In the government and public security field, big data can be used to monitor
population flows and generate prompt warnings, so that administrative departments
can be notified of emergencies such as abnormal population flows. Cloud computing
and massive data can also be used to locate areas that are most vulnerable to
criminals and create a hotspot map of areas with high crime rates. When studying
the crime rate in a certain district, various factors in the adjacent district are taken
into consideration to provide support for the police departments to locate the high
crime areas and catch suspects.
⚫ Transportation planning
Traffic management centers can dynamically monitor the status of roads and hubs
using big data technologies, obtaining comprehensive information such as road
conditions and passenger traffic in key hubs such as railway stations. This provides
data support for related departments' emergency response plans.
Furthermore, traffic management departments can use big data to analyze and
judge road safety situations and clear congested roads in a timely manner. Roads
where accidents frequently occur and potential security risks exist are strictly
managed and controlled. Meanwhile, warnings and related road traffic control
information are released promptly during bad weather to ensure traffic safety.
⚫ Clean energy
In the nine-day period from June 20 to 28, 2018, Qinghai province relied on clean
energy such as hydro, wind, and solar power for 216 consecutive hours of power
supply, achieving zero-emission electricity generation and advancing the practice of
all-clean-energy power supply with the goal of maximizing new energy consumption.
Adjusting the peak load caused by the high proportion of new energy was the basis
and key of the "9-Day Green Power Supply" activity, which used multi-energy
complementary coordinated control technologies and a big data platform to improve
new energy management. A compensation mechanism and a load participation
mechanism were introduced to peak adjustment to expand the space for
photovoltaic (PV) absorption.
different machines but you only cite a file path. As a user, you do not need to know
which track and which sector the file is distributed to. HDFS manages the data for you.
After the data is saved, we need to process and analyze it. This is where
MapReduce, Tez, Spark, and Flink come in. MapReduce is the first-generation compute
engine, Tez and Spark are second-generation compute engines, and Flink is mainly used
for real-time computing. Developers found it troublesome to write program code
directly against these compute engines. To simplify development and improve
efficiency, a higher-level abstraction layer was designed to describe algorithms and
data processing flows. Pig and Hive fill this role: Pig describes MapReduce jobs in a
script-like way, while Hive uses SQL. Both convert their scripts or SQL statements into
MapReduce programs, which the compute engines then run. Because Hive's language is
SQL-like and easy to use, Hive has become a core component of big data
warehousing.
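To see why engines like MapReduce are considered low-level, the classic word-count job can be sketched as explicit map and reduce functions. The following is a simplified, single-process Python illustration of the programming model, not code for a real Hadoop cluster:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    # Shuffle + reduce: group the pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

lines = ["big data big value", "data velocity"]
result = reduce_phase(map_phase(lines))
print(result["big"])   # 2
print(result["data"])  # 2
```

In Hive, the same logic collapses to a single SQL-like statement (SELECT word, COUNT(*) ... GROUP BY word), which Hive translates into a job shaped like the one above — this is exactly the convenience the abstraction layer provides.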
In Hadoop 1.0, MapReduce was both the only compute engine and the component
responsible for resource and job scheduling. As other compute engines joined in later
versions, scheduling conflicts arose, because each engine could schedule only its own
resources and jobs. To resolve this, Yarn and Oozie joined the compute engine family:
Yarn is responsible for system resource scheduling and management, and Oozie is
responsible for compute job flow scheduling.
HDFS is the default persistent storage layer. HBase is a column-oriented distributed
database suited to structured storage; however, like Hive, the bottom layer of HBase
still depends on HDFS for physical storage. The difference is that Hive is suitable for
analyzing and querying data accumulated over a period of time, while HBase is suitable
for querying big data in real time. In addition, ZooKeeper is required for HBase
deployment. ZooKeeper is a distributed coordination service, providing configuration,
metadata maintenance, and namespace services.
Sqoop and Flume were developed to fill the gap left by traditional data collection
tools, which cannot handle massive data volumes. Sqoop is an open-source tool for
transferring data between Hadoop (Hive) and traditional databases such as MySQL and
PostgreSQL. You can import data from a relational database into HDFS, or export data
from HDFS to a relational database. Flume is a highly available, reliable, distributed
system provided by Cloudera for collecting, aggregating, and transmitting massive logs.
Flume supports customized data senders in the log system to collect data.
In addition, there are some specialized systems and components. For example, Mahout
is a distributed machine learning library, and Ambari is a tool used to create, manage,
and monitor Hadoop clusters.
In short, you can think of the big data ecosystem as a kitchen ecosystem. In order to cook
different dishes, you need a variety of cookware. At the same time, customers'
requirements are also keeping up with the times. Therefore, your tools need to be
upgraded. In addition, no universal tool can handle all situations. As a result, the
ecosystem will become larger and larger.
security of data storage, and therefore poses higher requirements on multiple data
replicas and disaster recovery mechanisms.
⚫ Lack of big data talents
Building and maintaining big data components requires professional personnel, so
enterprises need to build and cultivate professional teams experienced in big data
management and applications. Hundreds of thousands of big data-related jobs are
created around the world every year, and the future talent gap is expected to exceed
1 million workers. Universities and enterprises must therefore work together to
cultivate talent.
⚫ Trade-off between data openness and privacy
As big data applications become increasingly important, data resource openness and
sharing have become the key to maintaining advantages in the data war. However,
data openness will inevitably infringe on some users' privacy. How to effectively
protect citizens' and enterprises' privacy while promoting data openness, application,
and sharing and gradually strengthen privacy legislation will be a major challenge in
the big data era.
As the value of big data gains wider recognition among industry users, the market
demand will surge, and new technologies, products, services, and business forms
oriented to the big data market will emerge continuously. Big data will create a
high-growth market for the information industry. In the hardware and integrated
device field, big data will face challenges such as effective storage, fast read/write,
and real-time analysis, which will have a significant impact on the chip and storage
industries. In addition, integrated data storage and processing servers and in-memory
compute markets are emerging. In the software and service field, the huge
value of big data brings urgent requirements for quick data processing and analysis,
which will lead to unprecedented prosperity in the data mining and business
intelligence markets.
investment and continuous innovation in the chip field, Huawei has built a differentiated
competitive chip system covering computing, storage, transmission, management, and AI.
Therefore, Huawei is capable of providing a computing foundation that supports the
sustainable development of the Kunpeng computing industry. With the development of
the Kunpeng computing industry in the Chinese market and the joining of more
enterprises in China and abroad, the Kunpeng computing industry will eventually become
a computing industry with continuous innovation capabilities and global leadership.
The Kunpeng computing industry is a collection of full-stack IT infrastructure, industry
applications, and services powered by Huawei Kunpeng processors. This industry includes
personal computers (PCs), servers, storage devices, OSs, middleware, virtualization,
databases, cloud services, industry applications, and consulting and management services.
With the powerful compute power provided by Kunpeng processors, Kunpeng computing
will play an important role in the digital transformation of various industries.
In terms of ecosystem implementation, Huawei Cloud has worked with industry-leading
independent software vendors (ISVs) and System Integrators (SIs) to create many success
stories in industries such as finance, energy, government and enterprise, transportation,
and retail. At the beginning of 2019, Huawei Cloud worked with ChinaSoft International
to smoothly migrate nine service systems to the cloud for the largest dairy product
supplier in China in only four hours. As cloud migration is the key service of the
enterprise, Huawei Cloud helped it migrate up to 68 hosts and 14 databases to the cloud.
In March 2019, Huawei Cloud and its partner Jingying Shuzi and China Coal Research
Institute (CCRI), jointly developed the Mine Brain solution, an industrial intelligent twins
solution for the coal industry.
The full openness to developers accelerates the implementation of Kunpeng industry
cases. Huawei Cloud DevCloud has provided services for 300,000+ developers and has
been deployed in 30+ city campuses with developed software industries in China. In the
future, the Kunpeng industry will continue to strengthen cooperation and construction in
ecosystem fields, including technology ecosystem, developer ecosystem, community
construction, cooperation with universities, industry ecosystem, and partner ecosystem, to
continuously enhance full ecosystem vitality.
In terms of computing capabilities, the TaiShan server based on the Huawei Kunpeng
processor fully demonstrates the advantages of efficient, secure, and reliable computing.
In addition, the TaiShan server is an open platform and supports mainstream software
and hardware in the industry. TaiShan 100 is a first-generation server based on the
Kunpeng 916 processor and was launched in 2016. In 2019, Huawei launched the
TaiShan 200 server based on the latest Kunpeng 920 processor, which is the main
product in the market.
The Kunpeng product system supports mainstream OSs such as CentOS, Ubuntu,
NeoKylin, Debian, Huawei EulerOS, Kylin, SUSE, and Deepin, as well as niche Chinese
OSs (Hunan Kylin, Linx, YMOS, Taishan Guoxin, BCLinux, and NeoShine). Among
non-Chinese OSs, Red Hat, a mainstream product outside China, has been temporarily
withdrawn from this market because it is subject to the Export Administration
Regulations (EAR), although the Kunpeng product system remains compatible with it.
Currently, the Kunpeng processor supports only Linux operating systems.
In terms of cloud service applications, Huawei Cloud provides 69 Kunpeng cloud services
(such as Kunpeng ECS, Kunpeng BMS, Kunpeng CCE, and Kunpeng CCI) and more than
20 solutions (such as Kunpeng DeC, Kunpeng HPC, Kunpeng big data, Kunpeng enterprise
applications, and Kunpeng native applications) for governments, finance, enterprises,
Internet, and other industries. Huawei Cloud Kunpeng cloud services and solutions have a
full-stack ecosystem. Based on the Kunpeng community, Huawei Kunpeng Solution
provides support for building mainstream components that cover multiple service
scenarios and provides a platform for technical communication and discussion; completes
the adaptation and compatibility certification of multiple open-source and operating
systems, databases, middleware, and other system software in China; works with industry
partners to develop industry-oriented Kunpeng solutions to serve end users.
The data access layer provides data collection capabilities through components
including Flume (data ingestion), Loader (relational data import), and Kafka (a highly
reliable message queue), so that data from various data sources can be collected. In
the data analysis phase, you can select Hive (data warehouse), SparkSQL, or Presto
(the latter two are interactive query engines) to analyze data with SQL-like queries.
MRS also provides multiple mainstream compute engines, including MapReduce (batch
processing), Tez (DAG model), Spark (in-memory computing), Spark Streaming
(micro-batch stream computing), Storm (stream computing), and Flink (stream
computing), for various big data application scenarios. Both HDFS (a universal
distributed file system) and OBS (featuring high availability and low cost) are used for
underlying data storage.
MRS can connect to DAYU to provide one-stop data asset management, development,
exploration, and sharing capabilities based on the enterprise data lake, helping users
quickly build big data processing centers and implement data governance and
development scheduling. This enables quick monetization of data.
Huawei Cloud big data services have the following advantages: 1. 100% compatibility
with open-source ecosystems, plug-in management of third-party components, and a
one-stop enterprise platform; 2. Decoupled storage and compute resources, allowing
flexible configuration of each; 3. Use of Huawei-developed Kunpeng servers. Thanks to
the multi-core performance advantages of Kunpeng processors and Huawei Cloud's
optimization of task scheduling algorithms, the CPUs deliver higher concurrency,
providing more compute power for big data computing.
⚫ High performance
MRS supports the in-house CarbonData, a high-performance big data storage
solution. It allows one data set to apply to multiple scenarios and supports features
such as multi-level indexing, dictionary encoding, pre-aggregation, dynamic
partitioning, and quasi-real-time data query. This improves I/O scanning and
computing performance and returns analysis results of tens of billions of data
records in seconds. In addition, MRS supports the self-developed enhanced scheduler
Superior, which breaks the scale bottleneck of a single cluster and is capable of
scheduling over 10,000 nodes in a cluster.
⚫ Low cost
Based on diversified cloud infrastructure, MRS provides various computing and
storage choices and supports storage-compute decoupling, delivering cost-effective
mass data storage solutions. MRS supports auto scaling to address peak and off-peak
service loads, releasing idle resources on the big data platform for customers.
MRS clusters can be created and scaled out when you need them, and can be
terminated or scaled in after you use them, minimizing cost.
⚫ High security
With Kerberos authentication, MRS provides role-based access control (RBAC) and
sound audit functions. MRS is a one-stop big data platform that allows different
physical isolation modes to be set up for customers in the public resource area and
dedicated resource area of Huawei Cloud as well as Huawei Cloud Stack Online in
the customer's equipment room. A cluster supports multiple logical tenants.
Permission isolation enables the computing, storage, and table resources of the
cluster to be divided based on tenants.
⚫ Easy O&M
MRS provides a visualized big data cluster management platform, improving O&M
efficiency. MRS supports rolling patch upgrade and provides visualized patch release
information and one-click patch installation without manual intervention, ensuring
long-term stability of user clusters.
⚫ High reliability
MRS delivers high availability (HA) and real-time SMS and email notification on all
nodes.
Big data is ubiquitous in people's lives. Huawei Cloud MRS is suitable for processing
big data in the sectors such as the Internet of things (IoT), e-commerce, finance,
manufacturing, healthcare, energy, and government affairs. Typical big data
application scenarios are as follows:
⚫ Large-scale data analysis
Large-scale data analysis is a major scenario in modern big data systems. Generally,
an enterprise has multiple data sources. After accessing the data sources, the
enterprise needs to perform ETL processing on data to form model-based data for
each service module to analyze and sort out. This type of service has the following
characteristics: 1. The service does not have high requirements on real-time
execution, and the job execution time ranges from dozens of minutes to hours; 2.
The data volume is huge; 3. Various data sources and formats exist; 4. Data
processing usually consists of multiple tasks, and resources need to be planned in
detail. In the environmental protection industry, climate data is stored on OBS and
periodically dumped into HDFS for batch analysis. 10 TB of climate data can be
analyzed in 1 hour. In this scenario, MRS has the following advantages: 1. Low cost:
OBS is used to implement low-cost storage; 2. Analysis of massive sets of data:
TB/PB-level data is analyzed by Hive; 3. Visualized data import and export tool:
Loader exports data to DWS for business intelligence (BI) analysis.
⚫ Large-scale data storage
A user who has a large amount of structured data usually requires index-based
quasi-real-time query. For example, in an Internet of Vehicles (IoV) scenario, vehicle
maintenance information is queried by plate numbers. Therefore, vehicle information
is indexed by the plate number when stored, to implement second-level response in
this scenario. Generally, the data volume is large, and the user may store data for
one to three years.
For example, in the IoV industry, an automobile company stores data on HBase,
which supports PB-level storage and CDR queries in milliseconds. In this scenario,
MRS has the following advantages: 1. Real-time: Kafka implements real-time access
of messages from a large number of vehicles. 2. Storage of massive sets of data:
HBase stores a large volume of data and implements millisecond-level data query. 3.
Distributed data query: Spark analyzes and queries a large volume of data.
⚫ Real-time data processing
Real-time data processing is usually used in scenarios such as anomaly detection,
fraud detection, rule-based alarming, and service process monitoring. Data is
processed while being input to the system. For example, in the Internet of elevators
& escalators (IoEE) industry, data of smart elevators and escalators is imported to
MRS streaming clusters in real time for real-time alarming. In this scenario, MRS has
the following advantages: 1. Real-time data collection: Flume collects data in real
time and provides various collection and storage connection modes. 2. Data source
access: Kafka accesses data from tens of thousands of elevators and escalators in real
time.
1.5 Quiz
1. What challenges do we face in the big data era?
⚫ Client: enables users to access the file system by interacting with the NameNode
and DataNodes. HDFS exposes a file namespace and allows user data to be stored
as files. Users communicate with HDFS through the client.
⚫ Data block: the minimum unit of data read from or written to a disk. Files are
stored on disks as blocks, and a file system processes data in chunks that are an
integer multiple of the disk block size. Files in HDFS are likewise divided into
multiple logical blocks for storage. In Hadoop 2.0 and later, the default data
block size is 128 MB. As a distributed file system, HDFS gains the following
advantages from block-based storage:
o Large-scale file storage: Files are stored in blocks. A large-scale file can be split
into multiple file blocks, and different file blocks can be distributed to different
nodes. Therefore, the size of a file is not limited by the storage capacity of a
single node. The capacity can be much larger than the storage capacity of any
node on the network.
o Simplified system design: First, HDFS greatly simplifies storage management
because the file block size is fixed. In this way, it is easy to calculate the number
of file blocks that can be stored on a node. Second, HDFS facilitates metadata
management. Metadata does not need to be stored together with file blocks. It
can be managed by other systems.
o Suitability for data backup: Each file block can be redundantly stored on multiple
nodes, greatly improving the fault tolerance and availability of the system.
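As a rough illustration of block-based storage, the following Python sketch (simple arithmetic only, not HDFS client code) shows how a file is divided into 128 MB logical blocks:

```python
BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size in Hadoop 2.0 and later

def split_into_blocks(file_size):
    """Return the sizes of the logical blocks a file of file_size bytes occupies."""
    full, rest = divmod(file_size, BLOCK_SIZE)
    blocks = [BLOCK_SIZE] * full
    if rest:
        # Unlike a local file system, the last HDFS block only occupies
        # the bytes it actually needs, not the full 128 MB.
        blocks.append(rest)
    return blocks

# A 300 MB file occupies two full 128 MB blocks plus one 44 MB block,
# and each block can be placed (and replicated) on a different DataNode.
sizes = split_into_blocks(300 * 1024 * 1024)
print(len(sizes))                   # 3
print(sizes[-1] // (1024 * 1024))   # 44
```

Because a file is just a list of such blocks, its total size can exceed the capacity of any single node, which is the "large-scale file storage" advantage described above.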
⚫ Data balancing
HDFS provides a data balancing mechanism, which ensures that data is evenly
distributed across DataNodes.
⚫ Metadata reliability
Metadata operations are recorded through a log mechanism, and metadata is stored
on both the active and standby NameNodes.
A snapshot mechanism, like that of common file systems, ensures that data can be
restored promptly in the case of misoperations.
⚫ Security mode
When a hard disk of a node is faulty, the node enters the security mode. In this
mode, HDFS supports only access to metadata. In this case, data on HDFS is read-only.
Other operations, such as creating and deleting files, will fail. After the disk
fault is rectified and data is restored, the node exits the security mode.
to the active state and provides services for external systems. In addition, the
JournalNodes allow only one NameNode to write to them at a time.
⚫ Adding disk balancers to DataNodes
A data balancer can be added between the different disks of a single DataNode.
Earlier versions of Hadoop support balancing only between DataNodes; if data became
unevenly distributed across the disks of one node, there was no good way to handle
the problem. You can now run the hdfs diskbalancer command to balance data
among the disks on a node. This function is disabled by default; to enable it,
manually set dfs.disk.balancer.enabled to true.
For a service with n instances, n may be odd or even. Assume the DR capability,
that is, the number of instance failures the service tolerates, is x. If n is odd
and n = 2x + 1, x + 1 votes are required for a node to become the leader. If n is
even and n = 2x + 2, x + 2 votes are required for a node to become the leader.
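Both cases reduce to requiring a strict majority of the n instances; a small helper function (hypothetical, shown for illustration) makes this concrete:

```python
def quorum(n):
    """For n service instances, return (votes required to elect a leader,
    number of instance failures tolerated). A leader needs a strict
    majority, so the tolerated failures x satisfy the text's two cases:
    n = 2x + 1 needs x + 1 votes, n = 2x + 2 needs x + 2 votes."""
    votes = n // 2 + 1        # strict majority
    tolerated = (n - 1) // 2  # DR capability x
    return votes, tolerated
```

Note that 5 and 6 instances tolerate the same number of failures (2), which is why odd instance counts are the usual recommendation.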
2.3 Quiz
1. Why is the size of an HDFS data block larger than that of a disk block?
2. Can HDFS data be read when it is written?
⚫ Data maintenance: In relational databases, the most recent value replaces the
original value in the record; once overwritten, the original value no longer
exists. When an update is performed in HBase, a new version is generated, and
the original one is retained.
⚫ Scalability: It is difficult to horizontally expand relational databases, and the space
for vertical expansion is limited. In contrast, distributed databases, such as HBase and
BigTable, are developed to implement flexible horizontal expansion. Their
performance can easily be scaled by adding or reducing the hardware in a cluster.
Figure 3-1
Column qualifier: A column qualifier is appended to a column family to provide an
index for a given piece of data. For a content column family, for example, one
qualifier might be content:html and another content:pdf. Column families are
fixed when a table is created, but column qualifiers are mutable and may vary
greatly between rows.
Cell: A cell is identified by the combination of row, column family, and column
qualifier. It contains a value and a timestamp, which indicates the version of
the value.
Timestamp: A timestamp is written with each value and identifies a given version
of the value. By default, the timestamp is the time on the RegionServer when the
data is written, but you can specify a different timestamp when placing data in
a cell.
Compared with traditional data tables, HBase data tables have the following
characteristics:
⚫ Each table has rows and columns. All columns belong to a column family.
⚫ The intersection of a row and a column is called a cell, and cells are
versioned. The content of a cell is an uninterpreted byte array.
⚫ The row key of a table is also a byte array, so any data can be saved, whether it is a
string or a number.
⚫ HBase tables are sorted by row key, in byte order.
⚫ All tables must have primary keys.
Row Key          Timestamp   Column
"com.cnn.www"    T9          anchor:cnnsi.com="CNN"
"com.cnn.www"    T8          anchor:my.look.ca="CNN.com"
"com.cnn.www"    T6          contents:html="<html>..."
"com.cnn.www"    T5          contents:html="<html>..."
"com.cnn.www"    T3          contents:html="<html>..."
Empty cells in the HBase conceptual view are not stored at all. Therefore, no
value is returned for a request at timestamp T8 in the contents:html column, and
likewise no value is returned for a request for anchor:my.look.ca at timestamp
T9. However, if no timestamp is provided, the latest value of the column is
returned. Given multiple versions, the most recent one is also the first one
found, because timestamps are stored in descending order. Therefore, if no
timestamp is specified, a request for the values of all columns in row
com.cnn.www returns contents:html from T6, anchor:cnnsi.com from T9, and
anchor:my.look.ca from T8.
⚫ Obtain the location of the ROOT table from the /hbase/rs file in ZooKeeper. The
ROOT table has only one region.
⚫ Use the ROOT table to find the corresponding HRegion in the META table. The
ROOT table is effectively the first region of the META table, and each region in
the META table has a record in the ROOT table.
⚫ Use the META table to locate the HRegion of a user table. Each HRegion of a
user table is a record in the META table. The ROOT table is never split into
multiple HRegions, which guarantees that any region can be located within at
most three hops. The client caches the location information it has queried, and
the cache does not expire proactively.
However, if all caches on the client become invalid, you need to perform the network
roundtrip six times to locate the correct HRegion: three times to detect the cache failure,
and another three times to obtain the HRegion location.
Store: Each HRegion consists of at least one Store. HBase keeps data that is
accessed together in the same Store; that is, HBase creates one Store per column
family, so the number of Stores equals the number of column families. A Store
consists of one MemStore and zero or more StoreFiles. HBase decides whether to
split an HRegion based on Store size.
MemStore: resides in memory and holds modified data, that is, KeyValues. When
the size of a MemStore reaches a threshold (64 MB by default), the MemStore is
flushed to a file, that is, a snapshot is generated. HBase uses a dedicated
thread to perform MemStore flush operations.
StoreFile: After data in the MemStore memory is written to a file, the file becomes a
StoreFile. The data at the bottom layer of the StoreFile is stored in the HFile format.
HFile: the binary storage format for KeyValue data in HBase. An HFile has no
fixed length; only its Trailer and FileInfo sections have fixed formats. The
Trailer holds pointers to the start of the other data blocks, and FileInfo
records meta information about the file. A data block is the basic unit of HBase
I/O. To improve efficiency, HRegionServer provides an LRU-based block cache. The
size of each data block can be specified by a parameter when a table is created;
the default block size is 64 KB. Large blocks favor sequential scans, and small
blocks favor random queries. Apart from the Magic number at its beginning, each
data block is composed of KeyValue pairs. The Magic content is a random number
used to detect data corruption.
HLog (WAL log): is the write ahead log, which is used for disaster recovery (DR). HLog
records all data changes. Once an HRegionServer breaks down, data can be recovered
from the logs.
LogFlusher: periodically writes information in the cache to log files.
LogRoller: manages and maintains log files.
⚫ HMaster first processes the remaining HLog files on the faulty RegionServer. The
remaining HLog files contain log records from multiple regions.
⚫ The system splits the HLog data based on the region to which each log belongs and
saves the split data to the directory of the corresponding region. Then, the system
allocates the invalid region to an available RegionServer and sends the HLog logs
related to the region to the corresponding RegionServer.
⚫ After obtaining the allocated region object and related HLogs, the RegionServer
performs operations in the logs again, writes the data in the logs to the MemStore
cache, and then updates the data to the StoreFile file on the disk for data
restoration.
⚫ Advantages of shared logs: The performance of writing data to tables is improved.
Disadvantage: Logs need to be split during restoration.
3.1.3.2 Compaction
The HBase architecture follows the Log-Structured Merge Tree (LSM tree) design.
User data is written to the WAL and then to the cache; when certain conditions
are met, the cached data is flushed to disk, generating an HFile data file. As
more and more data is written, flushes become more frequent and HFiles
accumulate. Too many data files increase the number of I/O operations required
for queries, so HBase attempts to merge these files. This process is called
compaction.
Some HFiles from one Store of a region are selected for compaction. The
compaction principle is simple: KeyValues are read from the files to be merged,
sorted in ascending order, and written to a new file, which then replaces all
the merged files in serving requests. Based on scale, HBase compactions are
classified into minor compactions and major compactions.
Minor compaction is when small and adjacent StoreFiles are combined into a larger
StoreFile. In this process, the cells that have been deleted or expired are not processed. A
minor compaction results in fewer and larger StoreFiles.
Major compaction is when all StoreFiles are combined into one StoreFile. In this
process, three types of meaningless data are deleted: deleted data, Time To Live
(TTL) expired data, and versions in excess of the configured maximum. Major
compaction takes a long time and consumes substantial system resources, which
can greatly affect upper-layer services. Therefore, automatic triggering of
major compaction is usually disabled for online services, and major compaction
is triggered manually during off-peak hours.
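The merge step of a major compaction can be sketched under simplifying assumptions (KeyValues reduced to (key, ts, value) tuples, a deleted cell modeled as value None, and the maximum version count fixed at 1):

```python
def major_compact(storefiles, now, ttl):
    """Merge sorted StoreFiles into one list, dropping deleted cells,
    TTL-expired cells, and all but the newest version of each key.
    A sketch of major compaction, not HBase's actual KeyValue format."""
    # Sort all cells by key, newest version first within each key.
    cells = sorted((c for f in storefiles for c in f),
                   key=lambda c: (c[0], -c[1]))
    out, last_key = [], None
    for key, ts, value in cells:
        if key == last_key:              # older duplicate version: drop
            continue
        last_key = key
        if value is None:                # tombstone: the cell was deleted
            continue
        if now - ts > ttl:               # TTL expired
            continue
        out.append((key, ts, value))
    return out
```

A minor compaction would merge only some files and skip the tombstone and TTL checks.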
3.1.3.3 OpenScanner
After the corresponding RegionServer and Region of a RowKey are located, a
Scanner must be opened to search for data. Because a Region contains both
MemStores and HFiles, a Scanner needs to be opened for each of them to read
their data respectively. The scanner corresponding to an HFile is
StoreFileScanner; the scanner corresponding to a MemStore is MemStoreScanner.
3.1.3.4 BloomFilter
The Bloom filter was proposed by Bloom in 1970. It is essentially a long binary
vector together with a series of random mapping functions, and it can be used to
test whether an element is in a set. Its advantages are space efficiency and
query time far better than those of ordinary algorithms; its disadvantages are a
certain false-positive rate and difficulty deleting elements. In HBase, row keys
are stored in HFiles. When querying a row key across a series of HFiles, a Bloom
filter can quickly determine whether the row key might be in each HFile. In this
way, most HFiles are filtered out, reducing the number of blocks to be scanned.
Bloom filters are critical to HBase's random read performance: for GET
operations and some SCAN operations, HFiles that cannot contain the key are
skipped, reducing actual I/O and improving random read performance.
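A minimal Bloom filter sketch illustrating the idea (the sizes m and k here are illustrative, not what HBase uses):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash probes into an m-bit array.
    May report false positives, never false negatives."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _probes(self, key):
        # Derive k positions from salted SHA-256 digests of the key.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._probes(key):
            self.bits[p] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[p] for p in self._probes(key))
```

In the HBase case, `might_contain(row_key)` returning False for an HFile's filter means that HFile can be skipped entirely during a GET.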
put: adds data to a specified cell in a table, row, or column.
scan: browses information about a table.
get: obtains the value of a cell based on the table name, row, column,
timestamp, time range, and version number.
drop: deletes a table.
Feature: Flexibility
Hive: Metadata storage is independent of data storage, decoupling metadata and
data.
Traditional data warehouse: Low flexibility. Data can be used for limited
purposes.
Feature: Price
Hive: Open-source product
Traditional data warehouse: Expensive in commercial use
Create a database:
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
Drop a database:
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE]
Load data into a table:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]
Data query language (DQL) is used to perform simple queries, complex queries,
GROUP BY, ORDER BY, JOIN, and more.
Basic SELECT statement:
SELECT [ALL|DISTINCT] select_expr, select_expr, ... FROM table_reference
[WHERE where_condition]
[GROUP BY col_list [HAVING condition]]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY|ORDER BY col_list]]
[LIMIT number]
3.3 Quiz
1. What are the similarities and differences between Hive and RDBMS?
2. Why is it not recommended to use too many column families in HBase?
4.1 Overview
4.1.1 Introduction
ClickHouse is an OLAP column-oriented database management system. It is
independent of the Hadoop big data ecosystem and features an excellent
compression ratio and fast queries. It supports SQL and delivers superior query
performance, especially for aggregation analysis and queries on large, wide
tables. Its query speed is an order of magnitude faster than that of other
analytical databases.
4.1.2 Advantages
The core functions of ClickHouse are as follows:
⚫ Comprehensive DBMS functions
ClickHouse has comprehensive database management functions, including the basic
functions of a DBMS:
Data Definition Language (DDL): allows databases, tables, and views to be
dynamically created, modified, or deleted without restarting services.
Data Manipulation Language (DML): allows data to be queried, inserted, modified,
or deleted dynamically.
Permission control: supports user-based database or table operation permission
settings to ensure data security.
Data backup and restoration: supports data backup, export, import, and restoration,
meeting the requirements of the production environment.
Distributed management: provides the cluster mode to automatically manage
multiple database nodes.
⚫ Column-oriented storage and data compression
Uniqueness: Hash message digests are calculated from indicators such as data
order, rows, and size, preventing blocks from being written repeatedly after
exceptions.
Replicas increase data storage redundancy and reduce the risk of data loss. The
multi-master architecture lets each replica serve as an entry point for reads
and writes, sharing load across nodes.
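The dedup-by-digest idea can be sketched as follows (hypothetical helper names; ClickHouse's actual block checksum format differs):

```python
import hashlib

def block_hash(rows):
    """Digest over a block's row order and contents, used to detect
    re-submission of the same block."""
    h = hashlib.sha256()
    for row in rows:
        h.update(repr(row).encode())
    return h.hexdigest()

class DedupWriter:
    """Skips a block whose digest was already written, so that an insert
    retried after a network error does not duplicate data."""
    def __init__(self):
        self.seen, self.stored = set(), []

    def write_block(self, rows):
        digest = block_hash(rows)
        if digest in self.seen:
            return False              # duplicate block: ignored
        self.seen.add(digest)
        self.stored.extend(rows)
        return True
```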
4.4 Quiz
1. What table engines does ClickHouse offer?
5.1 Overview
5.1.1 Introduction to MapReduce
Hadoop has gone through three major versions since its release: 1.0, 2.0, and
3.0. The most representative are 1.0 and 2.0; compared with 2.0, 3.0 does not
change much. The architecture of Hadoop 1.0 is simple, comprising only HDFS and
MapReduce: the lower layer is the Hadoop Distributed File System (HDFS), an
open-source implementation of the Google File System (GFS), and the upper layer
is the distributed computing framework MapReduce. In Hadoop 1.0, MapReduce does
not separate resource management from job scheduling, so when many jobs are
submitted concurrently the resource scheduler is overloaded and resource
utilization is low. In addition, Hadoop 1.0 does not support heterogeneous
computing frameworks. Hadoop 2.0 introduces YARN, a resource management and
scheduling system, to take over resource management from the computing framework
and schedule resources properly. YARN is compatible with multiple computing
frameworks.
MapReduce is designed and developed based on the MapReduce paper released by
Google. MapReduce is used for parallel computing and offline computing of large-scale
data sets (larger than 1 TB). MapReduce can be understood as a process of summarizing
a large amount of disordered data based on certain features (Map) and processing the
data to obtain the final result (Reduce).
A major advantage of MapReduce is that it has a highly abstract programming idea. To
develop a distributed program, developers only need to use some simple APIs without
considering other details, such as data shards and data transmission. Developers only
need to focus on the program's logic implementation. Another advantage is scalability.
When the data volume reaches a certain level, the existing cluster cannot meet the
computing and storage requirements. In this case, you can add nodes to scale out a
cluster.
MapReduce also features high fault tolerance. In a distributed environment, especially as
the cluster scale increases, the failure rate of the cluster also increases, which may cause
task failures and data loss. Hadoop uses computing or data migration policies to improve
cluster availability and fault tolerance.
However, MapReduce also has limitations. (1) It can only process jobs that can
be divided into multiple independent subtasks. (2) All the data it depends on
comes from files, and the entire computing process involves a large amount of
file I/O, which inevitably limits computing speed and makes MapReduce poorly
suited to real-time data processing.
HCIA-Big Data V3.5 (For Trainees) Page 59
Step 1 File A is stored on HDFS, divided into blocks A.1, A.2, and A.3 stored on
DataNodes #1, #2, and #3.
Step 2 The WordCount program provides user-defined Map and Reduce functions and
submits the analysis application to ResourceManager. ResourceManager then
creates a job based on the request and creates three Map tasks and three
Reduce tasks, each running in a container.
Step 3 Map tasks 1, 2, and 3 each output an MOF file that is partitioned and
sorted but not combined. For details, see the table.
Step 4 Reduce tasks obtain the MOF files from the Map tasks. After combination,
sorting, and user-defined Reduce logic processing, the statistics shown in
the table are output.
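The Map, shuffle, and Reduce flow of WordCount above can be sketched in plain Python (a toy model of the data flow, not the Hadoop API):

```python
from itertools import groupby

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in an input split."""
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    """Reduce: sum all counts collected for one key."""
    return word, sum(counts)

def wordcount(lines):
    # Shuffle: sort intermediate pairs so equal keys become adjacent,
    # then group by key, mirroring the partition/sort step of MapReduce.
    pairs = sorted(kv for line in lines for kv in map_phase(line))
    return dict(reduce_phase(key, (c for _, c in grp))
                for key, grp in groupby(pairs, key=lambda kv: kv[0]))
```

In a real cluster, each map runs on the DataNode holding its block, and each reducer pulls only the partitions assigned to it.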
Step 2 ResourceManager allocates the first container to the applications and asks
NodeManager to start ApplicationMaster in the container.
Step 3 ApplicationMaster registers with ResourceManager so that users can view
the operating status of the application directly. ApplicationMaster then
applies to ResourceManager for resources for each task and monitors the
tasks until they finish (that is, steps 4 to 7 repeat).
Step 5 Once obtaining resources, ApplicationMaster asks NodeManager to start the tasks.
Step 6 After setting an operating environment (including environment variables, JAR file,
and binary programs) for the tasks, NodeManager writes task startup commands
into a script and runs this script to start the tasks.
Step 7 Each task uses RPC to report status and progress to ApplicationMaster so that
ApplicationMaster can restart a task if the task fails. During application running,
you can use RPC to obtain their statuses from ApplicationMaster at any time.
way, administrators can automatically configure and manage the queue in the
administer_queue ACL of the queue.
The capacity scheduler has the following features:
⚫ Capacity assurance: Administrators can set upper and lower limits for the resource
usage of each queue. All applications submitted to the queue share the resources.
⚫ Flexibility: The remaining resources of a queue can be used by other queues that
require resources. If a new application is submitted to the queue, the resources
released by other queues will be returned to the queue.
⚫ Priority scheduling: Queues support task priority scheduling (FIFO by default).
⚫ Multi-tenancy: A cluster can be shared by multiple users or applications.
Administrators can add multiple restrictions to prevent cluster resources from being
exclusively occupied by an application, user, or queue.
⚫ Dynamic update of configuration files: Administrators can dynamically modify
configuration parameters to manage clusters online.
applications that have demanding memory requirements may run on servers with
standard performance. This results in low computing efficiency. Through label-based
scheduling, tasks that consume a large amount of memory are submitted to queues
bound to high-memory labels. In this way, tasks can run on high-memory machines,
improving the cluster running efficiency.
5.5 Quiz
1. If no partitioner is defined, how is data partitioned before being sent to the reducer?
2. What are the differences between combine and merge?
6 Spark — In-memory Distributed Computing Engine & Flink — Stream and Batch
Processing in a Single Engine
6.1.1.3 Highlights
The highlights of Spark are as follows:
⚫ Lightweight: Spark has only about 30,000 lines of core code and is written in
the concise, expressive Scala language.
⚫ Fast: Spark can respond to small data set queries in subseconds. For iterative
machine learning on large data sets and applications such as ad hoc query and
graph computation, Spark is faster than MapReduce, Hive, or Pregel. Spark
achieves this through in-memory computing, data locality, transmission
optimization, and scheduling optimization.
⚫ Flexible: Spark offers flexibility at different levels. Spark uses the Scala trait dynamic
mixing policy (such as replaceable cluster scheduler and serialized library). Spark
allows users to extend new data operators, data sources, and language bindings.
Spark supports a variety of paradigms such as in-memory computing, multi-iteration
batch processing, ad-hoc query, streaming processing, and graph programming.
⚫ Smart: Spark integrates seamlessly with Hadoop and is compatible with the
Hadoop ecosystem. Its graph computing borrows the Pregel and PowerGraph APIs as
well as PowerGraph's vertex-cut idea.
Wide dependency: In extreme cases, all parent RDD partitions need to be
recalculated.
reduceByKey(func, [numTasks]): similar to groupByKey, but the value of each key
is aggregated using the given reduce function func.
⚫ Control: performs RDD persistence. An RDD can be stored on disk or in memory
according to different storage policies. For example, the cache API caches an
RDD in memory by default.
6.1.2.7 DataFrame
Similar to an RDD, a DataFrame is an immutable, resilient, distributed data set.
In addition to the data, it records schema information describing the data's
structure, similar to a two-dimensional table. DataFrame query plans can be
optimized by the Spark Catalyst optimizer. The optimized execution plan is still
logical and cannot be executed directly by Spark, so it must be converted into a
physical plan. In this way, a logically feasible execution plan becomes a plan
that Spark can actually execute.
6.1.2.8 DataSet
DataSet is a typed dataset, for example Dataset[Car] or Dataset[Person].
DataFrame is a special case of DataSet (DataFrame = Dataset[Row]), so a
DataFrame can be converted to a DataSet using the as method. Row is a generic
type in which all table structure information is represented by rows.
processes the data incrementally and continuously and updates the results to the result
set.
TCP sockets) and save the results to external file systems, databases, or real-time
dashboards.
The core idea of Spark Streaming is to split stream computing into a series of short batch
jobs. The batch processing engine is Spark Core. That is, the input data of Spark
Streaming is divided into segments based on a specified time slice (for example, 1
second), each segment is converted into RDDs in Spark, then the DStream conversion in
Spark Streaming is transformed to the RDD conversion in Spark. As a result, the
intermediate results of RDD conversion are saved in the memory.
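The time-slice splitting described above can be sketched in a few lines of pure Python (the (timestamp, value) event format and millisecond interval are illustrative; each returned batch corresponds to one RDD in Spark Streaming):

```python
def micro_batches(events, batch_interval):
    """Split a timestamped event stream into fixed-interval batches.
    events: iterable of (timestamp_ms, value); batch_interval in ms."""
    batches = {}
    for ts, value in events:
        # Integer division assigns each event to its time slice.
        batches.setdefault(ts // batch_interval, []).append(value)
    return [batches[k] for k in sorted(batches)]
```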
Storm is a well-known framework in the real-time computing field. Compared with
Storm, Spark Streaming provides higher throughput; each performs better in
different scenarios.
Use cases of Storm:
⚫ Storm is recommended when even a one-second delay is unacceptable. For example,
a financial system requires real-time financial transaction and analysis.
⚫ If a reliable transaction mechanism and reliability mechanism are required for real-
time computing, that is, data processing must be accurate, Storm is ideal.
⚫ If dynamic adjustment of real-time computing program parallelism is required based
on the peak and off-peak hours, Storm can maximize the use of cluster resources
(usually in small companies with resource constraints).
⚫ If SQL interactive query operations and complex transformation operators do not
need to be executed on a big data application system that requires real-time
computing, Storm is preferred.
If real-time computing, a strong transaction mechanism, and dynamic parallelism
adjustment are not required, Spark Streaming should be considered. Located in the Spark
ecological technology stack, Spark Streaming can seamlessly integrate with Spark Core
and Spark SQL. That is, delay batch processing, interactive query, and other operations
can be performed immediately and seamlessly on immediate data that is processed in
real time. This feature significantly enhances the advantages and functions of Spark
Streaming.
The biggest difference between Flink and other stream computing engines is state
management.
What is a state? For example, when a stream computing system or task is developed for
data processing, data statistics such as Sum, Count, Min, or Max need to be collected.
These values need to be stored. These values or variables can be understood as a state
because they need to be updated continuously. If the data sources are Kafka and
RocketMQ, the read location and offset may need to be recorded. These offset variables
are the states to be calculated.
Flink provides built-in state management. You can store states in Flink instead of storing
them in an external system. This:
⚫ Reduces the dependency of the computing engine on external systems, simplifying
deployment and O&M.
⚫ Significantly improves performance.
If states were kept in an external system such as Redis or HBase, Flink would
have to access them over the network or via RPC; with built-in state, Flink
accesses states only within its own process. In addition, Flink periodically
takes state checkpoints and stores them in a distributed persistent system, such
as HDFS. In case of a failure, Flink resets its state to the last successful
checkpoint and continues processing the stream, with no impact on user data.
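This checkpoint-and-replay idea can be modeled with a toy sketch (not the Flink API; offset-based replay from a source is a simplification of Flink's coordinated snapshots):

```python
import copy

class CountingOperator:
    """Stateful per-key count with periodic checkpoints: on failure, the
    state is reset to the last checkpoint and the stream is replayed
    from the checkpointed source offset."""
    def __init__(self):
        self.counts = {}            # operator state kept in-process
        self.checkpoint = ({}, 0)   # (state snapshot, stream offset)

    def process(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1

    def take_checkpoint(self, offset):
        # Snapshot state together with the source position it reflects.
        self.checkpoint = (copy.deepcopy(self.counts), offset)

    def restore(self):
        state, offset = self.checkpoint
        self.counts = copy.deepcopy(state)
        return offset               # replay the stream from this offset
```

Replaying from the checkpointed offset after a restore yields the same final counts as an uninterrupted run, which is the consistency guarantee the text describes.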
Flink provides the following deployment plans: Local, which indicates local deployment,
Cluster, which indicates cluster deployment, and Cloud, which indicates cloud
deployment.
The runtime layer is Flink's common engine for stream and batch processing. It
receives applications in the form of a JobGraph, a general parallel data flow
containing an arbitrary number of tasks that consume and produce data streams.
Both the DataStream API and DataSet API can generate JobGraphs using a specific
compiling method. The DataSet API uses the optimizer to determine the application
optimization method, while the DataStream API uses the stream builder to perform this
task.
Table API supports query of structured data. Structured data is abstracted into a
relationship table. Users can perform various query operations on the relationship table
through SQL-like DSL provided by Flink. Java and Scala are supported.
The Libraries layer provides function libraries on top of the Flink APIs,
including Table for logical table queries, FlinkML for machine learning, Gelly
for graph processing, and CEP for complex event processing.
6.2.1.3 DataStream
Flink uses class DataStream to represent the stream data in Flink programs. You can
think of DataStream as immutable collections of data that can contain duplicates. The
number of elements in DataStream is unlimited.
6.2.1.4 DataSet
DataSet programs in Flink are regular programs that transform data sets (for example,
filtering, mapping, joining, and grouping). The datasets are initially created by reading
files or from local collections. Results are returned via sinks, which can write the data to
(distributed) files or to standard output (for example, the command line terminal).
⚫ File-based data: readTextFile(path) reads a text file line by line based on the
TextInputFormat read rule and returns the result.
⚫ Collection-based data: fromCollection() creates a data stream from a
collection. All elements in the collection must be of the same type.
⚫ Queue-based data: Data in message queues such as Kafka and RabbitMQ is used as
the data source.
⚫ User-defined source: Data sources are defined by implementing the SourceFunction
API.
A user submits a Flink program to JobClient. JobClient processes, parses, and optimizes
the program, and then submits the program to JobManager. Then, TaskManager runs the
task.
For example, a simple Flink program that counts visitors at a website every hour,
grouped by region continuously, is the following:
If you know that the input data is bounded, you can implement batch processing using
the following code:
If the input data is bounded, the result of the following code is the same as that of the
preceding code:
Checkpoint and state mechanisms: used to implement fault tolerance and stateful
processing.
Watermark mechanism: used to implement the event-time clock.
Window and trigger: used to limit the calculation scope and define when results
are emitted.
On the same stream processing engine, Flink has another mechanism to implement
efficient batch processing.
Backtracking for scheduling and recovery: introduced by Microsoft Dryad and now used
in almost all batch processors.
Special memory data structures for hashing and sorting: part of the data can
spill from memory to disk when necessary.
Optimizer: shortens the time to produce results as much as possible.
The two sets of Flink mechanisms correspond to their respective APIs (DataStream API
and DataSet API). When creating a Flink job, you cannot combine the two sets of
mechanisms to use all Flink functions at the same time.
Flink supports two types of relational APIs: Table API and SQL. Both of these APIs are
used for unified batch and stream processing, which means that relational APIs execute
queries with the same semantics and produce the same results on unbounded real-time
data streams and bounded historical data streams.
The Table API and SQL are becoming the main APIs to be used with unified stream and
batch processing for analytical use cases. The DataStream API is the primary API for
data-driven applications and pipelines.
Figure 6-12 Differences between the processing time and event time
For most streaming applications, it is valuable to have the ability to reprocess historical
data and produce consistent results with certainty using the same code used to process
real-time data.
It is also critical to respect the order in which events occurred, not the order in which they are processed, and to be able to infer when a set of events is (or should be) complete. For example, consider the series of events involved in an e-commerce or financial transaction. These requirements for timely stream processing can be met by using the event timestamps recorded in the data stream instead of the clock of the machine that processes the data.
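As a minimal sketch of this distinction (plain Python, not Flink code; record contents are illustrative), ordering records by the event timestamp they carry makes the result independent of the order in which they happen to arrive for processing:

```python
# Each record carries its own event timestamp; arrival (processing) order differs.
arrived = [
    {"event_time": 3, "value": "payment"},
    {"event_time": 1, "value": "order"},
    {"event_time": 2, "value": "stock-check"},
]

def by_event_time(records):
    """Reorder records by the timestamp recorded in the data itself,
    not by the processing machine's clock or arrival order."""
    return [r["value"] for r in sorted(records, key=lambda r: r["event_time"])]

print(by_event_time(arrived))  # ['order', 'stock-check', 'payment']
```

However the same records are shuffled on arrival, the event-time result stays the same, which is what makes reprocessing historical data deterministic.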
received within the specified time, the window ends and the window calculation is
triggered. Different from the tumbling window and sliding window, the session
window does not require a fixed sliding value or window size. You only need to set
the session gap for triggering window calculation and specify the upper limit of the
inactive data duration.
stream.timeWindow(Time.minutes(1))
A sliding time window of 1 minute that slides every 30 seconds can be defined simply as:
stream.timeWindow(Time.minutes(1), Time.seconds(30))
To define window division rules, you can use the SessionWindows WindowAssigner API provided by Flink. If you've used SlidingEventTimeWindows or TumblingProcessingTimeWindows, you'll be familiar with this API.
In this way, Flink automatically places elements in different session windows based on
the timestamps of the elements. If the timestamp interval between two elements is less
than the session gap, the two elements are in the same session. If the interval between
two elements is greater than the session gap and no element can fill in the gap, the two
elements are placed in different sessions.
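The session-gap rule described above can be sketched as follows (a plain Python illustration of the assignment logic, not Flink's SessionWindows implementation):

```python
def assign_sessions(timestamps, session_gap):
    """Group event timestamps into sessions: a gap larger than
    session_gap between consecutive events starts a new session window."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= session_gap:
            sessions[-1].append(ts)   # within the gap: same session
        else:
            sessions.append([ts])     # gap exceeded: open a new session
    return sessions

# With a gap of 5, events at 1..3 form one session; 10 and 12 form another.
print(assign_sessions([1, 2, 3, 10, 12], session_gap=5))
```

The window has no fixed size: it ends only when no element arrives within the configured gap.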
The following figure shows the watermark of ordered streams (Watermark is set to 0).
input.keyBy(<keyselector>)
  .window(<windowassigner>)
  .allowedLateness(<time>)
  .<windowed transformation>(<window function>);
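The watermark idea for an ordered stream can be sketched in plain Python (illustrative only, not Flink code; the 0 corresponds to the "Watermark is set to 0" case, i.e. no expected out-of-orderness):

```python
def watermarks(event_times, max_out_of_orderness=0):
    """After each event, emit a watermark: the highest event time seen
    so far minus the allowed out-of-orderness. A window ending at time T
    may be triggered once the watermark reaches T."""
    seen_max = float("-inf")
    result = []
    for t in event_times:
        seen_max = max(seen_max, t)
        result.append(seen_max - max_out_of_orderness)
    return result

print(watermarks([1, 2, 3, 4]))      # ordered stream: watermark tracks event time
print(watermarks([1, 3, 2, 4], 1))   # out-of-order stream with 1 unit of slack
```

With slack, the watermark lags the maximum event time, giving late events a chance to arrive before their window is triggered.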
result, the window is closed before they arrive. In this case, you can use any of the
following methods to solve the problem:
⚫ Reactivate the closed windows and recalculate to correct the results (with the Allowed Lateness mechanism).
⚫ Collect the delayed events and process them separately (with the Side Output mechanism).
⚫ Consider delayed events as error messages and discard them.
By default, Flink uses the third method. The other two methods use the Allowed Lateness and Side Output mechanisms, respectively.
6.2.5.1 Checkpointing
Flink provides a checkpoint fault tolerance mechanism to ensure exactly-once semantics. Note that it guarantees exactly-once only for Flink's built-in operators. For sources and sinks, exactly-once can be guaranteed end to end only if these components themselves support the semantics.
Flink provides a checkpoint fault tolerance mechanism based on the asynchronous,
lightweight, and distributed snapshot technology. The snapshot technology allows you to
take global snapshots of task or operator state data at the same point in time. Flink periodically generates checkpoint barriers in the input data stream and uses these barriers to assign the data within each interval to the corresponding checkpoint. When an exception occurs in an application, the state of all operators can be restored from the previous snapshot to ensure data consistency.
For applications with small state, these snapshots are very light-weight and can be taken
frequently without impacting the performance much. During checkpointing, the state is
stored at a configurable place (such as the JobManager node or HDFS).
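The snapshot-and-restore idea can be illustrated with a toy stateful counter (a Python sketch of the concept only; it does not model Flink's barrier protocol or storage backends):

```python
import copy

class CountingOperator:
    """A stateful operator that keeps running counts and can snapshot
    and restore its state, mimicking checkpoint-based recovery."""
    def __init__(self):
        self.state = {}

    def process(self, key):
        self.state[key] = self.state.get(key, 0) + 1

    def checkpoint(self):
        # In Flink the snapshot would be written to the configured
        # state storage (e.g. the JobManager node or HDFS).
        return copy.deepcopy(self.state)

    def restore(self, snapshot):
        self.state = copy.deepcopy(snapshot)

op = CountingOperator()
for k in ["a", "b", "a"]:
    op.process(k)
snap = op.checkpoint()   # state at checkpoint time: {'a': 2, 'b': 1}
op.process("a")          # progress made after the checkpoint...
op.restore(snap)         # ...is discarded when recovering from failure
print(op.state)
```

On recovery, the operator resumes from the last completed snapshot; any input after that point must be replayed to reach exactly-once results.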
⚫ Exactly-once or at-least-once
Exactly-once ensures end-to-end data consistency and prevents data loss and duplicates, but at the cost of Flink performance.
At-least-once applies to scenarios that have high requirements on the latency and
throughput but low requirements on data consistency.
Exactly-once is used by default. You can use the setCheckpointingMode() method to
set the semantic mode.
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
⚫ Checkpointing timeout
Specifies the timeout period of checkpoint execution. Once the threshold is reached, Flink interrupts the checkpoint process and regards the checkpoint as timed out.
This metric can be set using the setCheckpointTimeout method. The default value is 10 minutes.
env.getCheckpointConfig().setCheckpointTimeout(60000)
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500)
env.getCheckpointConfig().setMaxConcurrentCheckpoints(10)
⚫ External checkpoints
You can configure periodic checkpoints to be persisted externally. Externalized
checkpoints write their metadata out to persistent storage and are not automatically
cleaned up when the job fails. This way, you will have a checkpoint around to
resume from if your job fails.
env.getCheckpointConfig().enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
6.2.5.3 Savepoint
A checkpoint can be retained in external media when a job is canceled. Flink also provides another mechanism, the savepoint, to restore job data.
Savepoints are a special implementation of checkpoints; the underlying layer actually uses the checkpointing mechanism. Savepoints are triggered by manual commands, and the results are persisted to a specified storage path. They help users save system state data during cluster upgrade and maintenance, ensuring that the system can be restored to its original computing state after operations such as shutdown for O&M or an application upgrade. In this way, end-to-end exactly-once semantics can be preserved.
Similar to checkpoints, savepoints allow saving state to external media. If a job fails, it
can be restored from an external source. What are the differences between savepoints
and checkpoints?
⚫ Triggering and management: Checkpoints are automatically triggered and managed
by Flink, while savepoints are manually triggered and managed by users.
⚫ Function: Checkpoints allow fast recovery when tasks encounter exceptions, including
network jitter or timeout. On the other hand, savepoints enable scheduled backup
and allow you to stop-and-resume jobs, such as modifying code or adjusting
concurrency.
⚫ Features: Checkpoints are lightweight and can implement automatic recovery from
job failures and are deleted by default after job completion. Savepoints, on the other
hand, are persistent and saved in a standard format. They allow code or
configuration changes. To resume a job from a savepoint, you need to manually
specify a path.
6.2.5.5 MemoryStateBackend
new MemoryStateBackend(int maxStateSize, boolean asynchronousSnapshots)
MemoryStateBackend: The constructor sets the maximum state size and determines whether to take snapshots asynchronously. State is stored in the memory of the TaskManager nodes, that is, the execution nodes. Because memory capacity is limited, the default value of maxStateSize for a single state is 5 MB. Note that maxStateSize must be less than or equal to akka.framesize (10 MB by default). Because the JobManager memory stores the checkpoints, a checkpoint cannot be larger than the JobManager's memory. Recommended scenarios: local testing and jobs that hold little state (for example, ETL jobs), where the JobManager is unlikely to fail or its failure has little impact. MemoryStateBackend is not recommended in production scenarios.
6.2.5.6 FsStateBackend
FsStateBackend(URI checkpointDataUri, boolean asynchronousSnapshots)
FsStateBackend: The construction method is to transfer a file path and determine
whether to perform asynchronous snapshot. The FsStateBackend also holds state data in
the memory of the TaskManager, but unlike MemoryStateBackend, it doesn't have the 5
MB size limit. In terms of the capacity limit, the state size on a single TaskManager
cannot exceed the memory size of the TaskManager and the total size cannot exceed the
capacity of the configured file system. Recommended scenarios: jobs with large state,
such as aggregation at the minute-window level and join, and jobs requiring high-
availability setups.
6.2.5.7 RocksDBStateBackend
RocksDBStateBackend(URI checkpointDataUri, boolean enableIncrementalCheckpointing)
RocksDBStateBackend: RocksDB is a key-value store. As in other key-value storage systems, state is first put into memory; when memory is about to run out, it is written to disk. Note that RocksDBStateBackend does not support synchronous checkpoints; the synchronous snapshot option is not included in its constructor. However, RocksDBStateBackend is currently the only backend that supports incremental checkpoints, meaning that only incremental state changes are written, rather than the full state each time. Checkpoints are stored in an external file system (a local file system or HDFS). The state size of a single TaskManager is limited to the total size of its memory and disk, the maximum size of a single key is 2 GB, and the total size cannot exceed the capacity of the configured file system. Recommended scenarios: jobs with very large state (for example, aggregation at the day-window level), jobs requiring high-availability setups, and jobs that do not require high read/write performance.
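The incremental-checkpoint idea (write only what changed since the last checkpoint, rather than the full state) can be sketched as follows (plain Python; this illustrates the concept only, not RocksDB's SST-file mechanism):

```python
def incremental_checkpoint(previous_state, current_state):
    """Return only the entries added or changed since the last checkpoint,
    plus the keys that were deleted."""
    changed = {k: v for k, v in current_state.items()
               if previous_state.get(k) != v}
    deleted = [k for k in previous_state if k not in current_state]
    return changed, deleted

def apply_increment(state, changed, deleted):
    """Rebuild the full state by applying an increment to a prior snapshot."""
    state = dict(state)
    state.update(changed)
    for k in deleted:
        state.pop(k, None)
    return state

prev = {"a": 1, "b": 2, "c": 3}
curr = {"a": 1, "b": 5, "d": 4}
changed, deleted = incremental_checkpoint(prev, curr)
print(changed, deleted)                         # only the delta is written
print(apply_increment(prev, changed, deleted))  # reconstructs the full state
```

For large state with small per-interval churn, the delta is far smaller than the full snapshot, which is the benefit the text describes.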
6.3 Quiz
1. What are the similarities and differences between MapReduce and Spark?
2. How is the exactly-once semantics implemented in Flink and how is the state stored?
3. What are the three time windows of Flink? What are the use cases?
⚫ Channel: A channel is located between a source and a sink. It caches the events that
have been received by an agent but not been written into another agent or HDFS.
The channel ensures that the source and sink can safely run at different rates. Events are temporarily stored in the channel of each agent and transferred to the next agent or to HDFS. Events are deleted from a channel only after they have been successfully stored in the channel of the next agent or in HDFS. Events can be written
by sources to one or more channels, and then read by one or more sinks. Common
channel types are as follows:
o Memory channel: Messages are stored in the memory, providing high
throughput but not ensuring reliability. This means data may be lost.
o File channel: Data is persisted; however, the configuration is complex. You need
to configure the data directory and checkpoint directory (for each file channel).
o JDBC channel: A built-in Derby database makes events persistent with high
reliability. This channel can be used in place of the file channel, which also
supports data persistence.
⚫ Sink: A sink is a component that receives events from channels and writes them to
the next phase or to their final destination. Sinks continuously poll events in
channels, write them to HDFS or other agents in batches, and then batch remove
them from channels. In this manner, a source can continuously receive events and
write them to channels. The final destinations include HDFS, Logger, Avro, and Thrift.
The following table describes the common sink types.
⚫ Event: The basic unit of data transmission in Flume is the event. An event consists of a byte-array body and an optional header. Headers are not used to transmit the payload; an event ID or universally unique identifier (UUID) can be added to an event to determine routing or carry other structural information. The event body is a byte array containing the actual payload that Flume transmits. If the events come from a text file, the body is usually one line of the file, which is the basic unit of a transaction.
⚫ Channel processor: used to cache the data sent from the source into the channel.
⚫ Interceptor: a simple plug-in component between a source and a channel. Before a
source writes received events to channels, the interceptor can convert or delete the
events. Each interceptor processes only the events received by a given source. The
interceptor can be customized.
⚫ Channel selector: used to transmit and place data into different channels based on
user configurations.
⚫ Sink runner: runs a sink group to drive a sink processor. The sink processor drives
sinks to obtain data from channels.
⚫ Sink processor: used to drive sinks to obtain data from channels by configuring
policies. Currently, the policies include load balancing, failover, and pass-through.
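The event structure and interceptor behavior described above can be sketched as follows (plain Python, not Flume's Java API; the uppercase transform is an invented example of what an interceptor might do):

```python
import uuid

def make_event(body, headers=None):
    """A Flume-style event: a byte-array body plus an optional header map.
    Headers carry metadata such as an event ID, not the payload itself."""
    headers = dict(headers or {})
    headers.setdefault("id", str(uuid.uuid4()))
    return {"headers": headers, "body": body.encode("utf-8")}

def uppercase_interceptor(event):
    """An interceptor may transform an event before it reaches a channel,
    or drop it entirely by returning None."""
    if not event["body"]:
        return None                       # drop empty events
    event["body"] = event["body"].upper()
    return event

event = make_event("one log line", {"source": "syslog"})
print(uppercase_interceptor(event)["body"])
```

A real interceptor sits between the source and the channel processor and sees only the events of the source it is attached to, as the text notes.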
data receiving is abnormal during Flume data transmission. The following figure shows
the process.
7.1.3 Applications
7.1.3.1 Installing and Using Flume
⚫ Download the Flume client: Log in to the MRS Manager and choose Services >
Flume > Download Client.
⚫ Install the Flume client: Decompress the client package and install the client.
⚫ Configure the Flume configuration file for the source, channel, and sink.
⚫ Upload the configuration file: Name the configuration file of the Flume agent
properties.properties, and upload the configuration file.
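As an illustration only (the agent name, component names, and paths below are assumptions for this sketch, not values from this guide), a minimal properties.properties that wires one source, one channel, and one sink might look like:

```properties
# Example agent "a1": tail a log file into HDFS through a memory channel
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hacluster/flume/%Y%m%d
a1.sinks.k1.channel = c1
```

Each source lists the channels it writes to, while each sink is bound to exactly one channel, matching the fan-out rules described earlier.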
hard disk. Instead, data is written directly to the file system logs. Write
operations add data to a file in sequence. Read operations read data directly
from a file.
o Kafka makes messages persistent so that the stored messages can be used again
after the server is restarted. In addition, Kafka supports online or offline
processing and integration with other storage and stream processing
frameworks.
⚫ Message reliability
o In a messaging system, the reliability of messages during production and
consumption is extremely important. In the actual message transmission process,
the following situations may occur: 1. The message fails to be sent. 2. The
message is sent multiple times. 3. Each message is sent successfully and only
once (exactly-once, ideally).
o From the perspective of the producer, after a message is sent, the producer waits
for a response from the broker (the wait time can be controlled by a parameter).
If the message is lost during the process or one of the brokers breaks down, the
producer resends the message.
o From the perspective of the consumer, the broker records an offset value in the
partition, which points to the next message to be consumed by the consumer. If
the consumer receives a message but cannot process it, the consumer can still
find the previous message based on the offset value and process the message
again. The consumer has the permission to control the offset value and perform
any processing of the messages that are persistently sent to the broker.
⚫ Backup mechanism
o The backup mechanism improves the reliability and stability of the Kafka cluster.
With the backup mechanism, if a node in the Kafka cluster breaks down, the
entire cluster is not affected. A cluster whose number of backups is n allows n-1
nodes to fail. Of all the backup nodes, one node serves as the leader node. This
node stores the list of other backup nodes and maintains the status
synchronization between the backup nodes.
⚫ Lightweight
o Kafka brokers are stateless. That is, a broker does not record whether a
message has been consumed; the consumer or group coordinator maintains the
consumption offsets. In addition, the cluster does not need status information
about producers and consumers. Kafka itself is lightweight, and the producer
and consumer client implementations are also lightweight.
⚫ High throughput
o High throughput is the main objective of Kafka design. Kafka messages are
continuously added to files. This feature enables Kafka to fully utilize the
sequential read/write performance of disks. Sequential read/write does not
require the seek time of the disk head. It only requires a small sector rotation
time. The speed is much faster than random read/write. Since Linux kernel 2.2,
there has been zero-copy calling available. The sendfile() function is used to
directly transfer data between two file descriptors. It skips the copy of the user
buffer and establishes a direct mapping between the disk space and the memory.
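The offset-based consumption described under "Message reliability" can be sketched as follows (a plain Python model, not the Kafka client API; names are illustrative):

```python
class PartitionLog:
    """A broker-side partition: an append-only message list.
    The consumer, not the broker, tracks which offset it has consumed."""
    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)

    def read(self, offset):
        return self.messages[offset] if offset < len(self.messages) else None

log = PartitionLog()
for m in ["m0", "m1", "m2"]:
    log.append(m)

offset = 0                      # points to the next message to consume
consumed = []
while (msg := log.read(offset)) is not None:
    consumed.append(msg)
    offset += 1                 # advance only after successful processing

offset = 1                      # rewind: reprocess from an earlier offset
print(consumed, log.read(offset))
```

Because messages are persisted in the log, rewinding the offset lets a consumer reprocess a message it failed to handle, as the text describes.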
cluster configurations, elect a leader, and rebalance when consumers change. Producers
push messages to brokers to publish them. Consumers pull messages from brokers to subscribe to and consume them.
⚫ Record: the basic unit of Kafka communication, also called a message. Each record contains the following attributes:
o Offset: indicates the unique identifier of a message, which can be used to find a
unique message. The corresponding data type is long.
o Message size: indicates the size of a message. The corresponding data type is
int.
o Data: indicates the content of a message, which can be considered as a byte
array.
⚫ Controller: A server in the Kafka cluster, which is used for leader election and various
failovers.
⚫ Broker: A Kafka cluster consists of one or more service instances, which are called
brokers.
⚫ Producer: releases messages to Kafka brokers.
⚫ Consumer: consumes messages and functions as a Kafka client to read messages
from Kafka brokers.
overall consideration of the cluster, Kafka will balance leaders across each instance to
ensure the overall stable performance. In a Kafka cluster, a node can be a leader and also
a follower to another leader. Kafka replications have the following features:
⚫ A replication is based on a partition. Each partition in Kafka has its own primary and
secondary replications.
⚫ The primary replication is called a leader, and the secondary replication is called a
follower. Followers constantly pull new messages from the leader.
⚫ Consumers and producers read and write data from the leader and do not interact
with followers.
In Kafka, data is synchronized between partition replications. A Kafka broker only uses a
single thread (ReplicaFetcherThread) to replicate data from the leader of a partition to
the follower. Actually, the follower (a follower is equivalent to a consumer) proactively
pulls messages from the leader in batches, which greatly improves the throughput.
When a Kafka broker is started, a ReplicaManager is created. ReplicaManager maintains
the link connections between ReplicaFetcherThread and other brokers. The leader
partitions corresponding to the follower partitions in the broker are distributed on
different brokers. These brokers create the same number of ReplicaFetcherThread threads
to synchronize the corresponding partition data. In Kafka, every partition follower (acting as a consumer) reads messages from the partition leader. Each time a follower reads messages, it updates the HW status (High Watermark, which indicates the last message successfully replicated to all partition replicas). Whenever the leader broker for any of a follower's partitions changes, ReplicaManager creates or destroys the corresponding ReplicaFetcherThread.
A new leader needs to be elected when an existing one fails. The new leader must have all the messages committed by the old leader. Kafka replication is neither fully synchronous nor fully asynchronous; instead, it uses an ISR mechanism:
⚫ The leader maintains a replica list that is basically in sync with the leader. The list is
called ISR (in-sync replica). Each partition has an ISR.
⚫ If a follower is too far behind a leader or does not initiate a data replication request
within a specified period, the leader removes the follower from ISR.
⚫ The leader commits messages only when all replicas in ISR send ACK messages to
the leader.
If all replicas stop working, there are two solutions:
⚫ Wait for a replica in the ISR to recover and select it as the leader. This ensures that no data is lost, but may take a long time.
⚫ Select the first replica that recovers (not necessarily in the ISR) as the leader. Data may be lost, but the unavailability time is relatively short.
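The ISR rules above can be sketched as a simplified model (plain Python, not Kafka's implementation; the lag threshold stands in for Kafka's replica-lag configuration):

```python
def shrink_isr(isr, acked_offsets, leader_log_end, max_lag):
    """Remove followers that have fallen too far behind the leader."""
    return [r for r in isr if leader_log_end - acked_offsets[r] <= max_lag]

def committed_offset(leader_log_end, acked_offsets, isr):
    """The leader may commit only up to the highest offset acknowledged
    by every replica still in the ISR (the high watermark)."""
    return min([leader_log_end] + [acked_offsets[r] for r in isr])

acked = {"r1": 10, "r2": 9, "r3": 4}
isr = shrink_isr(["r1", "r2", "r3"], acked, leader_log_end=10, max_lag=3)
print(isr)                                 # the lagging replica is dropped
print(committed_offset(10, acked, isr))    # commit = min ack within the ISR
```

Dropping laggards from the ISR lets the leader keep committing without waiting on slow replicas, while any replica still in the ISR remains a safe leader candidate.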
⚫ The sequence number is made persistent to the replica log. Therefore, if the leader of
the partition fails, other brokers take over. The new leader can still determine
whether the resent message is duplicate.
The overhead of this mechanism is very low: Each batch of messages has only a few
additional fields.
log.retention.hours: Maximum period a log segment is kept before it is deleted. Default: 168. Value range: 1–2147483647. Unit: hour.
log.retention.bytes: Maximum size of log data in a partition; by default, the value is not restricted. Default: -1. Value range: -1–9223372036854775807. Unit: byte.
⚫ Compact
o The log is compacted so that only the latest value for each key is retained.
In the broker configuration, set log.cleaner.enable to true to enable the cleaner.
By default, the cleaner is disabled. Enable the compaction policy by configuring
log.cleanup.policy=compact in the topic configuration. The following figure
shows the compaction details.
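The compact policy (keep only the latest value per key) can be sketched as follows (plain Python; Kafka's cleaner works segment by segment, but the retained result is the same idea):

```python
def compact(log_records):
    """Log compaction: scan the log in order and keep only the most
    recent value written for each key."""
    latest = {}
    for key, value in log_records:   # later writes overwrite earlier ones
        latest[key] = value
    return latest

log = [("user1", "v1"), ("user2", "v1"), ("user1", "v2")]
print(compact(log))  # {'user1': 'v2', 'user2': 'v1'}
```

Compaction keeps the topic usable as a changelog: replaying it yields the latest state of every key without retaining the full history.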
7.3 Quiz
1. What are the functions of the source, sink, and channel of Flume?
2. What are the functions of ZooKeeper for Kafka?
8.1 Overview
In our daily lives, we use search engines to search for movies, books, goods on e-
commerce websites, or resumes and positions on recruitment websites. Elasticsearch is
often first brought up when we talk about the search function during project
development.
In recent years, Elasticsearch has developed rapidly and surpassed its original role as a
search engine. It has added the features of data aggregation analysis and visualization. If
you need to locate desired content using keywords in millions of documents,
Elasticsearch is the best choice.
Elasticsearch is a high-performance Lucene-based full-text search service. It is a
distributed RESTful search and data analysis engine and can also be used as a NoSQL
database.
⚫ It extends Lucene and provides a query language even richer than Lucene's. The configurable and scalable Elasticsearch optimizes query performance and provides comprehensive function management GUIs.
⚫ The prototype environment and production environment can be seamlessly switched.
Users can communicate with Elasticsearch in the same way regardless of whether it
runs on one node or runs on a cluster containing 300 nodes.
⚫ Elasticsearch supports horizontal scaling. It can process a huge number of events per
second, and automatically manage indexes and query distribution in the cluster,
facilitating operations.
⚫ Elasticsearch supports multiple data formats, including numerical, text, location data,
structured data, and unstructured data.
Logstash: real-time data transmission pipeline for log processing. It transmits data from
the input end to the output end. It provides powerful filters to meet personalized
requirements.
Kibana: open-source analysis and visualization platform for data on Elasticsearch. Users
can obtain results required for upper-layer analysis and visualization. Developers or O&M
personnel can easily perform advanced data analysis and view data in charts, tables, and
maps.
Beats: platform dedicated to data transmission. It can seamlessly transmit data to Logstash or Elasticsearch. Its lightweight agents are installed on the monitored servers, similar to the agents deployed by Ambari or CDH Manager during Hadoop cluster installation. Beats sends data from hundreds or thousands of computers to Logstash or Elasticsearch.
Elasticsearch-hadoop: a subproject officially maintained by Elasticsearch that integrates Hadoop with Elasticsearch. Data can be moved in both directions between Hadoop and Elasticsearch, and with the parallel computing of MapReduce, real-time search becomes available for HDFS data.
Elasticsearch-sql: uses SQL statements to perform operations on Elasticsearch, instead of
writing complex JSON queries. Currently, Elasticsearch-sql has two versions: one is the
open-source nlpchina/Elasticsearch-sql plugin promoted in China many years ago, and
the other is the Elasticsearch-sql officially supported after Elasticsearch 6.3.0 was released
in June 2018.
Elasticsearch-head: client tool for Elasticsearch. It is a GUI-based cluster operation and
management tool that is used to perform simple operations on clusters, and is a frontend
project based on Node.js.
Bigdesk: cluster monitoring tool for Elasticsearch. Users can use it to view the status of
the Elasticsearch cluster, such as the CPU and memory usages, index data, search status,
and number of HTTP connections.
of Lucene. Index storage is implemented through local files, shared files, and HDFS. The
following figure shows the internal architecture of Elasticsearch.
Discovery layer: This module is responsible for automatic node discovery and master
node election in a cluster. Nodes communicate with each other in P2P mode, eliminating
single points of failure. In Elasticsearch, the master node maintains the global status of
the cluster. For example, if a node is added to or removed from the cluster, the master
node reallocates shards.
Script layer: Elasticsearch query supports multiple script languages, including MVEL, JS,
and Python.
Transport layer: the interaction mode between Elasticsearch internal nodes or clusters and the client. By default, internal nodes interact over TCP. Other transmission protocols, integrated using plugins, such as HTTP (JSON format), Thrift, Servlet, Memcached, and ZeroMQ, are also supported.
RESTful interface layer: The top layer is the access interface that Elasticsearch exposes. The recommended approach is the RESTful interface, which accepts HTTP requests directly and makes it easy to use Nginx as a proxy and for request distribution. Permission management may also be added in the future, and such management is easy to implement over HTTP.
not random, or we would not know where to find the documents in the future when we
want to get them.
Elasticsearch provides two routing algorithms:
⚫ Default route: shard = hash(routing) % number_of_primary_shards. This routing policy is limited by the number of shards: during capacity expansion, the number of shards needs to be multiplied (Elasticsearch 6.x), and when creating an index, you need to plan for the capacity to be expanded in the future. Note that Elasticsearch 5.x does not support capacity expansion, while Elasticsearch 7.x supports free expansion.
⚫ Custom route: In this routing mode, the routing can be specified to determine the
shard to which a document is written, or search for a specified shard.
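The default routing formula can be sketched as follows (plain Python with a stand-in hash function; Elasticsearch actually applies murmur3 to the routing value, which defaults to the document `_id`):

```python
import hashlib

def shard_for(routing_value, number_of_primary_shards):
    """shard = hash(routing) % number_of_primary_shards.
    (md5 is a stand-in here; Elasticsearch uses murmur3 internally.)"""
    h = int(hashlib.md5(routing_value.encode("utf-8")).hexdigest(), 16)
    return h % number_of_primary_shards

# The same routing value always maps to the same shard. This is why the
# primary shard count cannot change freely: changing the divisor would
# re-route existing documents to different shards.
print(shard_for("doc-42", number_of_primary_shards=5))
```

The modulo dependency on the shard count is exactly the expansion limitation the text describes.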
8.4 Quiz
1. How does Elasticsearch implement master election?
2. Describe the data write process of Elasticsearch.
Cost: A data lake has a low initial cost but a high subsequent cost; a data warehouse has a high initial cost but a low subsequent cost.
Data quality: A data lake holds massive raw data that must be cleaned and normalized before use; a data warehouse holds high-quality data that can be used as a basis of facts.
⚫ What is a lakehouse?
o Although the application scenarios and architectures of a data warehouse and a
data lake are different, they can cooperate to resolve problems. A data
warehouse stores structured data and is ideal for quick BI and decision-making
support, while a data lake stores data in any format and can generate larger
value by mining data. Therefore, their convergence can bring more benefits to
enterprises in some scenarios.
o A lakehouse, the convergence of a data warehouse and a data lake, aims to
enable data mobility and streamline construction. The key to the lakehouse
architecture is enabling the free flow of data and metadata between the data
warehouse and the data lake. Explicit-value data in the lake can flow to, or
even be directly used by, the warehouse, while implicit-value data in the
warehouse can flow to the lake for low-cost long-term storage and future data
mining.
200+ cloud services and 190+ solutions, serving many well-known enterprises around the
world.
⚫ Real-time, incremental data updates; offline and real-time data warehouses over the same architecture
o Real-time data import and analysis are available.
o One copy of data can be imported to the database in real time and analyzed from multiple dimensions.
o The offline data warehouse can be seamlessly upgraded to a real-time one, allowing for converged batch and stream analysis.
⚫ Decoupled storage and compute, EC ratio as low as 1.2, over 20% TCO reduction
⚫ Integrated data lake and warehouse
o The in-lake interactive engine outperforms same-class products by over 30%. You can generate BI reports using the data in the lake through a self-service graphical interface.
o Convergence of batch, streaming, and interactive data analysis via a unified SQL interface.
o Collaborative computing across MRS and GaussDB(DWS), with no need to move data around.
9.2 Components
9.2.1 Hudi
9.2.1.1 Introduction to Hudi
Hudi is an open-source project that entered the Apache incubator in 2019 and became a top-level Apache project in 2020. Huawei joined Hudi community development in 2020 and uses Hudi in FusionInsight.
Hudi is the file organization layer of the data lake. It manages Parquet files, provides data lake capabilities and insert/update/delete (IUD) APIs, and supports multiple compute engines.
⚫ Incremental view (for stream-batch convergence): data that is incrementally written into Hudi is continuously read in a way similar to CDC.
⚫ Real-time view (for frequent updates): the latest data is read; the base file and delta files are merged during the read.
9.2.2 HetuEngine
9.2.2.1 Introduction to HetuEngine
HetuEngine is a self-developed high-performance engine for interactive SQL analysis and
data virtualization. It seamlessly integrates with the big data ecosystem to implement
interactive query of massive amounts of data within seconds, and supports cross-source
and cross-domain unified data access to enable one-stop SQL convergence analysis in the
data lake, between lakes, and between lakehouses.
9.2.3 Ranger
9.2.3.1 Ranger
Apache Ranger offers a centralized security management framework and supports
unified authorization and auditing. It manages fine-grained access control over Hadoop
and related components, such as HDFS, Hive, HBase, Kafka, and Storm. Users can use the
front-end web UI provided by Ranger to configure policies to control users' access to
these components.
9.2.4 LDAP
In 1988, the International Telegraph and Telephone Consultative Committee (CCITT) developed the X.500 standard and updated it in 1993 and 1997. The X.500 standard features a complete functional design and flexible scalability. It defines a comprehensive directory service, including the information model, namespace, functional model, access control, directory replication, and directory protocols, and it quickly became the standard that all directory servers comply with. X.500 defines the Directory Access Protocol (DAP) for communication between a client and a server of a directory service. However, because of DAP's complex structure and its strict compliance with the OSI seven-layer protocol model at runtime, DAP is difficult to deploy in many small environments and usually runs only in UNIX environments. The University of Michigan therefore developed the TCP/IP-based Lightweight Directory Access Protocol (LDAP) on the basis of the X.500 standard.
The LDAP client and server are implemented independently of each other, so users can
access various LDAP servers through different LDAP clients. LDAP currently has two
versions: LDAP v2 and LDAP v3. Most directory servers comply with LDAP v3 to provide
various query services for users. LDAP is only a protocol and does not define data
storage, so a backend database component is required to implement storage. These
backends can be Berkeley DB, shell, passwd, or others.
LDAP has the following characteristics:
⚫ Flexibility: LDAP is not only an X.500-based access protocol; it can also be
implemented as an independent, flexible directory system.
⚫ Cross-platform: LDAP is a cross-platform and standard-based protocol and can serve
various operating systems such as Windows, Linux, and Unix.
⚫ Low cost and easy configuration management: LDAP runs on TCP/IP or other
connection-oriented transmission services. Most LDAP servers are easy to install and
configure, respond quickly, and seldom require maintenance during long-term use.
⚫ Access Control List (ACL): LDAP can use ACLs to control access to directories. For
example, an administrator can specify the permissions of members in a given group
or location, or grant specific users the right to modify selected fields in their own
records.
⚫ LDAP is an Internet Engineering Task Force (IETF) standard track protocol and is
specified in RFC 4510 on Lightweight Directory Access Protocol (LDAP): Technical
Specification Road Map.
⚫ Distributed: LDAP manages physically distributed data in a unified manner and
ensures logical integrity and consistency of the data. LDAP implements distributed
operations through client APIs, thereby balancing load.
As a directory service system, the LDAP server provides centralized user account
management. It has the following advantages:
⚫ When facing a large, fast-growing number of users, the LDAP server makes it easier
for you to manage user accounts, specifically, to create accounts, reclaim accounts,
manage permissions, and audit security.
⚫ The LDAP server makes it secure to access different types of systems and databases
across multiple layers. All account-related management policies are configured on
the server, implementing centralized account maintenance and management.
⚫ The LDAP server fully inherits and utilizes the identity authentication functions of
existing account management systems on the platform. It separates account
management from access control, thus improving access authentication security of
the big data platform.
On the Huawei big data platform, an LDAP server functions as a directory service system
to implement centralized account management. An LDAP server is a directory service
system that consists of a directory database and a set of access protocols. It has the
following characteristics:
⚫ The LDAP server is based on the open-source OpenLDAP technology. (The OpenLDAP
project is maintained by a team of volunteers.)
⚫ The LDAP server uses Berkeley DB as the default backend database.
⚫ The LDAP server is an open-source implementation of the LDAP protocol.
The LDAP server directory service system consists of basic models such as organizational
models and function models.
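LDAP's information model organizes entries in a tree, where each entry is identified
by a distinguished name (DN) whose components trace the path to the root. The sketch
below models such a directory information tree as a Python dictionary; it is a toy
illustration of the naming and search model, not a real LDAP client such as OpenLDAP's
tools, and all DNs and attributes are invented.

```python
# Toy directory information tree keyed by distinguished name (DN).
# A DN like "uid=alice,ou=people,dc=example,dc=com" names an entry by
# listing its path from leaf to root.
directory = {
    "dc=example,dc=com": {"objectClass": "domain"},
    "ou=people,dc=example,dc=com": {"objectClass": "organizationalUnit"},
    "uid=alice,ou=people,dc=example,dc=com": {"objectClass": "person", "cn": "Alice"},
    "uid=bob,ou=people,dc=example,dc=com": {"objectClass": "person", "cn": "Bob"},
}

def search(base_dn, filter_attr, filter_val):
    """Subtree search: entries under base_dn whose attribute matches."""
    return [dn for dn, attrs in directory.items()
            if dn.endswith(base_dn) and attrs.get(filter_attr) == filter_val]

print(search("ou=people,dc=example,dc=com", "objectClass", "person"))
```

Because every entry's DN ends with its ancestors' DNs, a subtree search reduces to a
suffix match on the name plus an attribute filter, which is the essence of an LDAP
search operation.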
9.2.5 Kerberos
Kerberos is an authentication concept named after the ferocious three-headed guard dog
of Hades from Greek mythology. The Kerberos protocol adopts a client–server model
and cryptographic algorithms such as Data Encryption Standard (DES) and Advanced
Encryption Standard (AES). Furthermore, it provides mutual authentication, so that the
client and server can verify each other's identity. As a trusted third-party authentication
service, Kerberos performs identity authentication based on shared keys.
The Kerberos protocol was developed by the Massachusetts Institute of Technology and
was originally created to protect network services in MIT's Project Athena. Steve
Miller and Clifford Neuman released Kerberos version 4 in the late 1980s, mainly for
Project Athena. In 1993, John Kohl and Clifford Neuman released Kerberos version 5,
which resolved the limitations and security issues of version 4. In 2005, the Kerberos
working group of the Internet Engineering Task Force (IETF) updated the protocol
specifications. Currently, Kerberos version 5 is the mainstream network identity
authentication protocol. Windows 2000 and later operating systems use Kerberos as the
default authentication method. Apple Mac OS X uses the Kerberos client and server
versions, as do Red Hat Enterprise Linux 4 and later operating systems.
The Huawei big data platform builds the KrbServer identity authentication system based
on the Kerberos protocol and provides security authentication functions for all
open-source components, preventing eavesdropping and replay attacks and protecting
data integrity. KrbServer manages keys using a symmetric key mechanism.
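The shared-key idea behind Kerberos can be sketched with Python's standard library.
This is a greatly simplified illustration, not the real protocol: real Kerberos
encrypts tickets and authenticators with DES or AES, while here an HMAC tag merely
stands in for proof that only a holder of the shared key could have produced the
message. All principal names and functions are invented for illustration.

```python
import hashlib
import hmac
import os
import time

key_client = os.urandom(16)   # shared between the client and the KDC
key_service = os.urandom(16)  # shared between the service and the KDC

def kdc_issue_ticket(principal):
    """KDC issues a 'ticket' bound to key_service; the client holds it
    but cannot forge or alter it without the service's key."""
    payload = f"{principal}|{int(time.time())}".encode()
    tag = hmac.new(key_service, payload, hashlib.sha256).hexdigest()
    return payload, tag

def service_verify(payload, tag):
    """The service validates the ticket with its own shared key,
    without ever contacting the KDC."""
    expected = hmac.new(key_service, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)

ticket, tag = kdc_issue_ticket("alice@EXAMPLE.COM")
print(service_verify(ticket, tag))       # True
print(service_verify(b"forged|0", tag))  # False
```

The timestamp embedded in the payload hints at how real Kerberos limits replay
attacks: a service rejects tickets whose timestamps fall outside an allowed skew.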
Real-time data lake = Original batch processing solution + Original real-time stream
processing solution + Original interactive query solution + New Flink SQL batch-stream
integration engine
The real-time data lake and the offline data lake are the same data lake: only one copy
of data is stored, which is a basic requirement of a real-time data lake. Many vendors
store real-time and offline data separately; even though the data resides in the same
HDFS, users still face two lakes. For example, in a solution combining a MaxCompute
offline data lake with an E-MapReduce real-time data lake, there are two lakes and two
copies of data. For users, the real-time data lake is mandatory while a specialized
data mart is optional: if the real-time data lake meets the requirements, there is no
need to build a specialized mart.
Core requirements of the real-time data lake:
⚫ Large-scale cluster: The cluster must have more than 100 nodes to handle more than
1 PB of data.
⚫ High-concurrency interactive query: Data in the data lake can be queried within 2
seconds under hundreds of concurrent queries.
⚫ Intra-lake update operation: In addition to common query and append operations,
update operations are also supported during offline and real-time data processing,
that is, lakehouse.
⚫ One copy of data storage: Only one copy of data is stored to support multiple
types of analysis; the offline and real-time data lakes must not each keep a
separate copy of the data.
⚫ Data permission control and resource isolation (multi-tenant): Multiple offline and
real-time processing jobs run at the same time. Different data permissions and
resource scheduling policies are required to prevent unauthorized access or resource
preemption.
⚫ Compatibility with open-source APIs: Customers usually have existing processing
applications that need to be migrated to the real-time data lake.
⚫ Rolling upgrade: Offline and real-time processing is the basis of the customer's big
data system. The customer demands that the system is upgraded without service
interruptions.
⚫ Job scheduling management: Multiple offline and real-time jobs have different
priorities and different running time. Therefore, multiple scheduling policies are
required to monitor abnormal and failed jobs.
⚫ Heterogeneous devices: During capacity expansion, customers can add new hardware
models and manage the new and legacy devices separately.
⚫ Interconnection with third-party software (visualization, analysis and mining, report,
and metadata): Multiple third-party tools can be interconnected to facilitate further
data analysis and management.
⚫ Real-time data import into the lake: It takes less than 15 minutes from data
generation to import into the lake.
⚫ Multiple offline and real-time data sources, including traditional file, database
synchronization, and message queue data.
⚫ Build an offline data lake to centrally store video, image, text, and IoT data,
providing a real-time computing area and real-time data processing capabilities.
Customer benefits:
⚫ A one-stop platform that holds all-domain data is provided for people to handle
healthcare business. The intensive construction reduces data silos and TCO by 30%.
⚫ Medical insurance reimbursement efficiency is tripled, and the manual review
workload and error rate are reduced by 80%. People need to visit the office only
once to handle a given service matter.
⚫ The real-time data computing capability effectively closes the loopholes that can
breed medical insurance reimbursement violations and insurance fraud, recovering
XX00 million in economic losses every year and ensuring the sound development of the
medical benefits fund (MBF).
9.4 Quiz
What are the advantages of MRS compared with self-built Hadoop?
⚫ DataArts Architecture
DataArts Architecture helps users plan the data architecture, customize models, unify
data standards, visualize data modeling, and label data. DataArts Architecture
defines how data will be processed and utilized to solve business problems and
enables users to make informed decisions.
⚫ DataArts Factory
DataArts Factory helps users build a big data processing center, create data models,
integrate data, develop scripts, and orchestrate workflows.
⚫ DataArts Quality
DataArts Quality monitors the data quality in real time with data lifecycle
management and generates real-time notifications on abnormal events.
⚫ DataArts Catalog
DataArts Catalog provides enterprise-class metadata management to clarify
information assets. A data map displays data lineages and data assets for intelligent
data search, operations, and monitoring.
⚫ DataArts DataService
DataArts DataService is a platform where users can develop, test, and deploy their
data services. It ensures agile response to data service needs, easier data retrieval,
better experience for data consumers, higher efficiency, and better monetization of
data assets.
⚫ DataArts Security
DataArts Security provides all-round protection for users' data. Users can use it to
detect sensitive data, grade and classify data, safeguard data privacy, control data
access permissions, encrypt data during transmission and storage, and identify data
risks. DataArts Security is an efficient tool for users to establish security warning
mechanisms. It can greatly improve users' overall security protection capability,
securing their data and making their data more accessible.
DataArts Migration can automatically archive the data that fails to be processed
during migration, has been filtered out, or is not compliant with conversion or
cleaning rules to dirty data logs for users to analyze. A threshold for dirty data ratio
can be set to determine whether a task is successful.
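The dirty-data threshold check described above can be sketched as a simple ratio
comparison; the function name and fields below are illustrative only and are not the
DataArts Migration API.

```python
# Hypothetical sketch: a migration task passes only if the share of
# dirty rows (failed, filtered out, or non-compliant) stays at or
# below the configured threshold.
def migration_succeeded(total_rows, dirty_rows, max_dirty_ratio):
    """Return True if the dirty-data ratio is within the threshold."""
    if total_rows == 0:
        return True  # nothing processed, nothing dirty
    return dirty_rows / total_rows <= max_dirty_ratio

print(migration_succeeded(10_000, 50, 0.01))   # True  (0.5% dirty)
print(migration_succeeded(10_000, 200, 0.01))  # False (2% dirty)
```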
DataArts Architecture can help users create process-based, systematic data standards
that fit their needs. Aligned with national and industry standards, these standards
enable users to standardize their enterprise data and improve data quality, ensuring
that their data is trusted and usable.
⚫ Data modeling
Data modeling involves building unified data model systems. DataArts Architecture
can be used to build a tiered, enterprise-class data system based on data
specifications and models. The system incorporates data from the public layer and
subject libraries, significantly reducing data redundancy, silos, inconsistency, and
ambiguity. This allows freer flow of data, better data sharing, and faster innovation.
DataArts Architecture provides the following data modeling methods:
⚫ ER modeling
Entity Relationship (ER) modeling involves describing the business activities of an
enterprise, and ER models are compliant with the third normal form (3NF). ER
models can be used for data integration, which merges and classifies data from
different systems by similarity or subject. However, ER models are not directly
suitable for analytical decision-making.
⚫ Dimensional modeling
Dimensional modeling involves constructing bus matrices to extract business facts
and dimensions for model creation. Users need to sort out business requirements for
constructing metric systems and creating summary models.
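The fact-and-dimension structure at the heart of dimensional modeling can be
illustrated with a toy star schema in Python; the tables, keys, and field names below
are invented for illustration and do not come from DataArts Architecture.

```python
# A dimension table describes "who/what" (here, products), while the
# fact table records measurable business events (here, sales amounts)
# keyed to the dimension.
dim_product = {1: {"name": "laptop", "category": "electronics"},
               2: {"name": "desk",   "category": "furniture"}}
fact_sales = [{"product_id": 1, "amount": 1200},
              {"product_id": 1, "amount": 900},
              {"product_id": 2, "amount": 300}]

def revenue_by_category():
    """Summary model: aggregate the fact table along a dimension attribute."""
    totals = {}
    for row in fact_sales:
        cat = dim_product[row["product_id"]]["category"]
        totals[cat] = totals.get(cat, 0) + row["amount"]
    return totals

print(revenue_by_category())  # {'electronics': 2100, 'furniture': 300}
```

Summary models like this roll the fine-grained facts up along dimension attributes,
which is exactly what a metric system built on a bus matrix formalizes.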
Job scheduling supports a variety of hybrid orchestration tasks of cloud services. The
high-performance scheduling engine has been verified by hundreds of applications.
⚫ O&M and monitoring
Users can run, suspend, restore, and terminate a job.
Users can view the details of each job and each node in the job.
Users can use various methods to receive notifications when a job or task error
occurs.
DataArts Studio has a wide range of scheduling configuration policies and powerful
job scheduling capability. It supports online collaborative development among
multiple users, online editing and real-time query of SQL and shell scripts, and job
development via data processing nodes such as CDM, SQL, MRS, Shell, and Spark.
⚫ Unified scheduling and O&M
Fully hosted scheduling based on time and event triggering mechanisms enables task
scheduling by minute, hour, day, week, or month.
The visualized task O&M center monitors all tasks and supports alarm notification,
enabling users to obtain real-time task status and ensuring normal running of
services.
⚫ Reusable industrial knowledge bases
DataArts Studio provides vertical industries with reusable knowledge bases, including
data standards, domain models, subject libraries, algorithm libraries, and metric
libraries, and supports fast customization of E2E data operations solutions for
industries such as smart government, smart taxation, and smart campus.
⚫ Unified data asset management
DataArts Studio allows users to have a global view of data assets, facilitating fast
asset query, intelligent asset management, data source tracing, and data openness.
In addition, it enables users to define business data catalogs, terms, and
classifications, and to access assets in a unified manner.
⚫ Visualized data operations in all scenarios
Visualized data governance requires only drag-and-drop operations without coding;
visualized processing results facilitate interaction and exploration; visualized data
asset management supports drill-down and source tracing.
⚫ All-round security assurance
Unified security authentication, tenant isolation, data grading and classification, and
data lifecycle management ensure data privacy, auditability, and traceability. Role-
based access control allows users to associate roles with permissions and supports
fine-grained permission policies, meeting different authorization requirements.
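The time-triggered scheduling described above (by minute, hour, day, week, or month)
can be sketched as a fixed-interval next-run computation; the function and interval
table below are illustrative simplifications, not the DataArts Studio API, and they
ignore calendar subtleties such as month lengths.

```python
from datetime import datetime, timedelta

# Fixed intervals for the simple schedule units; "month" is omitted
# because months have variable lengths and need calendar arithmetic.
INTERVALS = {"minute": timedelta(minutes=1),
             "hour":   timedelta(hours=1),
             "day":    timedelta(days=1),
             "week":   timedelta(weeks=1)}

def next_run(last_run, unit):
    """Next trigger time for a fixed-interval, time-triggered schedule."""
    return last_run + INTERVALS[unit]

t = datetime(2022, 5, 1, 8, 0)
print(next_run(t, "day"))   # 2022-05-02 08:00:00
print(next_run(t, "hour"))  # 2022-05-01 09:00:00
```

Event triggering, the other mechanism mentioned above, would instead fire a job when
an upstream signal arrives (for example, a message or a file landing) rather than at a
computed time.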
10.3 Quiz
What functions does DataArts Studio provide?
11 Summary
This document introduces basic big data knowledge, including components such as HDFS,
Hive, HBase, ClickHouse, MapReduce, Spark, Flink, Flume, and Kafka, how to use MRS,
and MRS operations and development. It also introduces big data solutions and data
mining, which will be described in detail in the latest HCIP and HCIE certification courses.
In addition, you can obtain more learning resources in the following ways:
1. Huawei Talent: https://e.huawei.com/en/talent/portal/#/
2. Huawei iLearningX platform: https://ilearningx.huawei.com/portal/
3. WeChat official accounts:
⚫ Huawei Certification
⚫ Huawei HCIE Elite Salon
⚫ Huawei ICT Academy