Module 08 Flink – Stream Processing and Batch Processing Platform
www.huawei.com
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 2
Contents
1. Flink Overview
Flink Overview
Flink is a unified computing framework that supports both batch processing
and stream processing. Its core is a streaming data processing engine that
provides data distribution and parallel computing. Flink treats stream
processing as its primary model and is one of the leading open-source
stream processing engines in the industry.
Flink, similar to Storm, is an event-driven real-time streaming system.
Key Features of Flink
Streaming-first
- Stream processing engine
Fault-tolerant
- Reliability and checkpoint mechanism
Scalable
- Scaling out to over 1000 nodes
Excellent performance
- High throughput and low latency
Application Scenarios of Flink
Flink provides high-concurrency data processing, millisecond-level
latency, and high reliability, which makes it well suited to
low-latency data processing scenarios.
Typical scenarios:
Internet finance services
Clickstream log processing
Public opinion monitoring
Key Features of Flink
Low Latency
Millisecond-level processing capability.
Exactly Once
Asynchronous snapshot mechanism, ensuring that each record affects the
application state exactly once, even after a failure.
HA
Active/standby JobManagers, preventing single points of failure (SPOFs).
Scale-out
Manual scale-out supported by TaskManagers.
Hadoop Compatibility
Flink supports Yarn and can obtain data from the Hadoop distributed file
system (HDFS) and HBase.
Flink supports all Hadoop input and output formats.
Flink supports Hadoop Mappers and Reducers, which can be used
together with Flink operations.
Flink can often run Hadoop jobs faster.
Performance Comparison of Stream Computing Frameworks
Flink Architecture
Flink Technology Stack
Core Concept of Flink - DataStream
DataStream: Flink uses DataStream to represent data streams in applications. A
DataStream can be regarded as an immutable collection of data that may contain
duplicates. The number of elements in a DataStream can be unbounded.
DataStream
Data source: the origin of the streaming data, such as HDFS files,
Kafka data, or text.
Transformations: the operations that convert one stream into another.
Data sink: the destination of the output data, such as HDFS files,
Kafka data, or text.
Data Source of Flink
Batch processing:
Files (HDFS, local file system, and MapR file system; Text, CSV, Avro, and
Hadoop input formats)
JDBC
HBase
Collections
Stream processing:
Files
Socket streams
Kafka
RabbitMQ
Flume
Collections
Implement your own SourceFunction.collect
DataStream Transformations
Common transformations:
public <R> SingleOutputStreamOperator<R> map(MapFunction<T, R> mapper)
DataStream Transformations
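[Figure: example dataflow in six numbered steps. Data is read from HDFS with
textFile, passes through the flatMap, map, keyBy, and Window/Join operators,
and is written back to HDFS with writeAsText.]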
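The stages named on this slide can be approximated in plain Java to show what each one contributes. The sketch below is only an analogy built on java.util.stream over a small bounded list. It does not use the Flink API, and the class name PipelineSketch and the method wordCount are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration: source -> flatMap -> map -> keyBy -> aggregate -> sink
// on a bounded in-memory list. This is plain Java, not the Flink API.
final class PipelineSketch {

    // Counts how often each word occurs, mirroring the stages named on the slide.
    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()                                        // source (stands in for textFile)
                .flatMap(line -> Arrays.stream(line.split("\\s+")))  // flatMap: one line becomes many words
                .filter(word -> !word.isEmpty())                     // drop empty tokens from leading spaces
                .map(String::toLowerCase)                            // map: transform each element
                .collect(Collectors.groupingBy(                      // keyBy: group elements by key
                        word -> word,
                        TreeMap::new,
                        Collectors.counting()));                     // aggregate per key
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not", "To be");
        // sink: print each result instead of calling writeAsText
        wordCount(lines).forEach((word, count) -> System.out.println(word + " " + count));
    }
}
```

Unlike this bounded example, a Flink job keeps running on an unbounded stream, which is why aggregation there is normally scoped by a window.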
Flink Application Running Process - Key Roles
Client:
Initiates requests. The client submits application requests and creates the
dataflow.
JobManager:
Manages the resources for applications. The JobManager applies to the
ResourceManager for resources based on the requirements of
applications.
ResourceManager of Yarn:
Acts as the resource manager of the cluster. It schedules and allocates the
resources of the entire cluster in a unified manner.
TaskManager:
Performs the computing work. An application is split into tasks that are assigned
to multiple TaskManagers for computing.
Flink Job Running Process
Flink on Yarn
Technical Principles of Flink (1)
A Flink application consists of streaming data and transformation operators.
Conceptually, a stream is a (potentially never-ending) flow of data records,
and a transformation is an operator that takes one or more streams as input,
and produces one or more output streams as a result.
Technical Principles of Flink (2)
The source operator is used to load streaming data. Transformation operators,
such as map(), keyBy(), and apply(), are used to process streaming data. After
streaming data is processed, the sink writes the processed streaming data into
related storage systems, such as HDFS, HBase, and Kafka.
[Figure: Streaming Dataflow. The Source, map(), keyBy()/apply(), and Sink
operators are connected by streams.]
Parallel DataStream of Flink
[Figure: the streaming dataflow in condensed view (Source, map(),
keyBy()/apply(), Sink) and in parallelized view. In the parallelized view,
Source, map(), and keyBy()/apply() each run as two operator subtasks, [1] and
[2] (parallelism = 2), connected by stream partitions, and Sink runs as a
single subtask [1] (parallelism = 1).]
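One way to picture how keyBy() feeds parallel subtasks is that every record with the same key must reach the same subtask. The sketch below assumes a simple hash-modulo rule for illustration. Flink's own key distribution is more elaborate, and the names KeyPartitionSketch, subtaskFor, and partition are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of routing keyed records to parallel subtasks.
// Assumption: a plain hash-modulo rule, not Flink's internal key distribution.
final class KeyPartitionSketch {

    // Picks a subtask index in [0, parallelism) for a key.
    static int subtaskFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    // Distributes keys across subtasks; each inner list stands for one stream partition.
    static List<List<String>> partition(List<String> keys, int parallelism) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String key : keys) {
            partitions.get(subtaskFor(key, parallelism)).add(key);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // parallelism = 2, as in the figure on this slide
        System.out.println(partition(Arrays.asList("a", "b", "a", "c", "b"), 2));
    }
}
```

Because the subtask is a pure function of the key, per-key state can live locally in one subtask without coordination between subtasks.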
Operator Chain of Flink
[Figure: Streaming Dataflow (condensed view) with Source [1], map() [1],
keyBy()/apply() [1], and Sink [1]. Chained operators are grouped into
subtasks, and each subtask is executed by one thread.]
Windows of Flink
Flink supports window operations based on time and window operations
based on the number of elements.
Categorized by what they measure: time windows and count windows
Categorized by how they are assigned: tumbling windows, sliding windows, and
custom windows
Common Window Types of Flink (1)
Tumbling windows: fixed-size windows whose time ranges do not overlap
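A minimal sketch of tumbling window assignment, assuming windows are aligned to timestamp 0 and identified by their start time. This is plain Java rather than the Flink API, and the names TumblingWindowSketch, windowStart, and assign are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration: each timestamp falls into exactly one fixed-size window.
final class TumblingWindowSketch {

    // Start of the window [start, start + sizeMs) that contains the timestamp.
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    // Groups timestamps by window start.
    static Map<Long, List<Long>> assign(List<Long> timestampsMs, long sizeMs) {
        Map<Long, List<Long>> windows = new TreeMap<>();
        for (long ts : timestampsMs) {
            windows.computeIfAbsent(windowStart(ts, sizeMs), start -> new ArrayList<>()).add(ts);
        }
        return windows;
    }

    public static void main(String[] args) {
        System.out.println(assign(Arrays.asList(1000L, 4999L, 5000L, 12345L), 5000L));
    }
}
```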
Common Window Types of Flink (2)
Sliding windows: fixed-size windows whose time ranges can overlap
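A minimal sketch of why sliding windows overlap: when the slide is smaller than the window size, one timestamp belongs to several windows. The code assumes windows aligned to timestamp 0. It is plain Java rather than the Flink API, and the names SlidingWindowSketch and windowStarts are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration: lists every window [start, start + sizeMs) that contains a timestamp.
final class SlidingWindowSketch {

    // Window starts are multiples of slideMs; the result is in ascending order.
    static List<Long> windowStarts(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestampMs - Math.floorMod(timestampMs, slideMs);
        // Walk backwards while the window still covers the timestamp.
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        Collections.reverse(starts);
        return starts;
    }

    public static void main(String[] args) {
        // size 10 ms, slide 5 ms: each timestamp is covered by two windows
        System.out.println(windowStarts(7L, 10L, 5L));
    }
}
```

When the slide equals the size, the method returns a single start, which is the tumbling case.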
Common Window Types of Flink (3)
Session windows: a session window is considered complete when no data
arrives within the preset time period (the session gap).
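A minimal sketch of session grouping, assuming a new session starts whenever the pause between two consecutive events is longer than the configured gap. This is plain Java rather than the Flink API. The names SessionWindowSketch and sessions are hypothetical, and the exact boundary rule is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration: splits event timestamps into sessions separated by inactivity.
final class SessionWindowSketch {

    static List<List<Long>> sessions(List<Long> timestampsMs, long gapMs) {
        List<Long> sorted = new ArrayList<>(timestampsMs);
        Collections.sort(sorted);
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = null;
        long previous = 0L;
        for (long ts : sorted) {
            // Assumption: a pause longer than gapMs closes the current session.
            if (current == null || ts - previous > gapMs) {
                current = new ArrayList<>();
                result.add(current);
            }
            current.add(ts);
            previous = ts;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sessions(Arrays.asList(1L, 2L, 3L, 20L, 21L, 50L), 10L));
    }
}
```

Unlike tumbling and sliding windows, the window boundaries here depend on the data itself, so the number and length of sessions are not known in advance.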
Fault Tolerance of Flink
The checkpoint mechanism is a key fault tolerance measure of Flink.
The checkpoint mechanism continuously creates state snapshots of stream
applications. The snapshots are stored at a configurable location (for
example, in the memory of the JobManager or on HDFS).
The core of Flink's distributed snapshot mechanism is the barrier. Barriers
are periodically inserted into data streams and flow as part of the data streams.
[Figure: barriers flowing in a DataStream between newer and older tuples.]
Checkpoint Mechanism (1)
The checkpoint mechanism is the cornerstone of Flink's reliability. When
an exception occurs on an operator in the Flink cluster (for example, an
unexpected exit), the checkpoint mechanism can restore the state of the
entire application to a previous point in time so that all states are consistent.
This mechanism ensures that when a running application fails, its state
can be restored from a checkpoint so that each record is processed
exactly once. Alternatively, you can choose at-least-once
processing.
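The recovery idea can be sketched with a single operator that keeps a running sum. The sketch assumes a replayable source addressed by offset and a snapshot that stores the source offset together with the operator state. It is a simplified single-process model rather than Flink code, and the names CheckpointSketch, Snapshot, and runWithFailure are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration: restoring offset and state together yields an exactly-once result.
final class CheckpointSketch {

    // Source position and operator state captured at the same point.
    static final class Snapshot {
        final int offset;
        final long sum;

        Snapshot(int offset, long sum) {
            this.offset = offset;
            this.sum = sum;
        }
    }

    // Sums the input, takes a snapshot every checkpointInterval records (must be > 0),
    // and simulates one crash just before the record at index failAt is processed.
    static long runWithFailure(List<Integer> input, int checkpointInterval, int failAt) {
        Snapshot last = new Snapshot(0, 0L);
        int offset = 0;
        long sum = 0L;
        boolean failed = false;
        while (offset < input.size()) {
            if (!failed && offset == failAt) {
                // Crash: work done since the last snapshot is lost, so roll back and replay.
                failed = true;
                offset = last.offset;
                sum = last.sum;
                continue;
            }
            sum += input.get(offset);
            offset++;
            if (offset % checkpointInterval == 0) {
                last = new Snapshot(offset, sum);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        System.out.println(runWithFailure(input, 3, 5));
    }
}
```

Records between the last snapshot and the crash are read twice, but because the state is rolled back with the offset, each record contributes to the final sum only once.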
Checkpoint Mechanism (2)
[Figure: three stages of a checkpoint. The CheckpointCoordinator triggers a
barrier at the source operator. The barrier flows from the source operator
through the intermediate operator to the sink operator, and each operator
sends its snapshot to the CheckpointCoordinator as the barrier passes it.]
Checkpoint Mechanism (3)
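[Figure: operator C receives input from sources A and B and sends output to D.
The barrier of source A and the barrier of source B arrive at C at different
times. After both barriers have arrived, C takes its snapshot and sends one
merged barrier to D.]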
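The barrier handling for an operator with several inputs can be sketched as follows. The sketch assumes aligned barriers: once a channel has delivered its barrier, later records from that channel are held back until the barriers of all channels have arrived, and only then is the snapshot taken. It is a single-threaded model rather than Flink code, and the names BarrierAlignmentSketch, Event, and process are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of barrier alignment in an operator that sums its inputs.
final class BarrierAlignmentSketch {

    // Either a data record or a checkpoint barrier arriving on an input channel.
    static final class Event {
        final int channel;
        final boolean barrier;
        final long value;

        private Event(int channel, boolean barrier, long value) {
            this.channel = channel;
            this.barrier = barrier;
            this.value = value;
        }

        static Event record(int channel, long value) {
            return new Event(channel, false, value);
        }

        static Event barrier(int channel) {
            return new Event(channel, true, 0L);
        }
    }

    // Returns {sum captured in the snapshot, final sum}. The snapshot is -1 if none was taken.
    static long[] process(List<Event> arrivals, int channels) {
        boolean[] blocked = new boolean[channels];
        List<Event> buffered = new ArrayList<>();
        int barriersSeen = 0;
        long sum = 0L;
        long snapshot = -1L;
        for (Event e : arrivals) {
            if (e.barrier) {
                blocked[e.channel] = true;
                barriersSeen++;
                if (barriersSeen == channels) {
                    snapshot = sum;                 // all barriers aligned: snapshot the state
                    // a single merged barrier would be forwarded downstream here
                    for (Event held : buffered) {
                        sum += held.value;          // release records held during alignment
                    }
                    buffered.clear();
                    Arrays.fill(blocked, false);
                    barriersSeen = 0;
                }
            } else if (blocked[e.channel]) {
                buffered.add(e);                    // belongs to the next checkpoint
            } else {
                sum += e.value;
            }
        }
        return new long[] {snapshot, sum};
    }

    public static void main(String[] args) {
        List<Event> arrivals = Arrays.asList(
                Event.record(0, 1), Event.record(1, 10), Event.barrier(0),
                Event.record(0, 2), Event.record(1, 20), Event.barrier(1));
        System.out.println(Arrays.toString(process(arrivals, 2)));
    }
}
```

Holding back the early channel keeps records that follow a barrier out of the snapshot, which is what makes the snapshot consistent across both inputs.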
Location of Flink in FusionInsight Products
[Figure: FusionInsight product architecture, including the application service
layer and the Open API/SDK and REST/SNMP/Syslog interfaces.]
FusionInsight HD provides a big data processing environment built on enhanced open-source
software and on industry best practices selected for each scenario.
Flink is a unified computing framework that supports both batch processing and stream
processing. Flink provides high-concurrency pipeline data processing, millisecond-level
latency, and high reliability.
Flink WebUI
The FusionInsight HD platform provides a visual management and monitoring UI for
Flink. You can use the Yarn WebUI to query the running status of Flink tasks.
Interaction of Flink with Other Components
In the FusionInsight HD cluster, Flink interacts with the
following components:
HDFS: (mandatory) Flink reads and writes data in HDFS.
Yarn: (mandatory) Flink relies on Yarn to schedule and manage
resources for running tasks.
ZooKeeper: (mandatory) Flink relies on ZooKeeper to implement
the checkpoint mechanism.
Kafka: (optional) Flink can receive data streams sent from Kafka.
Summary
These slides describe the following information about Flink:
basic concepts, application scenarios, technical architecture,
window types, and Flink on Yarn.
These slides also describe Flink integration in FusionInsight HD.
Quiz
1. What are the key features of Flink?
More Information
Training materials:
http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796
Exam outline:
http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797
Mock exam:
http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798
Authentication process:
http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40
Thank You