Flink HandsOn
[Figure: reference big-data stack — high-level interfaces, support/integration, data processing, data storage, resource management]
Apache Flink
• Apache Flink is a framework and distributed processing
engine for stateful computations over unbounded and
bounded data streams.
• Unbounded streams: have a start but no defined end;
must be continuously processed; it is not possible to
wait for all data to arrive
• Stream processing
• Bounded streams: have a defined start and end; can be
processed by ingesting all data before computation;
ordered ingestion is usually not required (data can be sorted)
• Batch processing
Apache Flink
• Key concepts:
• Stream:
• bounded/unbounded;
• real-time/recorded
• State:
• Flink offers state primitives,
• pluggable state backends (e.g., RocksDB),
• exactly-once semantics,
• scalable applications (data partitioning and
distribution)
• Time:
• event-time vs processing-time mode;
• watermark;
• late data handling
Apache Flink: APIs
• Multiple APIs at different levels of abstraction: from the
low-level ProcessFunction, through the core DataStream API,
up to the relational Table API and SQL
Apache Flink: ProcessFunction API
ProcessFunction API:
• Low-level stream processing operations
• Handles events by being invoked for each event
received
• Has access to (via the RuntimeContext) — see the
sketch below:
• Events (stream elements)
• State (fault-tolerant, consistent; only on keyed
streams)
• Timers (event time and processing time; only on
keyed streams)
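A minimal sketch of a KeyedProcessFunction, assuming an input stream of (word, timestamp) tuples keyed by the word; the one-second processing-time timer and the state layout are illustrative choices, not part of the slides:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Count events per key; one second after each event (processing time),
// emit the current count for that key
public class CountWithTimeout
        extends KeyedProcessFunction<String, Tuple2<String, Long>, String> {

    private transient ValueState<Long> countState;

    @Override
    public void open(Configuration parameters) {
        countState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void processElement(Tuple2<String, Long> event, Context ctx,
                               Collector<String> out) throws Exception {
        Long current = countState.value();              // keyed state access
        countState.update(current == null ? 1L : current + 1);
        ctx.timerService().registerProcessingTimeTimer( // keyed timer
                ctx.timerService().currentProcessingTime() + 1000);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx,
                        Collector<String> out) throws Exception {
        out.collect(ctx.getCurrentKey() + ": " + countState.value());
    }
}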
Apache Flink: DataStream API
• Data streaming applications: DataStream API
– Supports functional transformations on data
streams, with user-defined state and flexible
windows
– Example: WindowWordCount, a windowed version of
WordCount written with Flink's DataStream API
See https://bit.ly/2AhCEBX
Apache Flink: DataStream API
DataStream API:
• Provides primitives for many common stream
processing operations:
• Windowing
• Record-at-a-time transformations
• Enriching events
• Based on functions, e.g., map(), reduce(), and
aggregate()
See https://bit.ly/2zEH3Pk
Anatomy of a Flink program
• A DataStream program consists of the same basic steps
(see the skeleton below):
1. Obtain a StreamExecutionEnvironment
2. Create a DataStream from one or more data sources
3. Specify transformations on data by calling methods on
DataStream
4. Specify where to write the results (data sinks)
5. Trigger the program execution
https://ci.apache.org/projects/flink/flink-docs-stable/dev/datastream_api.html
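A minimal sketch of these five steps (the socket source, the uppercase map, and the job name are placeholder choices; imports omitted):

StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment(); // 1. environment

DataStream<String> lines =
        env.socketTextStream("localhost", 9999);              // 2. source

DataStream<String> upper = lines.map(String::toUpperCase);    // 3. transformation

upper.print();                                                // 4. sink

env.execute("Skeleton Job");                                  // 5. trigger execution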
Flink: Lazy evaluation
• Flink programs are executed lazily: each transformation
only adds an operator to the dataflow graph; the actual
computation is triggered when execute() is called on the
StreamExecutionEnvironment
Flink: data sources
• Several predefined stream sources accessible from the
StreamExecutionEnvironment
1. File-based:
– E.g., readTextFile(path) to read text files
– Flink splits the file-reading process into two sub-tasks: directory
monitoring and data reading
• Monitoring is implemented by a single, non-parallel task, while reading is
performed by multiple tasks running in parallel, with parallelism equal to
the job parallelism
2. Socket-based
3. Collection-based
4. Custom
– E.g., to read from Kafka: fromSource() with a KafkaSource (see the
sketch below)
https://nightlies.apache.org/flink/flink-docs-stable/docs/connectors/datastream/kafka/
– See Apache Bahir for streaming connectors and SQL data sources:
https://bahir.apache.org/
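A sketch of the Kafka source following the linked documentation; broker address, topic, and group id are placeholders:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;

KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("localhost:9092")        // placeholder broker
        .setTopics("input-topic")                     // placeholder topic
        .setGroupId("my-group")                       // placeholder group id
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();

DataStream<String> stream = env.fromSource(
        source, WatermarkStrategy.noWatermarks(), "Kafka Source");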
Flink: DataStream transformations
• Map
DataStream → DataStream
– Example: double the values of the input stream
• FlatMap
DataStream → DataStream
– Example: split sentences to words
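Sketches of both examples (imports omitted), assuming existing streams ints of integers and lines of sentences:

// Map: double each value of the input stream
DataStream<Integer> doubled = ints.map(value -> 2 * value);

// FlatMap: split each sentence into words; the type hint is needed
// because the lambda's output type is erased at compile time
DataStream<String> words = lines
        .flatMap((String sentence, Collector<String> out) -> {
            for (String word : sentence.split(" ")) {
                out.collect(word);
            }
        })
        .returns(Types.STRING);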
Flink: DataStream transformations
• Filter
DataStream → DataStream
– Example: filter out zero values
• KeyBy
DataStream → KeyedStream
– To specify a key that logically partitions a stream into disjoint partitions
– Internally, implemented with hash partitioning
– Keys can be specified in different ways; the simplest case is grouping
tuples on one or more of their fields
– Examples (see the sketch below):
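Sketches of both (imports omitted), assuming streams ints of integers and pairs of Tuple2<String, Integer>:

// Filter: keep only non-zero values
DataStream<Integer> nonZero = ints.filter(value -> value != 0);

// KeyBy: logically partition the stream by the first tuple field
KeyedStream<Tuple2<String, Integer>, String> byKey =
        pairs.keyBy(pair -> pair.f0);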
Flink: DataStream transformations
• Reduce
KeyedStream → DataStream
– “Rolling” reduce on a keyed data stream
– Combines the current element with the last reduced value and emits
the new value
– Example: create a stream of partial sums
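A sketch of the partial-sums example, on the keyed stream byKey from the previous sketch:

// Rolling reduce: combine each element with the last reduced value,
// emitting a stream of partial sums per key
DataStream<Tuple2<String, Integer>> partialSums =
        byKey.reduce((a, b) -> new Tuple2<>(a.f0, a.f1 + b.f1));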
Flink: DataStream transformations
• Aggregations
KeyedStream → DataStream
– To aggregate on a keyed data stream
– min returns the minimum value, whereas minBy returns the element that
has the minimum value in this field
• Window
KeyedStream → WindowedStream
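A sketch on the same keyed stream (imports omitted; the 10-second tumbling event-time window is an illustrative choice and requires timestamps/watermarks, discussed later):

// min: minimum of field 1 (other fields are not guaranteed meaningful)
DataStream<Tuple2<String, Integer>> mins = byKey.min(1);

// minBy: the whole element that has the minimum value in field 1
DataStream<Tuple2<String, Integer>> minElems = byKey.minBy(1);

// Window: group the keyed stream into 10-second tumbling windows
WindowedStream<Tuple2<String, Integer>, String, TimeWindow> windowed =
        byKey.window(TumblingEventTimeWindows.of(Time.seconds(10)));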
Flink: DataStream transformations
• Other transformations available in Flink
– join: joins two data streams on a given key
– union: union of two or more data streams creating a new
stream containing all the elements from all the streams
– split: splits the stream into two or more streams
according to some criterion (deprecated in recent Flink
releases in favor of side outputs)
– iterate: creates a “feedback” loop in the flow, by
redirecting the output of one operator to some previous
operator
• Useful for algorithms that continuously update a model
See https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/overview/
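Hedged sketches of union and join (stream names, key fields, and the 5-second window are placeholders; imports omitted):

// Union: merge streams of the same type into one stream
DataStream<Integer> all = streamA.union(streamB, streamC);

// Join: join two streams of Tuple2<String, Integer> on their keys,
// within a shared tumbling window
DataStream<String> joined = left.join(right)
        .where(l -> l.f0)                  // key of the left stream
        .equalTo(r -> r.f0)                // key of the right stream
        .window(TumblingEventTimeWindows.of(Time.seconds(5)))
        .apply((l, r) -> l.f0 + ": " + (l.f1 + r.f1));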
Example: streaming window WordCount
• Count the words arriving from a web socket over 5-second
windows (see the sketch below)
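The program is along the following lines (host and port are placeholders):

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class WindowWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Tuple2<String, Integer>> counts = env
                .socketTextStream("localhost", 9999)     // source
                .flatMap(new Splitter())                 // words -> (word, 1)
                .keyBy(value -> value.f0)                // partition by word
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .sum(1);                                 // count per window

        counts.print();                                  // sink
        env.execute("WindowWordCount");                  // trigger execution
    }

    public static class Splitter
            implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String sentence,
                            Collector<Tuple2<String, Integer>> out) {
            for (String word : sentence.split(" ")) {
                out.collect(new Tuple2<>(word, 1));
            }
        }
    }
}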
Flink: windows support
• Windows can be applied either to keyed streams or to
non-keyed ones
• General structure of a windowed Flink program (sketched below)
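A sketch of both forms (imports omitted), reusing the pairs stream from the earlier sketches:

// Keyed windows: the computation runs in parallel, one logical window per key
DataStream<Tuple2<String, Integer>> keyedCounts = pairs
        .keyBy(p -> p.f0)
        .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
        .sum(1);

// Non-keyed windows (windowAll): all elements in one window, parallelism 1
DataStream<Tuple2<String, Integer>> globalCounts = pairs
        .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(5)))
        .sum(1);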
Flink: window lifecycle
• First, specify whether the stream is keyed or not and define the
window assigner
– A keyed stream allows the windowed computation to be performed in
parallel by multiple tasks
– The window is completely removed when the time (event or
processing time) passes its end timestamp plus the user-specified
allowed lateness
Flink: window assigners
• Session windows
– To group elements by sessions of
activity
– In contrast to tumbling and sliding
windows, session windows do not overlap
and do not have a fixed start and end time
– A session window closes when a
gap of inactivity occurs
• Global windows
– To assign all elements with the
same key to the same single global
window
– Only useful if you also specify a
custom trigger
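Sketches of both assigners (imports omitted; the 10-second gap and the 100-element threshold are illustrative):

// Session windows: a window closes after a 10-second gap of inactivity
DataStream<Tuple2<String, Integer>> sessions = pairs
        .keyBy(p -> p.f0)
        .window(EventTimeSessionWindows.withGap(Time.seconds(10)))
        .sum(1);

// Global window: fires (and purges its state) every 100 elements per key
DataStream<Tuple2<String, Integer>> batches = pairs
        .keyBy(p -> p.f0)
        .window(GlobalWindows.create())
        .trigger(PurgingTrigger.of(CountTrigger.of(100)))
        .sum(1);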
Flink: window functions
• ReduceFunction
– To incrementally aggregate the elements of a window
– Example: sum up the second field of the tuples over all elements in a
window (see the sketch below)
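A sketch of this example (imports omitted; the 5-second window is illustrative):

// Incremental aggregation: tuples are combined as they arrive,
// so the window only stores one running value per key
DataStream<Tuple2<String, Integer>> windowSums = pairs
        .keyBy(p -> p.f0)
        .window(TumblingEventTimeWindows.of(Time.seconds(5)))
        .reduce((a, b) -> new Tuple2<>(a.f0, a.f1 + b.f1));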
Flink: window functions
• AggregateFunction: generalized version of a ReduceFunction
– Example: compute average of the elements in the window
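A sketch close to the documentation's average example; the accumulator holds a (sum, count) pair (imports omitted):

// IN = Tuple2<String, Long>, ACC = Tuple2<Long, Long> (sum, count), OUT = Double
public static class AverageAggregate
        implements AggregateFunction<Tuple2<String, Long>, Tuple2<Long, Long>, Double> {

    @Override
    public Tuple2<Long, Long> createAccumulator() {
        return new Tuple2<>(0L, 0L);
    }

    @Override
    public Tuple2<Long, Long> add(Tuple2<String, Long> value, Tuple2<Long, Long> acc) {
        return new Tuple2<>(acc.f0 + value.f1, acc.f1 + 1L);
    }

    @Override
    public Double getResult(Tuple2<Long, Long> acc) {
        return ((double) acc.f0) / acc.f1;
    }

    @Override
    public Tuple2<Long, Long> merge(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
        return new Tuple2<>(a.f0 + b.f0, a.f1 + b.f1);
    }
}

// usage: input.keyBy(t -> t.f0).window(...).aggregate(new AverageAggregate());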
Flink: window functions
• AggregateFunction
– Example: compute weighted average of the elements in the window
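A hedged variant for the weighted average, assuming elements are Tuple3<String, Double, Double> tuples of (key, value, weight); the accumulator holds (weighted sum, total weight) (imports omitted):

public static class WeightedAverageAggregate implements
        AggregateFunction<Tuple3<String, Double, Double>, Tuple2<Double, Double>, Double> {

    @Override
    public Tuple2<Double, Double> createAccumulator() {
        return new Tuple2<>(0.0, 0.0);
    }

    @Override
    public Tuple2<Double, Double> add(Tuple3<String, Double, Double> v,
                                      Tuple2<Double, Double> acc) {
        // accumulate value * weight and the total weight
        return new Tuple2<>(acc.f0 + v.f1 * v.f2, acc.f1 + v.f2);
    }

    @Override
    public Double getResult(Tuple2<Double, Double> acc) {
        return acc.f0 / acc.f1;
    }

    @Override
    public Tuple2<Double, Double> merge(Tuple2<Double, Double> a, Tuple2<Double, Double> b) {
        return new Tuple2<>(a.f0 + b.f0, a.f1 + b.f1);
    }
}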
Flink: window functions
• ProcessWindowFunction: gets an Iterable containing all
the elements of the window, and a Context object with access to
time and state information
✓ More flexibility than other window functions
✗ At the cost of performance and resource consumption: elements are
buffered until the window is ready for processing
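A sketch that counts the elements of each window and reports window metadata (imports omitted):

// IN = Tuple2<String, Integer>, OUT = String, KEY = String, W = TimeWindow
public static class CountPerWindow extends
        ProcessWindowFunction<Tuple2<String, Integer>, String, String, TimeWindow> {

    @Override
    public void process(String key, Context ctx,
                        Iterable<Tuple2<String, Integer>> elements,
                        Collector<String> out) {
        long count = 0;
        for (Tuple2<String, Integer> ignored : elements) {
            count++; // all buffered elements of the window are available here
        }
        out.collect("key=" + key + " window=" + ctx.window() + " count=" + count);
    }
}

// usage: pairs.keyBy(p -> p.f0).window(...).process(new CountPerWindow());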
Flink: control events
• Control events are special records injected into the data
stream, e.g., watermarks (tracking event-time progress) and
checkpoint barriers (delimiting consistent snapshots)
Flink: watermarks
• Watermarks mark the progress of event time within a
data stream
• They flow as part of the data stream and carry a timestamp t
– W(t) declares that event time
has reached time t in that
stream, meaning that there
should be no more elements
with timestamp t’ <= t
– Crucial for out-of-order
streams, where events are not
ordered by their timestamps
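A sketch of generating watermarks with a bounded out-of-orderness of 5 seconds; MyEvent and getTimestamp() are hypothetical (imports omitted):

// Events may arrive up to 5 seconds out of order; the watermark
// lags 5 seconds behind the maximum timestamp seen so far
DataStream<MyEvent> withWatermarks = events.assignTimestampsAndWatermarks(
        WatermarkStrategy
                .<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                .withTimestampAssigner(
                        // extract the event-time timestamp from each record
                        (event, recordTimestamp) -> event.getTimestamp()));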
Flink: watermarks
• By default, late elements are dropped when the
watermark is past the end of the window
• However, Flink allows specifying a maximum allowed
lateness for window operators
– i.e., by how much time elements can be late before they are
dropped (0 by default)
– Late elements that arrive after the watermark has passed the
end of the window but before it passes the end of the window
plus the allowed lateness, are still added to the window
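A sketch of allowed lateness, also diverting too-late elements to a side output (window size and lateness values are illustrative; imports omitted):

final OutputTag<Tuple2<String, Integer>> lateTag =
        new OutputTag<Tuple2<String, Integer>>("late-data") {};

SingleOutputStreamOperator<Tuple2<String, Integer>> result = pairs
        .keyBy(p -> p.f0)
        .window(TumblingEventTimeWindows.of(Time.seconds(5)))
        .allowedLateness(Time.minutes(1))  // keep window state 1 extra minute
        .sideOutputLateData(lateTag)       // instead of silently dropping
        .sum(1);

// elements that arrived too late to be included in any window firing
DataStream<Tuple2<String, Integer>> lateStream = result.getSideOutput(lateTag);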
Flink: watermarks
• Flink does not provide ordering guarantees after any
form of stream partitioning or broadcasting
– In such cases, dealing with out-of-order tuples is left to the
operator implementation
Flink: application execution
• Data parallelism
– Different operators of the same program may have different
levels of parallelism
– The parallelism of an individual operator, data source, or data
sink can be defined by calling its setParallelism() method
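A sketch of job-level and operator-level parallelism (values are illustrative; Splitter is the flatMap function from the WordCount example):

env.setParallelism(4);           // default parallelism for the whole job

DataStream<Tuple2<String, Integer>> words = lines
        .flatMap(new Splitter())
        .setParallelism(8);      // this operator runs with 8 parallel tasks

words.print().setParallelism(1); // single sink task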
Flink: application monitoring
• Built-in monitoring and metrics system
• Allows gathering and exposing metrics to external systems
• Built-in metrics include
– Throughput: in terms of number of records per second (per
operator/task)
– Latency
• Support for latency tracking: special markers (called LatencyMarker)
are periodically inserted at all sources in order to obtain a distribution
of latency between sources and each downstream operator
– These markers do not account for time spent in operator
processing (or in window buffers)
– They assume that all machine clocks are synchronized
– Used JVM heap/non-heap/direct memory
– Availability, checkpointing
Flink: application monitoring
• Application-specific metrics can be added
– E.g., counters for number of invalid records
• All metrics can be
– queried via Flink’s Monitoring REST API
– visualized in Flink’s Dashboard (Metrics tab)
– or sent to external systems (e.g., Graphite and InfluxDB)
See https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html
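A sketch of such a counter (the emptiness check is a placeholder validity test):

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

public class ValidatingMap extends RichMapFunction<String, String> {
    private transient Counter invalidRecords;

    @Override
    public void open(Configuration parameters) {
        // register a counter in this operator's metric group
        invalidRecords = getRuntimeContext()
                .getMetricGroup()
                .counter("invalidRecords");
    }

    @Override
    public String map(String value) {
        if (value.isEmpty()) {
            invalidRecords.inc(); // visible via the REST API / dashboard
        }
        return value;
    }
}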