
Macroarea di Ingegneria

Dipartimento di Ingegneria Civile e Ingegneria Informatica

Apache Flink: Hands-on Session


Academic Year 2021/22
Matteo Nardelli

Master's Degree in Computer Engineering - 2nd year


The reference Big Data stack

[Stack diagram: High-level Interfaces at the top; Data Processing, Data Storage, and Resource Management layered below; Support / Integration as a cross-cutting column]
Apache Flink
• Apache Flink is a framework and distributed processing
engine for stateful computations over unbounded and
bounded data streams.
• Unbounded streams: have a start but no defined end;
they must be continuously processed, since it is not
possible to wait for all data to arrive
• Stream processing
• Bounded streams: have a defined start and end; they
can be processed by ingesting all data before running
any computation; ordered ingestion is usually not
required (the data can be sorted)
• Batch processing

• Flink has been designed to run in all common cluster
environments, and to perform computations at in-memory
speed and at any scale.
Apache Flink
• Flink is designed to run stateful streaming
applications at any scale.
• Applications are parallelized into possibly thousands of
tasks that are distributed and concurrently executed in a
cluster.
• Leverage In-Memory Performance
• Stateful Flink applications are optimized for local state
access.

Apache Flink
• Key concepts:
• Stream:
• bounded/unbounded;
• real-time/recorded
• State:
• Flink offers state primitives,
• pluggable state backends (e.g., RocksDB),
• exactly-once semantics,
• scalable applications (data partitioning and
distribution)
• Time:
• event-time vs processing-time mode;
• watermark;
• late data handling
Apache Flink: APIs
• Multiple APIs at different levels of abstraction

Apache Flink: ProcessFunction API
ProcessFunction API:
• Low-level stream processing operation
• Handles events by being invoked for each event
received
• Has access to (RuntimeContext):
• Events (stream elements)
• State (fault-tolerant, consistent, only on keyed
stream)
• Timers (event time and processing time, only on
keyed stream)

Read more: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/process_function/
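A minimal sketch of a KeyedProcessFunction (not from the original deck; names and the 60 sec interval are illustrative, imports omitted as elsewhere in the deck): it keeps a per-key count in fault-tolerant keyed state and registers an event-time timer for each element, emitting the current count when a timer fires.

public class CountWithTimer
        extends KeyedProcessFunction<String, Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> count;  // fault-tolerant keyed state

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void processElement(Tuple2<String, Long> value, Context ctx,
                               Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = count.value();
        count.update(current == null ? 1L : current + 1);
        // register an event-time timer 60 sec after this element's timestamp
        ctx.timerService().registerEventTimeTimer(ctx.timestamp() + 60_000);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx,
                        Collector<Tuple2<String, Long>> out) throws Exception {
        // emit the count accumulated so far for this key
        out.collect(Tuple2.of(ctx.getCurrentKey(), count.value()));
    }
}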

Apache Flink: DataStream API
• Data streaming applications: DataStream API
– Supports functional transformations on data
streams, with user-defined state and flexible
windows
– Example: WindowWordCount, a windowed version of
WordCount using Flink's DataStream API, with a sliding
time window of 10 sec length and 5 sec slide
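A hedged reconstruction of the snippet pictured on the slide (the exact listing is at the link below; here words is an already tokenized DataStream<String>):

DataStream<Tuple2<String, Integer>> windowCounts = words
    .map(w -> Tuple2.of(w, 1))
    .returns(Types.TUPLE(Types.STRING, Types.INT))
    .keyBy(t -> t.f0)
    // sliding time window of 10 sec length and 5 sec slide
    .window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)))
    .sum(1);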

See https://bit.ly/2AhCEBX
Apache Flink: DataStream API
DataStream API:
• Provides primitives for many common stream
processing operations:
• Windowing
• Record-at-a-time transformations
• Enriching events
• Based on functions, e.g., map(), reduce(), and
aggregate()

DataStream<Tuple2<String, Long>> result = words
    .map(word -> Tuple2.of(word, 1L))
    .returns(Types.TUPLE(Types.STRING, Types.LONG))
    .keyBy(0)
    .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));
Apache Flink: Table API and SQL
Table API
• Table API and SQL are unified APIs for batch and
stream processing;
• They can be seamlessly integrated with the
DataStream and DataSet APIs;
• They support user-defined scalar, aggregate, and
table-valued functions.
• Relational APIs are designed to ease the definition of
data analytics, data pipelining, and ETL applications
Sessionize a clickstream and count the number of clicks per session

SELECT userId, COUNT(*)
FROM clicks
GROUP BY SESSION(clicktime, INTERVAL '30' MINUTE), userId
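A minimal sketch of how this query could be run from Java (not on the original slide; it assumes a clicks table with userId and clicktime columns has already been registered):

StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
Table sessionClicks = tableEnv.sqlQuery(
    "SELECT userId, COUNT(*) AS cnt " +
    "FROM clicks " +
    "GROUP BY SESSION(clicktime, INTERVAL '30' MINUTE), userId");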
Flink: APIs and libraries
• Batch processing applications: DataSet API
– Supports a wide range of data types beyond key/value
pairs and a wealth of operators
Core of PageRank algorithm using DataSet API
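A hedged sketch of that core (the full listing is at the link below; assumed inputs: ranks as (pageId, rank) tuples, links as (pageId, neighborIds) adjacency lists, DAMPING and NUM_PAGES constants):

IterativeDataSet<Tuple2<Long, Double>> iteration = ranks.iterate(maxIterations);

DataSet<Tuple2<Long, Double>> newRanks = iteration
    // join each page's current rank with its adjacency list
    .join(links).where(0).equalTo(0)
    // distribute the rank of a page evenly over its outgoing links
    .flatMap((Tuple2<Tuple2<Long, Double>, Tuple2<Long, Long[]>> joined,
              Collector<Tuple2<Long, Double>> out) -> {
        double share = joined.f0.f1 / joined.f1.f1.length;
        for (Long neighbor : joined.f1.f1) {
            out.collect(Tuple2.of(neighbor, share));
        }
    })
    .returns(Types.TUPLE(Types.LONG, Types.DOUBLE))
    // sum the partial ranks flowing into each page
    .groupBy(0).sum(1)
    // apply the damping factor
    .map(r -> Tuple2.of(r.f0, (1 - DAMPING) / NUM_PAGES + DAMPING * r.f1))
    .returns(Types.TUPLE(Types.LONG, Types.DOUBLE));

DataSet<Tuple2<Long, Double>> finalRanks = iteration.closeWith(newRanks);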

See https://bit.ly/2zEH3Pk
Anatomy of a Flink program
• Let’s analyze DataStream API
https://ci.apache.org/projects/flink/flink-docs-stable/dev/datastream_api.html

• A special DataStream class is used to represent a
collection of data in a Flink program
• Each Flink program consists of the same basic parts:
1. Obtain an execution environment

2. Load/create initial data

Anatomy of a Flink program
3. Specify transformations on data by calling methods on
DataStream

4. Specify where to put the results of your computations

5. Trigger the program execution by calling execute() on
the StreamExecutionEnvironment
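A minimal sketch mapping code to the five parts above (not on the original slides; host and port are illustrative, statements go inside a main method):

// 1. obtain the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// 2. load/create the initial data
DataStream<String> text = env.socketTextStream("localhost", 9999);
// 3. specify transformations on the data
DataStream<Integer> lengths = text.map(String::length);
// 4. specify where to put the results
lengths.print();
// 5. trigger the program execution
env.execute("Anatomy example");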

Flink: Lazy evaluation

• Flink programs are executed lazily
– When the program's main method is executed, data
loading and transformations do not happen directly
– Rather, each operation is created and added to the
program's plan
– Operations are actually executed when execution
is explicitly triggered by calling execute() on the
execution environment

Flink: data sources
• Several predefined stream sources accessible from the
StreamExecutionEnvironment
1. File-based:
– E.g., readTextFile(path) to read text files
– Flink splits the file reading process into two sub-tasks: directory monitoring
and data reading
• Monitoring is implemented by a single, non-parallel task, while reading is
performed by multiple tasks running in parallel, whose parallelism is equal to
the job parallelism
2. Socket-based
3. Collection-based
4. Custom
– E.g., to read from Kafka fromSource(new KafkaSource<…>(…)); see the
sketch below
https://nightlies.apache.org/flink/flink-docs-stable/docs/connectors/datastream/kafka/
– See Apache Bahir for streaming connectors and SQL data sources:
https://bahir.apache.org/
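A minimal sketch of the KafkaSource builder from the linked docs (broker, topic, and group id are placeholders):

KafkaSource<String> source = KafkaSource.<String>builder()
    .setBootstrapServers("localhost:9092")
    .setTopics("input-topic")
    .setGroupId("my-group")
    .setStartingOffsets(OffsetsInitializer.earliest())
    .setValueOnlyDeserializer(new SimpleStringSchema())
    .build();

DataStream<String> stream =
    env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source");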

Flink: DataStream transformations
• Map
DataStream → DataStream
– Example: double the values of the input stream

• FlatMap
DataStream → DataStream
– Example: split sentences to words
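Hedged reconstructions of the two pictured examples (intStream and sentences are assumed input streams):

// Map: double the values of the input stream
DataStream<Integer> doubled = intStream.map(value -> value * 2);

// FlatMap: split sentences to words
DataStream<String> words = sentences
    .flatMap((String sentence, Collector<String> out) -> {
        for (String word : sentence.split(" ")) {
            out.collect(word);
        }
    })
    .returns(Types.STRING);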

Flink: DataStream transformations
• Filter
DataStream → DataStream
– Example: filter out zero values

• KeyBy
DataStream → KeyedStream
– To specify a key that logically partitions a stream into disjoint partitions
– Internally, implemented with hash partitioning
– Different ways to specify keys, the simplest case is grouping tuples on one
or more fields of the tuple
– Examples:
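Hedged reconstructions of the pictured examples (intStream and tupleStream are assumed input streams):

// Filter: keep only non-zero values
DataStream<Integer> nonZero = intStream.filter(value -> value != 0);

// KeyBy: logically partition tuples by their first field
KeyedStream<Tuple2<String, Long>, String> keyed = tupleStream.keyBy(t -> t.f0);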

Flink: DataStream transformations

• Reduce
KeyedStream → DataStream
– “Rolling” reduce on a keyed data stream
– Combines the current element with the last reduced value and emits
the new value
– Example: create a stream of partial sums
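A hedged reconstruction of the pictured example, on the keyed stream of (word, count) tuples from the previous slide:

// emits an updated partial sum for every incoming element
DataStream<Tuple2<String, Long>> partialSums = keyed
    .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));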

Flink: DataStream transformations
• Aggregations
KeyedStream → DataStream
– To aggregate on a keyed data stream
– min returns the minimum value, whereas minBy returns the element that
has the minimum value in this field

• Window
KeyedStream → WindowedStream
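A hedged sketch of both transformations on the keyed stream of (key, value) tuples used above:

DataStream<Tuple2<String, Long>> mins = keyed.min(1);       // running minimum of field 1
DataStream<Tuple2<String, Long>> minElems = keyed.minBy(1); // element holding the minimum

// group the keyed stream into 5 sec tumbling event-time windows
WindowedStream<Tuple2<String, Long>, String, TimeWindow> windowed =
    keyed.window(TumblingEventTimeWindows.of(Time.seconds(5)));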

Flink: DataStream transformations
• Other transformations available in Flink
– join: joins two data streams on a given key
– union: union of two or more data streams creating a new
stream containing all the elements from all the streams
– split: splits the stream into two or more streams
according to some criterion
– iterate: creates a “feedback” loop in the flow, by
redirecting the output of one operator to some previous
operator
• Useful for algorithms that continuously update a model

See https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/overview/

Example: streaming window WordCount
• Count the words from a web socket in 5 sec windows

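A hedged reconstruction of the listing pictured on this and the following slide (host and port are illustrative):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class WindowWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999)
            .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                for (String word : line.split(" ")) {
                    out.collect(Tuple2.of(word, 1));
                }
            })
            .returns(Types.TUPLE(Types.STRING, Types.INT))
            // Key by the first element of a Tuple
            .keyBy(t -> t.f0)
            .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
            .sum(1)
            .print();

        env.execute("WindowWordCount");
    }
}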

Flink: windows support
• Windows can be applied either to keyed streams or to
non-keyed ones
• General structure of a windowed Flink program
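A minimal sketch of that structure for a keyed stream of (word, count) tuples (not the slide's figure; non-keyed streams use windowAll() instead of keyBy()+window()):

stream
    .keyBy(t -> t.f0)                                      // keyed stream
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))  // window assigner
    //.trigger(...)          optional: when the window is evaluated
    //.evictor(...)          optional: remove elements before/after evaluation
    //.allowedLateness(...)  optional: how long to keep accepting late elements
    .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));       // window function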

Flink: window lifecycle
• First, specify whether the stream is keyed or not and define
the window assigner
– A keyed stream allows the windowed computation to be
performed in parallel by multiple tasks
– The window is completely removed when the time (event or
processing time) passes its end timestamp plus the user-specified
allowed lateness

• Then, associate with the window its trigger, (optional) evictor,
and function
– Trigger determines when a window is ready to be processed by the
window function
– Evictor (optional) has the ability to remove elements from a window
after the trigger fires and before and/or after the window function is
applied
– Function specifies the computation to be applied to the window
contents
Read more: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/windows/
Flink: window assigners
• How elements are assigned to windows
• Support for different window assigners
– Each WindowAssigner comes with a default Trigger
• Built-in assigners for most common use cases:
– Tumbling windows
– Sliding windows
– Session windows
– Global windows
• Except for global windows, they assign elements to
windows based on time, which can either be processing
time or event time
• It is also possible to implement a custom window assigner
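Hedged examples of the built-in assigners on a keyed stream (event-time variants shown; processing-time counterparts exist as well):

keyed.window(TumblingEventTimeWindows.of(Time.seconds(10)));                  // tumbling
keyed.window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)));  // sliding
keyed.window(EventTimeSessionWindows.withGap(Time.minutes(30)));              // session
keyed.window(GlobalWindows.create());  // global: requires a custom trigger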

Flink: window assigners
• Session windows
– To group elements by sessions of
activity
– Differently from tumbling and sliding
windows, they do not overlap and do
not have a fixed start and end time
– A session window closes when a
gap of inactivity occurs
• Global windows
– To assign all elements with the
same key to the same single global
window
– Only useful if you also specify a
custom trigger

Flink: window functions

• Different window functions to specify the computation
on each window

• ReduceFunction
– To incrementally aggregate the elements of a window
– Example: sum up the second fields of the tuples for all elements in a
window
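A hedged reconstruction of the pictured example:

DataStream<Tuple2<String, Long>> sums = keyed
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))
    // incrementally sum the second field of the tuples in each window
    .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));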

Flink: window functions
• AggregateFunction: generalized version of a ReduceFunction
– Example: compute average of the elements in the window
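A hedged reconstruction modeled on the AverageAggregate example in the Flink docs: the accumulator is a running (sum, count) pair over the second field of (String, Long) tuples.

public class AverageAggregate
        implements AggregateFunction<Tuple2<String, Long>, Tuple2<Long, Long>, Double> {

    @Override
    public Tuple2<Long, Long> createAccumulator() {
        return Tuple2.of(0L, 0L);  // (sum, count)
    }

    @Override
    public Tuple2<Long, Long> add(Tuple2<String, Long> value, Tuple2<Long, Long> acc) {
        return Tuple2.of(acc.f0 + value.f1, acc.f1 + 1L);
    }

    @Override
    public Double getResult(Tuple2<Long, Long> acc) {
        return ((double) acc.f0) / acc.f1;
    }

    @Override
    public Tuple2<Long, Long> merge(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
        return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
    }
}
// usage: keyed.window(...).aggregate(new AverageAggregate())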

Flink: window functions
• AggregateFunction
– Example: compute weighted average of the elements in the window

Flink: window functions
• ProcessWindowFunction: gets an Iterable containing all
the elements of the window, and a Context object with access to
time and state information
✓ More flexibility than other window functions
✗ At the cost of performance and resource consumption: elements are
buffered until the window is ready for processing
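A hedged sketch modeled on the docs' example: count the elements of each window by iterating over the buffered Iterable, using the window metadata from the Context.

public class CountProcessWindowFunction
        extends ProcessWindowFunction<Tuple2<String, Long>, String, String, TimeWindow> {

    @Override
    public void process(String key, Context context,
                        Iterable<Tuple2<String, Long>> elements,
                        Collector<String> out) {
        long count = 0;
        for (Tuple2<String, Long> ignored : elements) {
            count++;
        }
        out.collect("window " + context.window() + " for key " + key + ": " + count);
    }
}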

• ReduceFunction and AggregateFunction can execute
more efficiently
– Flink can incrementally aggregate the elements for each
window as they arrive

Flink: control events

• Control events: special events injected in the
data stream by operators

• Two types of control events in Flink
– Watermarks
– Checkpoint barriers

Flink: watermarks
• Watermarks mark the progress of event time within a
data stream
• They flow as part of the data stream and carry a timestamp t
– W(t) declares that event time
has reached time t in that
stream, meaning that there
should be no more elements
with timestamp t’ <= t
– Crucial for out-of-order
streams, where events are not
ordered by their timestamps

Read more: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/event-time/generating_watermarks/
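A minimal sketch of watermark generation with the WatermarkStrategy API from the linked docs (MyEvent and its timestamp field are illustrative; out-of-orderness bound of 5 sec):

DataStream<MyEvent> withTimestamps = events.assignTimestampsAndWatermarks(
    WatermarkStrategy
        .<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
        .withTimestampAssigner((event, recordTs) -> event.timestamp));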

Flink: watermarks
• By default, late elements are dropped when the
watermark is past the end of the window
• However, Flink allows specifying a maximum allowed
lateness for a window operator
– By how much time elements can be late before they are
dropped (0 by default)
– Late elements that arrive after the watermark has passed the
end of the window but before it passes the end of the window
plus the allowed lateness, are still added to the window
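A hedged sketch (window sizes and the side-output tag are illustrative): accept elements up to 10 sec late and redirect anything later to a side output instead of dropping it.

OutputTag<Tuple2<String, Long>> lateTag =
    new OutputTag<Tuple2<String, Long>>("late-data") {};

SingleOutputStreamOperator<Tuple2<String, Long>> result = keyed
    .window(TumblingEventTimeWindows.of(Time.seconds(60)))
    .allowedLateness(Time.seconds(10))  // keep the window 10 sec longer
    .sideOutputLateData(lateTag)        // instead of silently dropping
    .sum(1);

DataStream<Tuple2<String, Long>> lateStream = result.getSideOutput(lateTag);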

Flink: watermarks
• Flink does not provide ordering guarantees after any
form of stream partitioning or broadcasting
– In such cases, dealing with out-of-order tuples is left to the
operator implementation

Flink: application execution
• Data parallelism
– Different operators of the same program may have different
levels of parallelism
– The parallelism of an individual operator, data source, or data
sink can be defined by calling its setParallelism() method
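A minimal sketch (parallelism values are illustrative):

env.setParallelism(4);  // default parallelism for all operators of the job

DataStream<Tuple2<String, Integer>> counts = words
    .map(w -> Tuple2.of(w, 1))
    .returns(Types.TUPLE(Types.STRING, Types.INT))
    .setParallelism(8);  // operator-level override

counts.print().setParallelism(1);  // e.g., force a non-parallel sink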

Flink: application execution

• The execution plan can be visualized
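For instance, the JSON plan can be dumped before calling execute() and pasted into the Flink plan visualizer (https://flink.apache.org/visualizer/):

System.out.println(env.getExecutionPlan());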

Flink: application monitoring
• Built-in monitoring and metrics system
• Allows gathering and exposing metrics to external systems
• Built-in metrics include
– Throughput: in terms of number of records per sec (per
operator/task)
– Latency
• Support for latency tracking: special markers (called LatencyMarker)
are periodically inserted at all sources in order to obtain a distribution
of latency between sources and each downstream operator
– But they do not account for time spent in operator processing
(or in window buffers)
– They assume that all machines' clocks are synchronized
– Used JVM heap/non-heap/direct memory
– Availability, checkpointing

38
V. Cardellini - SABD 2020/21
Flink: application monitoring
• Application-specific metrics can be added
– E.g., counters for number of invalid records (see the sketch below)
• All metrics can be
– queried via Flink’s Monitoring REST API
– visualized in Flink’s Dashboard (Metrics tab)
– or sent to external systems (e.g., Graphite and InfluxDB)
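A hedged sketch of such a counter (names are illustrative), registered through a rich function's metric group:

public class RecordValidator extends RichMapFunction<String, String> {

    private transient Counter invalidRecords;

    @Override
    public void open(Configuration parameters) {
        invalidRecords = getRuntimeContext()
            .getMetricGroup()
            .counter("invalidRecords");
    }

    @Override
    public String map(String value) {
        if (value.isEmpty()) {
            invalidRecords.inc();  // exposed via the REST API / dashboard
        }
        return value;
    }
}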

See https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html
