7 - Streaming 2 - Calcite
Data Safety and Availability
• Survive failure
Taking Snapshots: the Naïve Way
Periodically
• Pause all operators
• Buffer all transient messages
• Each stateful operator does a checkpoint
Recovery
• Revert to the last checkpoint
Applied in Naiad (Murray, Derek G., et al. “Naiad: a timely dataflow system.” Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. ACM, 2013.)
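A minimal sketch of this stop-the-world protocol; the Operator and Channel interfaces are hypothetical placeholders, not the API of Naiad or any real system.

    import java.util.List;

    interface Operator { void pause(); void resume(); void checkpoint(); }
    interface Channel  { void drainToBuffer(); }

    class NaiveSnapshot {
        // Periodically: pause everything, capture in-flight data, checkpoint all state.
        static void takeGlobalSnapshot(List<Operator> operators, List<Channel> channels) {
            operators.forEach(Operator::pause);       // pause all operators
            channels.forEach(Channel::drainToBuffer); // buffer all transient (in-flight) messages
            operators.forEach(Operator::checkpoint);  // each stateful operator checkpoints
            operators.forEach(Operator::resume);      // processing continues after the snapshot
        }
    }

On failure, every operator reloads the state written by its last checkpoint call.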
Asynchronous Snapshots
• Asynchronous state snapshots are taken across nodes
without pausing data processing.
• Snapshots maintain system-wide consistency through
coordinated timing.
• Snapshots enable consistent state recovery after failures.
• Applied in Flink
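A minimal sketch of the barrier-alignment idea behind such snapshots, in the spirit of Flink's mechanism; the snapshot and broadcast calls are hypothetical placeholders.

    import java.util.HashSet;
    import java.util.Set;

    class BarrierAligningOperator {
        private final int numInputs;
        private final Set<Integer> barriersSeen = new HashSet<>();

        BarrierAligningOperator(int numInputs) { this.numInputs = numInputs; }

        // Called when a checkpoint barrier arrives on one input channel.
        void onBarrier(long checkpointId, int inputChannel) {
            barriersSeen.add(inputChannel);           // hold back records from this channel
            if (barriersSeen.size() == numInputs) {   // barrier aligned across all inputs
                snapshotStateAsync(checkpointId);     // snapshot state without pausing the job
                emitBarrierDownstream(checkpointId);  // propagate the barrier to consumers
                barriersSeen.clear();
            }
        }

        void snapshotStateAsync(long id) { /* write operator state to durable storage */ }
        void emitBarrierDownstream(long id) { /* forward the barrier on all outputs */ }
    }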
Reordering and Elimination
Operator Separation
Placement
Load Balancing
Rivetti, Nicolo, et al. “Efficient key grouping for near-optimal load balancing in stream processing systems.” Proceedings of the 9th ACM International Conference on Distributed Event-Based Systems. ACM, 2015.
State Sharing
Batching
Algorithm Selection and Load Shedding
Streaming Systems Overview
Open-Source Stream Processing Systems (1/3)
Operator outputs are materialized in memory (or on disk) until the consuming operator is ready to consume the materialized data.
Spark uses Discretized Streams (D-Streams): streams are interpreted as a series of deterministic batch-processing jobs.
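A minimal sketch of this micro-batch model in Spark's Java streaming API; the socket source and one-second interval are arbitrary choices, and it is the batch interval that turns the stream into a series of small batch jobs.

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class DStreamSketch {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("dstream-sketch");
            // Each 1-second slice of the stream becomes a deterministic batch job.
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
            JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);
            lines.count().print(); // runs once per micro-batch
            jssc.start();
            jssc.awaitTermination();
        }
    }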
Open-Source Stream Processing Systems (3/3)
Flink’s runtime is a native streaming engine
Based on Nephele/PACTs
Programming with Streams
Programming with Streams (2)
Low-Level APIs.
A dataflow program is represented as a directed graph, whose nodes represent computations and whose edges represent connections among dataflow nodes.
Programming with Streams (3)
Functional APIs.
Stream-processing frameworks such as Spark or Flink offer higher-level functional APIs. These APIs are more declarative than low-level dataflow programming: they let programmers specify data-stream programs as transformations on data streams.
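For instance, a word count written in this functional style with Flink's Java DataStream API (a sketch; the socket source is a placeholder):

    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.util.Collector;

    public class WordCountSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.socketTextStream("localhost", 9999)
               .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                   for (String word : line.split(" ")) out.collect(Tuple2.of(word, 1));
               })
               .returns(Types.TUPLE(Types.STRING, Types.INT)) // needed due to lambda type erasure
               .keyBy(t -> t.f0) // group by word
               .sum(1)           // running count per word
               .print();
            env.execute("word-count-sketch");
        }
    }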
Low-Level Dataflow Programming (1)
• Logical Dataflow
• A dataflow program is
represented as a directed
graph of operators, and a
set of edges connecting
those operators.
• Operators are independent
processing units defined by
the programmer, which
take input and produce
output.
• Operators can only
communicate with each
other by their input and
output connections.
• The sketch below shows how the Split operator can be written using Java-like pseudocode.
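A hedged reconstruction in Java-like pseudocode, reusing the onArrivingDataPoint handler convention from the Count example on a later slide; Operator and emit are assumed framework primitives.

    // Split receives a line and emits each word on its output connection.
    class Split extends Operator {
        void onArrivingDataPoint(String line) {
            for (String word : line.split(" ")) {
                emit(word); // send each word to downstream operators
            }
        }
    }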
Low-Level Dataflow Programming (2)
• Physical Dataflow
• A logical dataflow graph is deployed in a distributed environment, in the form of a physical dataflow graph.
• Before execution, systems typically create several parallel instances of the same operator, which we refer to as tasks.
• A system is able to scale out by distributing these tasks across many machines, akin to a MapReduce execution.
• In low-level dataflow programming, the programmer can control the physical dataflow execution, such as the degree of parallelism, as the sketch below shows.
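A sketch of such control in Flink's Java API; the source and the parallelism values are arbitrary.

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class ParallelismSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(4); // default: four parallel task instances per operator
            env.socketTextStream("localhost", 9999)
               .filter(line -> !line.isEmpty())
               .setParallelism(8) // override the degree of parallelism for this operator only
               .print();
            env.execute("parallelism-sketch");
        }
    }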
Low-Level Dataflow Programming (3)
Stateful Operators.
Unlike a simple operator such as Split, certain operators need to keep mutable state. For instance, in the word-counting example, counting the word occurrences received by an operator requires storing the words received thus far along with their respective counts. Thus, the Count operator must keep the current counts as state.
In this example, the state is read (get method) and updated (put method) in every call of the onArrivingDataPoint event handler, as in the sketch below.
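A hedged sketch of the Count operator in the same Java-like pseudocode; the in-memory map stands in for the operator's state.

    class Count extends Operator {
        private final Map<String, Long> counts = new HashMap<>(); // mutable operator state

        void onArrivingDataPoint(String word) {
            Long seen = counts.get(word);                  // read state (get)
            long updated = (seen == null ? 0L : seen) + 1; // count this occurrence
            counts.put(word, updated);                     // update state (put)
            emit(word, updated);                           // emit the new count downstream
        }
    }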
Low-Level Dataflow Programming (4)
Partitioning by key: guarantees that records with the same key (e.g., declared by the user) are sent to the same parallel task of consuming operators.
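A minimal sketch of how such key partitioning is commonly realized: hash the key, then take it modulo the number of consumer tasks.

    class KeyPartitioner {
        // Records with the same key always map to the same consumer task index.
        static int targetTask(String key, int numTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numTasks; // non-negative hash, then modulo
        }
    }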
Functional APIs
Declarative SQL-like APIs
Optimized pipelines
Apache Calcite (https://calcite.apache.org/)
Architecture
Conventional Database
Calcite
Json File for JDBC Connection
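A minimal model file in the style of Calcite's CSV tutorial; the schema name and directory are placeholders. Calcite then connects through the ordinary JDBC URL jdbc:calcite:model=model.json.

    {
      "version": "1.0",
      "defaultSchema": "SALES",
      "schemas": [
        {
          "name": "SALES",
          "type": "custom",
          "factory": "org.apache.calcite.adapter.csv.CsvSchemaFactory",
          "operand": { "directory": "sales" }
        }
      ]
    }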
Using Calcite with Kafka
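A hedged sketch of the Java side: Calcite is queried through its standard JDBC driver, with the Kafka topic exposed as a table by the model file on the next slide; the table and file names here are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class KafkaQuerySketch {
        public static void main(String[] args) throws Exception {
            // The model file exposes the Kafka topic as table KAFKA.TEST_TOPIC (hypothetical names).
            try (Connection conn = DriverManager.getConnection("jdbc:calcite:model=kafka-model.json");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT STREAM * FROM KAFKA.TEST_TOPIC")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1)); // print the first column of each arriving row
                }
            }
        }
    }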
Json File for Kafka Connection
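A hedged example of such a model file; the factory class and operand keys follow the Calcite Kafka adapter documentation as best recalled, so check the adapter docs for the exact field names. Broker address, schema, table, and topic names are placeholders.

    {
      "version": "1.0",
      "defaultSchema": "KAFKA",
      "schemas": [
        {
          "name": "KAFKA",
          "tables": [
            {
              "name": "TEST_TOPIC",
              "type": "custom",
              "factory": "org.apache.calcite.adapter.kafka.KafkaTableFactory",
              "operand": {
                "bootstrap.servers": "localhost:9092",
                "topic.name": "test-topic"
              }
            }
          ]
        }
      ]
    }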
Relational Algebra
• Streaming Operators
• Delta: convert relation to stream
• Chi: convert stream to relation
• Core Operators
• Scan: Retrieves data rows from a table.
• Filter: Selects data rows meeting specific criteria.
• Project: Chooses and possibly transforms columns.
• Join: Combines rows from multiple tables.
• Sort: Orders rows by specified columns.
• Aggregate: Summarizes data with calculations.
• Union: Merges results from multiple queries.
• Values: Creates rows with specified values.
Simple Queries
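For example, assuming the Orders stream used in Calcite's streaming documentation (columns rowtime, productId, orderId, units), a simple streaming query filters and projects rows as they arrive:

    SELECT STREAM rowtime, productId, units
    FROM Orders
    WHERE units > 25;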
Stream-Table
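The same name can be queried both ways (a sketch, assuming the system retains the stream's history):

    -- As a stream: emits new orders as they arrive
    SELECT STREAM * FROM Orders;

    -- As a table: the orders received up to now
    SELECT * FROM Orders;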
Aggregation and Windows on Streams
• Aggregation is indicated by the GROUP BY clause.
• A tuple can contribute to more than one aggregate, e.g., in the case of a sliding window, as the example below shows.
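For example, a tumbling one-hour window in Calcite's streaming SQL; with HOP in place of TUMBLE the windows overlap (slide), so each tuple contributes to several aggregates:

    SELECT STREAM TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS rowtime,
           productId,
           COUNT(*) AS c,
           SUM(units) AS units
    FROM Orders
    GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), productId;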
Making Progress
Window Functions
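For example, a sliding one-hour sum per product expressed with an OVER window, in the style of Calcite's streaming documentation; unlike GROUP BY, every input row produces an output row:

    SELECT STREAM rowtime, productId, units,
           SUM(units) OVER (PARTITION BY productId
                            ORDER BY rowtime
                            RANGE INTERVAL '1' HOUR PRECEDING) AS unitsLastHour
    FROM Orders;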
Join Stream to a Table
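For example, enriching each arriving order with the current product price (a sketch in Calcite streaming SQL; the Products table columns are assumed):

    SELECT STREAM o.rowtime, o.productId, o.units, p.price
    FROM Orders AS o
    JOIN Products AS p ON o.productId = p.productId;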
Join Stream to a Stream
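For example, matching orders to shipments that arrive within one hour (a sketch; the time bound on the join condition lets the system eventually discard old join state):

    SELECT STREAM o.rowtime, o.orderId, s.rowtime AS shipTime
    FROM Orders AS o
    JOIN Shipments AS s
      ON o.orderId = s.orderId
     AND s.rowtime BETWEEN o.rowtime AND o.rowtime + INTERVAL '1' HOUR;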
The End