7 - Streaming 2 - Calcite

The document provides an overview of Big Stream Processing Systems, focusing on data safety, availability, and various optimization techniques such as asynchronous snapshots, operator separation, and load balancing. It discusses the architecture of open-source stream processing systems like Spark and Flink, and the programming abstractions available for stream processing, including low-level dataflow programming and declarative SQL-like APIs. Additionally, it covers the importance of timely data processing and the mechanisms to ensure efficient data handling and query optimization.

CIT650 Introduction to Big Data

Big Stream Processing Systems

Data Safety and Availability

• Ensure that operators see all events
  • At least once
  • Replay a stream from a checkpoint
• Ensure that operators do not perform duplicate state updates
  • Exactly once
  • Several solutions, e.g., snapshots
• Survive failures
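The two guarantees above can be sketched together: replaying from a checkpoint gives at-least-once delivery, and remembering the last offset applied to the state upgrades it to exactly-once state updates. The following is a minimal illustrative sketch, not any real system's API; the operator and method names are invented for the example.

```python
# Illustrative sketch: replay from a checkpoint (at-least-once) plus
# offset-based deduplication (exactly-once state updates).

class CountingOperator:
    def __init__(self):
        self.count = 0
        self.last_offset = -1  # highest offset already applied to state

    def process(self, offset, event):
        if offset <= self.last_offset:
            return  # duplicate from replay: skip to avoid a double update
        self.count += 1
        self.last_offset = offset

    def checkpoint(self):
        return (self.count, self.last_offset)

    def restore(self, snapshot):
        self.count, self.last_offset = snapshot

stream = list(enumerate(["a", "b", "c", "d"]))

op = CountingOperator()
for off, ev in stream[:3]:
    op.process(off, ev)
snapshot = op.checkpoint()          # checkpoint after 3 events

op.process(*stream[3])              # one more event, then a crash...
op.restore(snapshot)                # ...revert to the checkpoint
for off, ev in stream[2:]:          # replay overlaps already-applied data
    op.process(off, ev)

print(op.count)  # 4: each event counted exactly once despite the replay
```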
Taking Snapshots: the Naïve Way

Periodically
• Pause all operators
• Buffer all transient messages
• Each stateful operator takes a checkpoint

Recovery
• Revert to the last checkpoint

Applied in Naiad [Murray, Derek G., et al. "Naiad: a timely dataflow system." Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. ACM, 2013.]
Asynchronous Snapshots

• Asynchronous state snapshots are taken across nodes without pausing data processing.
• Snapshots maintain system-wide consistency through coordinated timing.
• Snapshots enable consistent state recovery after failures.
• Applied in Flink [Carbone, Paris, et al. "Lightweight asynchronous snapshots for distributed dataflows." arXiv preprint arXiv:1506.08603 (2015).]
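The coordination idea can be sketched on a single operator chain: the source injects a barrier marker into the data stream, and each operator snapshots its own state when the barrier reaches it, then keeps processing. This is the intuition behind Flink's mechanism; the code below is an illustrative sketch with invented names, not Flink's API.

```python
# Barrier-based asynchronous snapshot on a two-operator chain (sketch).
BARRIER = object()

class SumOperator:
    def __init__(self, name):
        self.name = name
        self.total = 0
        self.snapshots = {}  # checkpoint id -> state at barrier time

    def process(self, item, checkpoint_id=None):
        if item is BARRIER:
            # Snapshot local state, then forward the barrier downstream.
            self.snapshots[checkpoint_id] = self.total
            return BARRIER
        self.total += item
        return item

op_a = SumOperator("A")
op_b = SumOperator("B")

stream = [1, 2, BARRIER, 3, 4]  # the barrier flows with the data
for item in stream:
    out = op_a.process(item, checkpoint_id=1)
    op_b.process(out, checkpoint_id=1)

# Both operators recorded state consistent with "everything before the barrier",
# and processing never paused globally.
print(op_a.snapshots[1], op_b.snapshots[1])  # 3 3
print(op_a.total, op_b.total)                # 10 10
```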
Automatic Partitioning and Scaling

1. Parallelism techniques supported by Big Streams

• Pipeline-parallel (A || B): This type of parallelism involves processing different stages of a task simultaneously, like an assembly line in a factory.
• Task-parallel (D || E): Here, different tasks are performed in parallel, which can be independent or part of a larger job.
• Data-parallel (G || G): In this model, the same operation is applied to different partitions of data in parallel, enhancing performance and efficiency.
Reordering and Elimination

• Concept: Move more selective operations upstream to filter data early.
• Benefit: Reduces the amount of data processed by downstream operators, optimizing the performance and efficiency of the data processing pipeline.
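The effect of moving a selective filter upstream can be sketched by counting how many records an expensive downstream operator touches in each plan (illustrative code, not a real optimizer):

```python
# Filter pushdown: the expensive map processes far fewer records once the
# selective filter runs first. Both plans produce the same result.

records = list(range(1000))
work_done = {"late": 0, "early": 0}

def expensive_map(x, key):
    work_done[key] += 1     # count how many records the operator processes
    return x * x

# Filter late: the map processes all 1000 records.
late = [y for y in (expensive_map(x, "late") for x in records) if y % 2 == 0]

# Filter early (pushed upstream): the map processes only the survivors.
early = [expensive_map(x, "early") for x in records if (x * x) % 2 == 0]

assert late == early                 # same result either way
print(work_done)  # {'late': 1000, 'early': 500}
```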
Reordering and Elimination

• Concept: Eliminate redundant computations by sharing subgraphs between operations.
• Benefit: Saves computational resources by avoiding duplicate processing, improving system efficiency.
Operator Separation

• Concept: Separate complex operators into smaller computational steps for improved processing efficiency.
• Benefit: Enhances resource utilization, increases throughput, and allows for more granular scaling and fault tolerance.
• Operator separation is profitable if it enables other optimizations, such as operator hoisting or sinking, or if the resulting pipeline parallelism pays off when running on multiple cores.
Fusion

• Concept: Combine multiple processing operators into one to avoid data serialization and transport costs between operations.
• Benefit: Reduces the overhead of moving data between operators and improves runtime performance by minimizing inter-operator communication and data handling.
• Recall narrow dependencies in Spark.
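Fusion can be sketched as follows: instead of materializing an intermediate collection between a map and a filter, the two operators are combined into a single pass over the data, as Spark does when pipelining narrow dependencies (illustrative code):

```python
# Operator fusion: one combined pass instead of two operators with an
# intermediate buffer between them.

data = list(range(10))

# Unfused: the map's full output is materialized before the filter runs.
mapped = [x * 2 for x in data]        # intermediate buffer of 10 items
unfused = [x for x in mapped if x > 10]

# Fused: one combined operator, no intermediate buffer.
def fused_map_filter(items):
    for x in items:
        y = x * 2
        if y > 10:
            yield y                    # each item flows straight through

fused = list(fused_map_filter(data))

assert fused == unfused
print(fused)  # [12, 14, 16, 18]
```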
Placement

• Concept: Assign operations to specific hosts and cores to optimize resource utilization and performance.
• Benefit: Ensures the workload is distributed evenly, facilitating parallelism and reducing latency.
Load Balancing

• Concept: Distribute workload evenly across resources.
• Benefit: Cheap for stateless operators and ensures optimal resource utilization. However, it is expensive for stateful operators, as it requires state splitting and migration [Rivetti, Nicolo, et al. "Efficient key grouping for near-optimal load balancing in stream processing systems." Proceedings of the 9th ACM International Conference on Distributed Event-Based Systems. ACM, 2015.]
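A minimal sketch of why the stateful case is harder: with hash-based key grouping, all records for one key land on one task, so per-key state never needs to be split; rebalancing would mean migrating that state. This is an illustration of plain hash partitioning, not the near-optimal algorithm of Rivetti et al.

```python
# Hash-based key grouping: same key -> same task, always (within one run).

from collections import Counter

NUM_TASKS = 4

def task_for(key):
    # Deterministic hash partitioning of keys onto tasks.
    return hash(key) % NUM_TASKS

words = ["to", "be", "or", "not", "to", "be"] * 100
load = Counter(task_for(w) for w in words)

# Every occurrence of a key goes to exactly one task...
assert len({task_for(w) for w in words if w == "to"}) == 1
# ...and the total load is spread across the available tasks.
print(sum(load.values()))  # 600
```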
State Sharing

• Concept: Optimize for space by avoiding unnecessary copies of data through shared state.
• Benefit: Reduces memory footprint, improves cache performance, and simplifies state management.
Batching

• Concept: Aggregate multiple data items into a single batch for simultaneous processing.
• Benefit: Increases throughput by reducing the per-item processing overhead, improves resource utilization, and facilitates execution scaling, allowing the system to adapt to larger workloads by adjusting the batch size.
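The per-item overhead argument can be sketched directly: grouping items into fixed-size batches pays the per-call cost once per batch instead of once per item (illustrative code; `write_batch` stands in for any expensive per-call operation such as an I/O round trip):

```python
# Batching: fewer expensive calls for the same number of items.

def batches(items, batch_size):
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

calls = 0
def write_batch(batch):
    global calls
    calls += 1       # stand-in for one expensive round trip per call
    return len(batch)

written = sum(write_batch(b) for b in batches(range(10), batch_size=4))
print(written, calls)  # 10 3  (10 items written with only 3 calls)
```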
Algorithm Selection and Load Shedding

• Concept: Use a faster algorithm for implementing an operator as part of the physical query plan.
• Benefit: Provides performance optimization by selecting the most efficient computational method for the operation.
Algorithm Selection and Load Shedding

• Concept: Degrade gracefully when overloaded by selectively dropping requests or reducing functionality.
• Benefit: Prevents system failure under high load, ensuring sustained operation and service availability.
Streaming Systems Overview

Open-Source Stream Processing Systems (1/3)

Reliable handling of huge numbers of concurrent reads and writes


Can be used as data-source / data-sink for Storm, Samza, Flink,
Spark and many more systems
Fault tolerant: Messages are persisted on disk and replicated within
the cluster. Messages (reads and writes) can be repeated

True streaming over distributed dataflow


Low level API: Programmers have to specify the logic of each vertex
in the flow graph
Full understanding and hard coding of all used operators is required
Enables very high throughput (single purpose programs with small
overhead)
True streaming built on top of Apache Kafka and Hadoop YARN
State is first class citizen
Low level API
17
Open-Source Stream Processing Systems (2/3)

Apache Spark
• Spark implements a batch execution engine
  • The execution of a job graph is done in stages
  • Operator outputs are materialized in memory (or on disk) until the consuming operator is ready to consume the materialized data
• Spark uses Discretized Streams (D-Streams)
  • Streams are interpreted as a series of deterministic batch-processing jobs
• Micro-batches have a fixed granularity (default 3 seconds)
  • All windows defined in queries must be multiples of this granularity
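The discretization step can be sketched as follows: timestamped events are cut into fixed-interval micro-batches, and each micro-batch is then handled as an ordinary deterministic batch job. This is an illustration of the idea, not Spark's API; the 3-unit interval mirrors the granularity mentioned above.

```python
# Discretized streams (sketch): bucket events into fixed-interval batches.

from collections import defaultdict

INTERVAL = 3  # micro-batch granularity in time units

events = [(0.5, "a"), (1.2, "b"), (3.1, "c"), (4.0, "d"), (7.9, "e")]

micro_batches = defaultdict(list)
for ts, value in events:
    batch_id = int(ts // INTERVAL)       # which interval the event falls into
    micro_batches[batch_id].append(value)

# Each micro-batch is processed as one deterministic batch job, e.g. a count:
counts = {bid: len(vals) for bid, vals in sorted(micro_batches.items())}
print(counts)  # {0: 2, 1: 2, 2: 1}
```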
Open-Source Stream Processing Systems (3/3)

Apache Flink
• Flink's runtime is a native streaming engine, based on Nephele/PACTs
• Queries are compiled to a program in the form of an operator DAG
  • Operator DAGs are compiled to job graphs
  • Job graphs are generic streaming programs
• Flink implements "true streaming"
  • The whole job graph is deployed concurrently in the cluster
  • Operators are long-running: they continuously consume input and produce output
  • Output tuples are immediately forwarded to succeeding operators and are available for further processing (enables pipeline parallelism)
Programming with Streams

There are different abstraction levels that a programmer can use to express streaming computations. Stream-processing frameworks hide execution details from the programmers and manage them in the background.
Programming with Streams (2)

Low-Level APIs.
A dataflow program is represented as a directed graph, whose nodes represent a computation and whose edges represent connections among dataflow nodes.

A stream-processing system distributes a dataflow graph across multiple machines and is responsible for managing the partitioning of data, the network communication, as well as program recovery in case of machine failure.

Dataflow programming offers programmers complete freedom to implement their business logic, but requires them to have good knowledge of the execution internals.
Programming with Streams (3)

Functional APIs.
Stream-processing frameworks such as Spark or Flink offer higher-level functional APIs. These APIs are more declarative than low-level dataflow programming, giving programmers the ability to specify data-stream programs as transformations on data streams.

High-Level Declarative Languages.
In the past, several research projects in stream processing have proposed, and some of them offered, declarative SQL-like languages for data-stream processing.
Declarativity has the disadvantage of limiting the opportunities for fine-tuning the performance of applications. However, a declarative language allows for automatic optimization and shifts the responsibility of optimization from the programmer to the system.
Low-Level Dataflow Programming (1)

• Logical Dataflow
  • A dataflow program is represented as a directed graph of operators, and a set of edges connecting those operators.
  • Operators are independent processing units defined by the programmer, which take input and produce output.
  • Operators can only communicate with each other by their input and output connections.
  • The bottom figure shows how the Split operator is written using Java-like pseudocode.
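The Split figure itself is not reproduced here; the following is a hedged Python rendering of what such an operator typically looks like: a stateless unit whose only way to communicate downstream is its output connection. The handler name follows the `onArrivingDataPoint` convention used later in these slides.

```python
# Sketch of the Split operator: takes a sentence, emits one word at a time.

class Split:
    def __init__(self, emit):
        self.emit = emit  # the operator's only outgoing connection

    def on_arriving_data_point(self, sentence):
        for word in sentence.split():
            self.emit(word)

collected = []
split = Split(emit=collected.append)
split.on_arriving_data_point("streams are fun")
print(collected)  # ['streams', 'are', 'fun']
```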
Low-Level Dataflow Programming (2)

• Physical Dataflow
  • A logical dataflow graph is deployed in a distributed environment, in the form of a physical dataflow graph.
  • Before execution, systems typically create several parallel instances of the same operator, which we refer to as tasks.
  • A system is able to scale out by distributing these tasks across many machines, akin to a MapReduce execution.
  • In low-level dataflow programming, the programmers can control the physical dataflow execution, such as the degree of parallelism.
Low-Level Dataflow Programming (3)

Stateful Operators.
Unlike a simple operator such as Split, certain operators need to keep mutable state. For instance, in the word-counting example, counting the word occurrences received by an operator requires storing the words received thus far along with their respective counts. Thus, the Count operator must keep a state of the current counts (it needs prior knowledge).

In this example, the state is read (get method) and updated (put method) in every call of the onArrivingDataPoint event handler.
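The Count operator described above can be rendered in Python as a hedged sketch of the slide's Java-like pseudocode (the figure itself is not reproduced; get/put map onto dictionary access here):

```python
# Stateful Count operator: reads (get) and updates (put) its state on every
# call of the event handler.

class Count:
    def __init__(self, emit):
        self.state = {}   # mutable operator state: word -> running count
        self.emit = emit

    def on_arriving_data_point(self, word):
        current = self.state.get(word, 0)   # read state (get)
        self.state[word] = current + 1      # update state (put)
        self.emit((word, current + 1))

out = []
count = Count(emit=out.append)
for w in ["to", "be", "or", "not", "to", "be"]:
    count.on_arriving_data_point(w)

print(out[-2:])  # [('to', 2), ('be', 2)]
```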
Low-Level Dataflow Programming (4)

Partitioning strategies in physical dataflows determine the allocation of records between the parallel tasks of two connected logical operators. They give control over data exchange patterns that fundamentally occur in physical dataflows.

• Random partitioning: each output record of a task is shipped to a uniformly random assigned task of the receiving operator, distributing the workload evenly among tasks of the same operator.
• Broadcast partitioning: send records to every parallel task of the next operator.
• Partitioning by key: guarantees that records with the same key (e.g., declared by the user) are sent to the same parallel task of the consuming operator.
• User-defined partitioning functions: (e.g., geo-partitioning or machine learning model selection).
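The four strategies can be sketched as routing functions from an output record to the set of receiving task indices (illustrative code, not a real system's API):

```python
# Partitioning strategies as routing functions (sketch).

import random

NUM_TASKS = 3

def random_partition(record):
    return [random.randrange(NUM_TASKS)]          # one uniformly random task

def broadcast_partition(record):
    return list(range(NUM_TASKS))                 # every task gets a copy

def key_partition(record, key_fn):
    return [hash(key_fn(record)) % NUM_TASKS]     # same key -> same task

def custom_partition(record, user_fn):
    return [user_fn(record)]                      # e.g. geo-partitioning

record = ("user42", "click")
assert len(broadcast_partition(record)) == NUM_TASKS
# Records sharing a key are routed identically:
assert key_partition(record, key_fn=lambda r: r[0]) == \
       key_partition(("user42", "scroll"), key_fn=lambda r: r[0])
assert random_partition(record)[0] in range(NUM_TASKS)
```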
Functional APIs

• More declarative than the lower-level APIs.
• Certain details of how to execute the computations can be omitted; programmers need only specify what should be computed.
• Collection abstractions for streams.
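A collection-style functional API can be sketched as a chain of transformations: the programmer declares what to compute, and how the chain executes (fusion, partitioning, parallelism) is left to the framework. The tiny `Stream` class below is illustrative, not Spark's or Flink's API.

```python
# Sketch of a collection abstraction for streams with lazy transformations.

class Stream:
    def __init__(self, items):
        self.items = items

    def map(self, fn):
        return Stream(fn(x) for x in self.items)      # lazy: builds a pipeline

    def filter(self, pred):
        return Stream(x for x in self.items if pred(x))

    def collect(self):
        return list(self.items)   # a sink triggers actual execution

result = (Stream(["to be", "or not"])
          .map(str.upper)
          .filter(lambda s: "NOT" in s)
          .collect())
print(result)  # ['OR NOT']
```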
Declarative SQL-like APIs

Can you see the repeating pattern?
• With batch processing: low-level APIs, MapReduce / Spark RDD
• SQL abstraction: Hive, Pig, SparkSQL, etc.
• The same thing happens with streaming systems

We gain the same benefits:
• Wider usability
• Optimized pipelines

Most big stream processing systems depend on Apache Calcite for SQL parsing and logical plan generation.
• Spark is an exception. Which optimizer does Spark use?
Apache Calcite

• An Apache project for building databases and data management systems
• Has an SQL parser and optimization component
• SQL Standard compliant
• Used by different other Apache projects: Hive, Drill, Flink, Storm, Samza
• Generates an optimized logical plan
• Initial release in 2014
• Queries written in SQL are portable among Calcite-compliant systems
• Translation into a physical execution plan is system-specific
  • Like SparkSQL, each system translates the logical plan into a DAG of its low-level APIs

https://calcite.apache.org/
Architecture

Conventional Database

• JDBC Client: This is an application or component that uses JDBC to connect and execute SQL commands on a database server.
• JDBC Server: Acts as an intermediary between the JDBC client and the actual database. It receives SQL commands from the client.
• SQL Parser/Validator: This component parses the SQL queries received and validates them according to the SQL grammar and database schema.
• Query Optimizer: Responsible for determining the most efficient way to execute a given SQL query.
• Data-flow Operators: These are the executable units that perform the actual operations on the data (like joins, filters, aggregations, etc.).
• Metadata: Information about the database structure, such as tables, columns, data types, and other schema details.
• Data: The actual data stored in the database.
Calcite

• JDBC Client: Interfaces with the JDBC server in a manner similar to traditional setups.
• JDBC Server: Receives SQL commands and offers enhanced modularity for customization.
• Optional SQL Parser/Validator: Can be tailored or skipped for flexibility in query processing.
• Core: Central coordinator of the database server's query processing.
• Query Optimizer: A pluggable component that determines efficient query execution plans and can be customized.
• Pluggable 3rd Party Ops: Allows integration of external operations into the query processing flow.
• Metadata SPI: Facilitates access and modification of database metadata by external tools.
• Pluggable Rules: Supports the addition of custom rules for query optimization and execution.
Using Calcite with MySQL

JSON File for JDBC Connection
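The model file itself is not reproduced on this slide. As a hedged sketch, a Calcite model JSON for a JDBC (MySQL) connection typically looks like the following; the schema name, URL, and credentials are placeholders, and the factory class should be checked against the Calcite JDBC adapter documentation:

```json
{
  "version": "1.0",
  "defaultSchema": "SALES",
  "schemas": [
    {
      "name": "SALES",
      "type": "custom",
      "factory": "org.apache.calcite.adapter.jdbc.JdbcSchema$Factory",
      "operand": {
        "jdbcDriver": "com.mysql.cj.jdbc.Driver",
        "jdbcUrl": "jdbc:mysql://localhost:3306/sales",
        "jdbcUser": "user",
        "jdbcPassword": "password"
      }
    }
  ]
}
```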
Using Calcite with Kafka

JSON File for Kafka Connection
Relational Algebra

• Streaming Operators
  • Delta: converts a relation to a stream
  • Chi: converts a stream to a relation
  • In SQL, the STREAM keyword signifies Delta

• Core Operators
  • Scan: Retrieves data rows from a table.
  • Filter: Selects data rows meeting specific criteria.
  • Project: Chooses and possibly transforms columns.
  • Join: Combines rows from multiple tables.
  • Sort: Orders rows by specified columns.
  • Aggregate: Summarizes data with calculations.
  • Union: Merges results from multiple queries.
  • Values: Creates rows with specified values.
Simple Queries

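The queries on this slide are not reproduced. As an illustration in Calcite's streaming SQL dialect, a simple continuous query might look like the following; the `Orders` stream and its columns follow the canonical example used in Calcite's streaming documentation:

```sql
-- The STREAM keyword applies Delta: rows are returned continuously as they
-- arrive, rather than as a finite relation.
SELECT STREAM rowtime, productId, units
FROM Orders
WHERE units > 10;
```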
Stream-Table

Aggregation and Windows on Streams

• Aggregation is indicated by the GROUP BY clause
• A tuple can contribute to more than one aggregate, e.g., in the case of a sliding window
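As a hedged illustration of both cases in Calcite's streaming SQL dialect (the `Orders` stream follows Calcite's documentation examples; check the Calcite streaming docs for the exact window-function syntax):

```sql
-- Tumbling window: each order contributes to exactly one hourly total.
SELECT STREAM FLOOR(rowtime TO HOUR) AS rowtime,
       productId,
       SUM(units) AS units
FROM Orders
GROUP BY FLOOR(rowtime TO HOUR), productId;

-- Sliding (hopping) window: each order contributes to several overlapping
-- one-hour windows that advance every 15 minutes.
SELECT STREAM HOP_END(rowtime, INTERVAL '15' MINUTE, INTERVAL '1' HOUR),
       productId,
       SUM(units) AS units
FROM Orders
GROUP BY HOP(rowtime, INTERVAL '15' MINUTE, INTERVAL '1' HOUR), productId;
```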
Making Progress

It's not enough to get the right result; we need to give the right result at the right time.

Ways to make progress without compromising safety:
• Monotonic columns (e.g., rowtime) and expressions (e.g., floor rowtime to hour)
• Punctuations (aka watermarks)
• Or a combination of both
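The monotonic-column idea can be sketched as follows: because rowtime never decreases, an hourly aggregate can be emitted safely as soon as rowtime moves past the hour boundary, since no earlier-hour tuple can still arrive (illustrative code, not a real system's watermark mechanism):

```python
# Emitting hourly totals as soon as a monotonic rowtime proves the hour is
# complete.

from collections import defaultdict

totals = defaultdict(int)
emitted = []
current_hour = None

events = [(1, 10), (1, 5), (2, 7), (3, 2)]  # (hour(rowtime), units)

for hour, units in events:
    if current_hour is not None and hour > current_hour:
        # rowtime is monotonic, so hour `current_hour` is complete: emit it.
        emitted.append((current_hour, totals[current_hour]))
    current_hour = hour
    totals[hour] += units

print(emitted)  # [(1, 15), (2, 7)]  -- hour 3 is still open
```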
Window Functions

Join Stream to a Table

Join Stream to a Stream

The End

