DBT Unit4 PDF
TECHNOLOGIES
Unit - 4: Data-Stream Management
Raghu B. A.
Department of Computer Science and Engineering
DATABASE TECHNOLOGIES
Introduction
● What is Streamed Data?
○ Data that is produced from multiple sources continuously.
○ E.g.:
■ Social Media Feeds
■ Sensors
■ Cameras
■ E-commerce purchases
■ Log files
■ Server Activity
■ Geo location
● Traditionally, data was captured, stored, and only then pushed out for analysis. With the
introduction of streaming analytics, organizations can gain insights and make decisions based on
the current state of the data.
● Unbounded Data: A type of dataset that is infinite in size (at least theoretically)
● Bounded Data: A dataset of known size
Layman’s explanation:
Stream processing is the practice of taking action on a series of data
at the time the data is created.
Textbook explanation:
Stream processing is the discipline and related set of techniques used
to extract information from unbounded data.
Stream Processing
In stream-processing, there are different timelines as to when the events are produced and
when the events are handled by the system.
Event Time:
● The time when the event was created.
● The time information is provided by the local clock of the device generating the event.
Processing Time:
● The time when the event is handled by the stream-processing system.
● This is the clock of the server running the processing logic.
Cons:
● Size and frequency are unpredictable
● Algorithms must be relatively fast to process the data
Stream Processing vs. Batch Processing
Stream Processing | Batch Processing
Processing happens in real time; takes a few seconds or milliseconds. | Waits for the entire lot before processing; takes a lot of time.
Data size is unknown and infinite. | Data size is known and finite.
Used in social media, stock markets, etc. | Used in billing systems, payroll systems, etc.
Lambda Architecture
● Lambda architecture is a way of processing large quantities of data (i.e. “Big Data”) that
provides access to batch-processing and stream-processing methods with a hybrid
approach.
Lambda Architecture (Cont’d…)
● Batch Layer
○ New data comes continuously, as a feed to the data system. It gets fed to the
batch layer and the speed layer simultaneously.
○ It looks at all the data at once and eventually corrects the data in the stream layer.
○ The batch layer has two very important functions:
■ To manage the master dataset
■ To pre-compute the batch views
● Speed Layer (Stream Layer)
○ This layer handles the data that are not already delivered in the batch view due to
the latency of the batch layer.
○ In addition, it only deals with recent data in order to provide a complete view of the
data to the user by creating real-time views.
● Serving Layer
○ The outputs from the batch layer in the form of batch views and those coming from
the speed layer in the form of near real-time views get forwarded to serving layer.
○ This layer indexes the batch views so that they can be queried in low-latency on
an ad-hoc basis.
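The interplay of the three layers can be sketched in a few lines of Python. This is only an illustrative toy, assuming page-view counting as the workload; the function names (batch_view, realtime_view, serve) and the sample events are hypothetical and not part of any framework.

```python
from collections import Counter

def batch_view(master_dataset):
    """Batch layer: recompute the view from the complete master dataset."""
    return Counter(event["page"] for event in master_dataset)

def realtime_view(recent_events):
    """Speed layer: cover only events the batch view has not absorbed yet."""
    return Counter(event["page"] for event in recent_events)

def serve(batch, realtime):
    """Serving layer: answer queries by merging both views."""
    return batch + realtime

# Events already processed by the last batch run.
master = [{"page": "home"}, {"page": "cart"}, {"page": "home"}]
# Events that arrived after the last batch run.
recent = [{"page": "home"}, {"page": "checkout"}]

merged = serve(batch_view(master), realtime_view(recent))
print(dict(merged))
```

When the next batch run absorbs the recent events into the master dataset, the real-time view for that period can be discarded, which is how the batch layer eventually corrects the speed layer.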
Kappa Architecture
● The Kappa architecture is an event-based software architecture that can handle all data, at any
scale, in real time, for both transactional and analytical workloads.
Advantages of Kappa
● Handle all the use cases with a single architecture
● One codebase that is always in sync
● One set of infrastructure and technology
● The heart of the infrastructure is real-time, scalable, and reliable
● Improved data quality with guaranteed ordering and no mismatches
● No need to re-architect for new use cases
Disadvantages of Kappa
● This architecture is not well suited to computation-intensive applications such as large-scale
machine learning use cases.
● Because there is no batch layer, batch-processing tasks tend to suffer.
Evolution of Spark
1. The First Wave: Functional APIs
● In its early days, Spark stood out for its use of memory and its expressive functional API.
● Spark memory model uses RAM to cache data as it is being processed, resulting in up to
100 times faster processing than Hadoop Map-Reduce.
● Resilient Distributed Dataset (RDD) brought a rich functional programming model that
abstracted out the complexities of distributed computing on a cluster.
● It introduced the concepts of transformations and actions that offered a more expressive
programming model than the map and reduce stages.
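The split between transformations and actions can be illustrated with a small pure-Python toy. This is a sketch of the idea only, not the Spark API: the LazyDataset class and the traced_square helper are hypothetical names. Transformations merely record a step, and nothing is computed until an action such as collect or count is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical; this is not the Spark API)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    # Transformations: return a new dataset description, compute nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        # Apply the recorded steps, in order, to every source item.
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger the actual computation.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

calls = []

def traced_square(x):
    calls.append(x)  # record that the function really ran
    return x * x

squares = LazyDataset(range(1, 6)).map(traced_square).filter(lambda x: x % 2 == 1)
calls_before = list(calls)   # still empty: no action has been called yet
result = squares.collect()   # the action runs the recorded pipeline
print(calls_before, result)
```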
3. A Unified Engine
● Nowadays, Spark is a unified analytics engine offering batch and streaming capabilities that
is compatible with a polyglot approach to data analytics, offering APIs in Scala, Java,
Python, and the R language.
Spark Components:
Abstraction Layers offered by Spark
1. Spark Core
2. Spark SQL
Spark Core
● Contains execution engine and a set of low-level functional APIs used to distribute computations to a cluster
of computing resources, called executors.
● Its cluster abstraction allows it to submit workloads to YARN, Mesos, and Kubernetes.
● It also offers its own standalone cluster mode, in which Spark runs as a dedicated service in a cluster of machines.
● Its data source abstraction enables the integration of many different data providers, such as files, block
stores, databases, and event brokers.
Spark SQL
● Implements the higher-level Dataset and DataFrame APIs of Spark and adds SQL support on top of
arbitrary data sources.
● It also introduces a series of performance improvements through the Catalyst query engine, and code
generation and memory management from project Tungsten.
References:
● R3 - Gerard Maas, Francois Garillot - Stream Processing with Apache Spark: Mastering Structured
Streaming and Spark Streaming - O'Reilly Media (2019)
● https://databricks.com/glossary/lambda-architecture#:~:text=Lambda%20architecture%20is%20a%20way,problem%20of%20computing%20arbitrary%20functions
● https://www.javatpoint.com/apache-kafka-architecture
● https://www.kai-waehner.de/blog/2021/09/23/real-time-kappa-architecture-mainstream-replacing-batch-lambda/
● https://medium.com/cloud-believers/some-key-points-on-lambda-architecture-kappa-architecture-and-their-advantages-disadvantages-91b51cbc7b0a#:~:text=two%20heterogeneous%20systems.-,Disadvantages,batch%20processing%20tasks%20mostly%20suffer.
Stream-Processing Model
Window Aggregations
● Stream-processing systems often feed on actions that occur in real time: social media
messages, clicks on web pages, e-commerce transactions, financial events, and sensor readings
are frequently encountered examples of such events.
● Even though seeing every transaction individually might not be useful or even practical, we
might be interested in seeing the properties of events seen over a recent period of time; for
example, the last 15 minutes or the last hour, or maybe even both.
● As these events keep coming in, the older ones usually become less and less relevant to
whichever processing you are trying to accomplish.
● These regular and recurrent time-based aggregations are called windows.
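Tumbling Windows: fixed-length, non-overlapping windows; each event belongs to exactly one window.
Sliding Windows: fixed-length windows that advance by a slide interval shorter than their length, so
consecutive windows overlap and an event can belong to several of them.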
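How an event is assigned to tumbling and sliding windows can be sketched as follows. This is a minimal illustration, assuming integer timestamps (minutes, for instance) and windows aligned to time zero; the function names are hypothetical and not taken from any streaming library.

```python
def tumbling_window(ts, size):
    """Return the single [start, end) tumbling window that contains ts."""
    start = (ts // size) * size
    return (start, start + size)

def sliding_windows(ts, size, slide):
    """Return every [start, end) sliding window that contains ts.

    Windows start at multiples of `slide` and are `size` long.
    """
    windows = []
    start = (ts // slide) * slide      # latest window start that is <= ts
    while start > ts - size:           # the window must still cover ts
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)

# An event at minute 17, with 15-minute windows.
print(tumbling_window(17, 15))          # one window
print(sliding_windows(17, 15, 5))       # three overlapping windows
```

With a 15-minute window sliding every 5 minutes, each event falls into three windows, which is why sliding aggregations cost more state than tumbling ones.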
Event Time
● Event time refers to the timeline when the events were originally generated.
● Typically, a clock available at the generating device places a timestamp in the event itself, meaning
that all events from the same source could be chronologically ordered even in the case of
transmission delays.
Processing Time
● Processing time is the time when the event is handled by the stream-processing system. This time is
relevant only at the technical or implementation level.
● For example, it could be used to add a processing timestamp to the results and, in that way, tell
duplicates apart: they carry the same output values but different processing times.
Watermark
● The watermark is the oldest timestamp that we will accept on the data stream.
● Any events that are older than this expectation are not taken into the results of the stream processing.
● The streaming engine can choose to process them in an alternative way, like report them in a late
arrivals channel.
● However, to account for possibly delayed events, this watermark is usually much larger than the
average delay we expect in the delivery of the events.
● Note also that this watermark is a fluid value that monotonically increases over time, sliding a window
of delay-tolerance as the time observed in the data-stream progresses.
● Watermarks are non-decreasing by nature.
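The watermark behaviour described above can be sketched in Python. This is a simplified model, assuming the watermark trails the highest event time seen so far by a fixed delay; the function name process_with_watermark and the parameter max_delay are hypothetical, and real engines differ in detail.

```python
def process_with_watermark(events, max_delay):
    """Split events into accepted and late ones.

    events: iterable of (event_time, payload) pairs in arrival order.
    max_delay: how far the watermark trails the highest event time seen.
    """
    watermark = float("-inf")
    accepted, late = [], []
    for event_time, payload in events:
        if event_time < watermark:
            # Older than the watermark: route to a late-arrivals channel.
            late.append((event_time, payload))
            continue
        accepted.append((event_time, payload))
        # max() guarantees the watermark never moves backwards.
        watermark = max(watermark, event_time - max_delay)
    return accepted, late

arrivals = [(10, "a"), (12, "b"), (20, "c"), (11, "d"), (16, "e"), (25, "f"), (14, "g")]
accepted, late = process_with_watermark(arrivals, max_delay=5)
print([p for _, p in accepted], [p for _, p in late])
```

Event "d" (time 11) arrives after the watermark has already advanced to 15, so it is reported as late, while "e" (time 16) is out of order but still within the tolerated delay.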
Effect of Time on Streaming (Cont’d…)
Consider the following series of events produced and
processed over time.
NOTE:
● Event-time processing is another form of stateful computation and is subject to the same
limitation: to handle watermarks, the stream processor needs to store a lot of intermediate data
and, as such, consumes a significant amount of memory that roughly corresponds to:
the length of the watermark × the rate of arrival × message size.
● Since we need to wait for the watermark to expire to be sure that we have all elements that
comprise an interval, stream processes that use a watermark and want a unique, final
result for each interval must delay their output for at least the length of the watermark.
● A watermark that is too small leads to dropping too many events and produces severely
incomplete results.
● A watermark that is too large delays the output of results deemed complete for too long and
increases the resource needs of the stream-processing system, which must preserve all
intermediate events.
● It is left to users to choose a watermark suitable both for the event-time processing they
require and for the computing resources they have available.
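The memory estimate in the note above is simple arithmetic. The figures below (a 10-minute watermark, 10,000 events per second, 200 bytes per event) are made-up assumptions chosen only to show the order of magnitude.

```python
def watermark_buffer_bytes(watermark_seconds, events_per_second, bytes_per_event):
    """Rough state size: watermark length x arrival rate x message size."""
    return watermark_seconds * events_per_second * bytes_per_event

# Assumed workload: 10-minute watermark, 10,000 events/s, 200 bytes each.
estimate = watermark_buffer_bytes(600, 10_000, 200)
print(f"{estimate / 1e9:.1f} GB")
```

Doubling the watermark doubles both the buffered state and the minimum delay before a final result can be emitted.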
References:
● R3 - Gerard Maas, Francois Garillot - Stream Processing with Apache Spark: Mastering
Structured Streaming and Spark Streaming - O'Reilly Media (2019)
● https://www.techopedia.com/definition/30581/ad-hoc-query-sql-programming#:~:text=In%20SQL%2C%20an%20ad%20hoc,under%20dynamic%20programming%20SQL%20query
● https://nlp.stanford.edu/IR-book/html/htmledition/text-classification-and-naive-bayes-1.html#:~:text=A%20standing%20query%20is%20like,terms%20such%20as%20multicore%20processors.
Streaming Architecture
Why do we need Stream Data Architecture?
● Can manage never-ending streams of events
● Real-time processing
● Detects patterns in time-series data
● Data is easily scalable
The following are the three levels of interaction between the batch and stream-
processing components:
1) Code Reuse
(a) A streaming application, often born out of a reference batch implementation, seeks to reuse
as much of it as possible, so as not to duplicate efforts.
(b) This is an area in which Spark shines, since it is particularly easy to call functions that
transform Resilient Distributed Datasets (RDDs) and DataFrames; they share most of
the same APIs, and only the setup of the data input and output is distinct.
2) Data Reuse
(a) A streaming application feeds itself from a feature or data source prepared, at regular
intervals, from a batch processing job.
(b) For example, some international applications must handle time conversions, and a
frequent pitfall is that daylight saving time (DST) rules change on a more frequent basis
than expected. In this case, it is good to be thinking of this data as a new dependent
source that our streaming application feeds itself off.
The Use of a Batch-Processing Component in a Streaming Application
The following are the three levels of interaction between the batch and the
stream-processing components:
1) Code Reuse
2) Data Reuse
3) Mixed Processing
(a) The application itself is understood to have both a batch and a
streaming component during its lifetime.
(b) This pattern does happen relatively frequently, out of a will to
manage both the precision of insights provided by an application,
and as a way to deal with the versioning and the evolution of the
application itself.
NOTE:
● In certain situations, a streaming algorithm performs measurably worse than its batch
counterpart on some inputs.
● On the worst-case input, the batch algorithm, which proceeds in hindsight with strictly more
information, is always expected to perform better than a streaming algorithm.
2) Streaming Algorithms Can’t Be Guaranteed to Measure Well Against Batch Algorithms
● In the bin-packing problem, an input of a set of objects of different sizes or different weights
must be fitted into a number of bins or containers, each of them having a set volume or set
capacity in terms of weight or size.
● The challenge is to find an assignment of objects into bins that minimizes the number of
containers used.
● The online algorithm processes the items in arbitrary order and places each item in the first
bin that can accommodate it.
● If no such bin exists, it opens a new bin and puts the item within that new bin.
● This greedy approximation algorithm always allows placing the input objects into a set number
of bins that is, at worst, sub-optimal; meaning we might use more bins than necessary.
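The gap between the online and the batch approach can be seen on a small example. The sketch below implements the first-fit rule described above; for the batch side it assumes one common hindsight strategy, sorting the items in decreasing order before applying first-fit, which is a choice made here for illustration. The item sizes and the capacity are made up.

```python
def first_fit(items, capacity):
    """Place each item in the first bin with enough room, else open a new bin."""
    bins = []
    for item in items:
        for current in bins:
            if sum(current) + item <= capacity:
                current.append(item)
                break
        else:
            # No existing bin can accommodate the item.
            bins.append([item])
    return bins

items = [2, 5, 4, 7, 1, 3, 8]   # arrival order seen by the streaming algorithm
capacity = 10

online_bins = first_fit(items, capacity)
# Batch version: sees the whole input first, so it can sort before packing.
batch_bins = first_fit(sorted(items, reverse=True), capacity)

print(len(online_bins), online_bins)
print(len(batch_bins), batch_bins)
```

On this input the online run needs four bins while the sorted batch run needs three, which is the kind of difference a competitive ratio quantifies.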
To Summarize:
A) If there is a known competitive ratio for the streaming algorithm at hand, and the
resulting performance is acceptable, running just the stream processing might be
enough.
B) If there is no known competitive ratio between the implemented stream processing
and a batch version, running a batch computation on a regular basis is a valuable
benchmark to which to hold one’s application.
References:
● R3 - Gerard Maas, Francois Garillot - Stream Processing with Apache Spark: Mastering
Structured Streaming and Spark Streaming - O'Reilly Media (2019)
● https://www.softkraft.co/streaming-data-architecture/
● https://www.upsolver.com/blog/streaming-data-architecture-key-components
Apache Spark presents itself as a unified engine, offering developers a consistent environment whenever
they want to develop a batch or a streaming application. The design choices that provide Spark the
stream-processing capabilities are:
● Memory Usage
○ Failure Recovery
○ Lazy Evaluation
○ Cache Hints
● Latency
● Throughput-Oriented Processing
Spark Streaming
This is an API and a set of connectors, in which a Spark program is being served small batches of data
collected from a stream in the form of microbatches spaced at fixed time intervals, performs a given
computation, and eventually returns a result at every interval.
Structured Streaming
This is an API and a set of connectors, built on the substrate of a SQL query optimizer, Catalyst. It offers
an API based on DataFrames and the notion of continuous queries over an unbounded table that is
constantly updated with fresh records from the stream.
1. Failure Recovery
● Spark knows exactly which data source was used to ingest the data and also knows all
the operations that were performed on it. So it can reconstitute the segment of lost data
that was on a crashed executor, from scratch if needed.
● Spark also offers a replication mechanism, similar to that of distributed file systems.
● Since memory is a limited commodity, Spark makes the cache short lived.
3. Cache Hints
● If several operations are to be done on a single intermediate result, Spark lets users specify
that an intermediate value is important and how its contents should be safeguarded for later.
● This avoids multiple computations of the same result.
● In case of high temporary loads, Spark offers the opportunity to spill the cache to secondary
storage to preserve the functional aspects of a data process.
Apache Kafka
• For example
• “Student with ID 23489 entered
building”
Spark Streaming-Kafka
What is Event Streaming Platform?
• Example
• Subscribe to all fire sensors in b block
Content based routing
• Example
• Subscribe to all fire sensors in b block
Communication advantages/disadvantages
Pros Cons
4. What is Kafka Cluster? What is the key benefit of creating Kafka cluster?
● A producer is a client that sends (publishes) records. Producer applications write data
to topics, and consumer applications read from topics.
• Content-based system
Topic based Routing
• Example
• Subscribe to all fire sensors in BE block
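The difference between topic-based and content-based routing can be sketched with a tiny in-memory broker. This is an illustrative toy and not the Kafka API: the Broker class, the topic names, and the event fields (type, block, temp) are all hypothetical.

```python
class Broker:
    """Toy publish/subscribe broker (hypothetical; not a real messaging API)."""

    def __init__(self):
        self.topic_subs = {}     # topic name -> list of callbacks
        self.content_subs = []   # list of (predicate, callback) pairs

    def subscribe_topic(self, topic, callback):
        self.topic_subs.setdefault(topic, []).append(callback)

    def subscribe_content(self, predicate, callback):
        self.content_subs.append((predicate, callback))

    def publish(self, topic, event):
        # Topic-based routing: only the topic name decides who gets the event.
        for callback in self.topic_subs.get(topic, []):
            callback(event)
        # Content-based routing: the event payload itself is inspected.
        for predicate, callback in self.content_subs:
            if predicate(event):
                callback(event)

broker = Broker()
topic_inbox, content_inbox = [], []

# "Subscribe to all fire sensors in B block", expressed as a topic.
broker.subscribe_topic("fire.block-b", topic_inbox.append)
# A content-based subscription can also filter on values inside the event.
broker.subscribe_content(
    lambda e: e["type"] == "fire" and e["block"] == "B" and e["temp"] > 60,
    content_inbox.append,
)

broker.publish("fire.block-b", {"type": "fire", "block": "B", "temp": 71})
broker.publish("fire.block-b", {"type": "fire", "block": "B", "temp": 25})
broker.publish("fire.block-a", {"type": "fire", "block": "A", "temp": 80})

print(len(topic_inbox), len(content_inbox))
```

The topic subscriber receives both events published to its topic, whereas the content subscriber receives only the one whose payload satisfies its predicate.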
http://www.slideshare.net/nathanmarz/storm-distributed-and-faulttolerant-realtime-computation
History
• Problems?
• Cumbersome to build applications (manual
+ tedious + serialize/deserialize message)
• Brittle ( No fault tolerance )
• Pain to scale - same application logic
spread across many workers, deployed
separately
• Hadoop ?
• For parallel batch processing: no hacks for realtime
• Map/Reduce is built to leverage data locality on HDFS to distribute computational jobs.
• Works on big data.
• Storm !
– Stream-processes data in realtime with very low latency!
– Generates big data!
Features
• Simple programming model
• Topology - Spouts – Bolts
• Spouts
• Source of streams: Twitter firehose API
• Stream of tweets or some crawler
Concept - Bolts
• Bolts
• Process (one or more ) input stream and produce new
streams
• Functions
• Filter, Join, Apply/Transform etc
• Parallelize to make it fast! – multiple processes constitute a
bolt
Concepts – Topology & Grouping
• Topology
• Graph of computation – can have cycles
• Network of Spouts and Bolts
• Spouts and bolts execute as many tasks across the cluster
• Grouping
• How to send tuples between the components / tasks?
Concepts – Grouping
• Shuffle Grouping
• Distribute streams “randomly” to bolt’s tasks
• Fields Grouping
• Group a stream by a subset of its fields
• All Grouping
• All tasks of bolt receive all input tuples
• Useful for joins
• Global Grouping
• Pick task with lowest id
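The four groupings can be sketched as functions that map a tuple to the task indexes that should receive it. This is a conceptual Python model rather than Storm's Java API; the function names are hypothetical, and the CRC32 hash is an assumption standing in for whatever hash a real implementation uses.

```python
import random
import zlib

def shuffle_grouping(tup, num_tasks, rng):
    """Send the tuple to one randomly chosen task."""
    return [rng.randrange(num_tasks)]

def fields_grouping(tup, num_tasks, fields):
    """Send tuples with equal values in `fields` to the same task."""
    key = "|".join(str(tup[f]) for f in fields)
    # A stable hash keeps the mapping identical across runs and processes.
    return [zlib.crc32(key.encode("utf-8")) % num_tasks]

def all_grouping(tup, num_tasks):
    """Replicate the tuple to every task."""
    return list(range(num_tasks))

def global_grouping(tup, num_tasks):
    """Send every tuple to the task with the lowest id."""
    return [0]

rng = random.Random(42)
first = {"word": "storm", "count": 1}
second = {"word": "storm", "count": 7}

print(shuffle_grouping(first, 4, rng))
print(fields_grouping(first, 4, ["word"]), fields_grouping(second, 4, ["word"]))
print(all_grouping(first, 4), global_grouping(first, 4))
```

Fields grouping is what makes per-key aggregations such as word counts possible, because all tuples for one word reach the same task.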
Zookeeper
• Open source server for highly reliable distributed
coordination.
• Acts as a replicated synchronization service with eventual consistency.
• Features
• Robust
• Persistent data replicated across multiple nodes
• Master node for writes
• Concurrent reads
• Comprises a tree of znodes, - entities roughly representing file
system nodes.
• Use only for saving small configuration data.
Cluster
Features
• Simple programming model
• Topology - Spouts – Bolts
• Configuring Reliability
• To turn off tuple tracking, set Config.TOPOLOGY_ACKERS to 0.
• Alternatively, emit tuples as unanchored tuples so that they are not tracked.
Exactly Once Semantics ?
• Trident
• High level abstraction for realtime computing on top of storm
• Stateful stream processing with low latency distributed querying
• Provides exactly-once semantics ( avoid over counting )
How can we do ?
Store the transaction id with the count
in the database as an atomic value
https://storm.apache.org/documentation/Trident-state
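Storing the transaction id together with the count can be sketched as follows. This is a simplified Python model, not the Trident API: the CountState class is hypothetical, a dictionary stands in for the database, and the sketch assumes that batches are applied in transaction-id order and that a replayed batch carries the same id and the same contents.

```python
class CountState:
    """Toy key-value store holding (count, last_txid) as one atomic value."""

    def __init__(self):
        self.db = {}   # key -> (count, last applied transaction id)

    def apply_batch(self, txid, key, increment):
        count, last_txid = self.db.get(key, (0, None))
        if last_txid == txid:
            # This batch was already applied; a replay must not count twice.
            return count
        # Count and transaction id are written together in a single update.
        self.db[key] = (count + increment, txid)
        return count + increment

state = CountState()
after_first = state.apply_batch(txid=1, key="apple", increment=3)
after_replay = state.apply_batch(txid=1, key="apple", increment=3)  # replayed
after_second = state.apply_batch(txid=2, key="apple", increment=2)
print(after_first, after_replay, after_second)
```

The replay of transaction 1 leaves the count unchanged, which is how over-counting is avoided.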
Exactly Once Mechanism
Let's take a scenario
Design
https://storm.apache.org/documentation/Trident-state
Improvements and Future Work
• Lax security policies
• Performance and scalability improvements
• Presently, with just 20 nodes, SLAs that require processing more than a million records per
second are achieved.
• High Availability (HA) Nimbus
• Though Nimbus is presently not a single point of failure, its failure does degrade
functionality.
• Enhanced tooling and language support
DEMO
Topology
[Figure: demo topology made up of a Parse Tweet Bolt, a Count Bolt, an Intermediate Ranker Bolt, and a Total Ranker Bolt]
UE19CS344
Apache Flink is a framework and distributed processing engine for stateful computations
over unbounded and bounded data streams.
Flink has been designed to run in all common cluster environments, perform computations
at in-memory speed and at any scale.
Apache Flink is a big data stream processing and batch processing framework that is
developed by the Apache Software Foundation.
Flink can be easily deployed with Hadoop as well as other frameworks, and it handles communication
and data distribution over a stream of data.
The Apache Flink engine is written in Java and Scala and provides high throughput, event
management, and low latency.
Features of Apache Flink
High performance
Flink is designed to achieve high performance and low latency. Flink's pipelined data processing
gives better performance.
Exactly-once stateful computation
Flink's distributed checkpoint processing helps to guarantee processing each record exactly once.
Flexible streaming windows
Flink supports data-driven windows. This means we can design a window based on time, counts, or
sessions. A window can also be customized which allows us to detect specific patterns in event
streams.
Fault tolerance
Flink's distributed, lightweight snapshot mechanism helps in achieving a great degree of fault
tolerance. It allows Flink to provide high-throughput performance with guaranteed delivery.
Memory management
It efficiently does memory management by using hashing, indexing, caching, and sorting.
Optimizer
Flink's batch data processing API is optimized in order to avoid memory-consuming operations such
as shuffle, sort, and so on. It also makes sure that caching is used in order to avoid heavy disk IO
operations.
Stream and batch in one platform
Flink provides APIs for both batch and stream data processing.
Libraries
Flink has a rich set of libraries to do machine learning, graph processing, relational data processing,
and so on. Because of its architecture, it is very easy to perform complex event processing and
alerting.
Event time semantics
Flink supports event time semantics. This helps in processing streams where events arrive out of
order. Sometimes events may come delayed.
Difference between Apache Flink and Apache Spark
Apache Flink uses the Kappa architecture, which processes everything as a single stream of data,
whereas Hadoop and Spark use the Lambda architecture, which uses batches (of data) and
micro-batches (of streamed data) for processing.
Apache Flink | Apache Spark
Processes each event in real time. | Processes events in near real time.
Offers a wide range of techniques for windowing. | Offers basic windowing strategies.
Has Gelly for graph processing. | Has GraphX for graph processing.
Provides an optimizer that optimizes jobs before execution on the streaming engine. | Jobs need to be optimized manually by developers.
Basic concepts
Apache Flink has the following two daemons running on Master and
Slaves nodes.
1. Program
It is the application-developed program that the client system will submit
for execution.
3. DataFlow Graph
In this phase, the application job is converted into a data flow graph for
further execution.
4. Job Manager
In this phase, the Job Manager daemon of Apache Flink schedules the tasks
and sends them to the Task Managers for execution. The Job Manager also
monitors the intermediate results.
5. Task Manager
In this phase, the Task Managers execute the tasks assigned by the Job
Manager.
Windowing
An infinite DataStream is divided into finite slices called windows based on the timestamps of
elements or other criteria.
To do windowing on a stream, we need to assign a key on which the distribution can be made and
a function which describes what transformations to perform on a windowed stream.
To slice streams into windows, we can use pre-implemented Flink window assigners such as,
tumbling windows, sliding windows, global and session windows.
Window Assigners
Global windows
Global windows are never-ending windows unless a trigger specifies otherwise. Each element is assigned to one single per-key
global window. If we don't specify a custom trigger, no computation will ever get triggered.
Tumbling windows
Tumbling windows are created based on time. They are fixed-length, non-overlapping windows. They are useful when
you need to compute over the elements of a specific time span. For example, a tumbling window of 10 minutes can be used
to compute over a group of events occurring within 10 minutes.
Sliding windows
Sliding windows are like tumbling windows, but they overlap. They are fixed-length windows overlapping the previous
ones by a user-given window slide parameter. This type of windowing is useful when you want to compute something out of a
group of events occurring in a certain time frame.
Session windows
Session windows are useful when window boundaries need to be decided based on the input data. Session windows allow
flexibility in window start time and window size. We can also provide a session gap configuration parameter, which indicates
how long to wait before considering the session closed.
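Session windows driven by a gap parameter can be sketched in plain Python. This is an illustration of the idea and not Flink's window assigner: the function name is hypothetical, timestamps are plain integers for a single key, and the sketch assumes that an event extends a session only if it arrives strictly less than `gap` after the previous event.

```python
def session_windows(timestamps, gap):
    """Group event timestamps into sessions separated by at least `gap`.

    Returns a list of (start, end) pairs, where end is the last event
    time plus the gap, i.e. the moment the session is considered closed.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)   # still inside the current session
        else:
            sessions.append([ts])     # inactivity exceeded the gap
    return [(events[0], events[-1] + gap) for events in sessions]

clicks = [1, 2, 4, 20, 22, 40]
print(session_windows(clicks, gap=5))
```

Unlike tumbling or sliding windows, the start time and the length of each window here depend entirely on when the events arrive.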
Physical partitioning
Flink allows us to perform physical partitioning of the stream data. The different types of
partitioning are Custom partitioning, Random partitioning, Rebalancing partitioning and Rescaling.
Custom partitioning
We can provide custom implementation of a partitioner.
Random partitioning
Random partitioning randomly partitions data streams in an even manner.
Rebalancing partitioning
It uses a round robin method to distribute the data evenly.
Rescaling
Rescaling is used to distribute the data across operations, perform transformations on sub-sets
of data and combine them together. This rebalancing happens over a single node only, hence it
does not require any data transfer across networks.
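The round-robin distribution mentioned above can be sketched in a few lines. This is a conceptual Python model rather than Flink's DataStream API; the function name round_robin_partition is hypothetical.

```python
from itertools import cycle

def round_robin_partition(records, num_partitions):
    """Deal records out to partitions one at a time, in a fixed rotation."""
    partitions = [[] for _ in range(num_partitions)]
    for target, record in zip(cycle(range(num_partitions)), records):
        partitions[target].append(record)
    return partitions

parts = round_robin_partition(range(7), 3)
print(parts)
```

Partition sizes differ by at most one record, which is what makes this strategy useful for evening out skewed input.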
Event time and watermarks
Flink supports different concepts of time for its streaming API.
Event time
The time at which an event occurred on its producing device; for example, in an IoT project, the time at which a sensor
captures a reading. Generally, these event times need to be embedded in the records before they enter Flink. At processing
time, these timestamps are extracted and considered for windowing. Event time processing can be used for out-of-order events.
Processing time
Processing time is the time of the machine executing the stream processing. Processing time windowing considers only
the timestamps at which events are processed. Processing time is the simplest way of stream processing, as it does not
require any synchronization between processing machines and producing machines. In a distributed asynchronous environment,
processing time does not provide determinism, as it depends on the speed at which records flow through the system.
Ingestion time
This is the time at which a particular event enters Flink. All time-based operations refer to this timestamp. Ingestion time is
a more expensive operation than processing time, but it gives predictable results. Ingestion time programs cannot handle
out-of-order events, as the timestamp is assigned only after the event has entered the Flink system.
Fault tolerance
Flink's fault tolerance is based on distributed snapshots of the data stream. When stream processing fails, it can be resumed
from a snapshot. To understand Flink’s fault tolerance mechanism, we first need to understand the concept of a barrier.
The stream barrier is the core element of Flink's distributed snapshot. It is treated as a record of the data flow: it is inserted
into the data stream, groups the records in the data flow, and is pushed forward along the direction of the data flow.
Each barrier carries a snapshot ID, and the records belonging to that snapshot flow ahead of the barrier. Because a
barrier is very lightweight, it does not interrupt the data flow.
The records that appear before the barrier belong to the corresponding snapshot of the barrier, and the records that
appear after it belong to the next snapshot.
Multiple barriers from different snapshots may appear in the data stream at the same time, i.e., multiple snapshots may be
in progress concurrently.
When an intermediate operator receives a barrier, it forwards the barrier downstream on its output streams.
When a sink operator receives the barrier, it confirms the snapshot to the checkpoint coordinator.
Only once all sink operators have confirmed the snapshot is the snapshot considered complete.
A snapshot also includes the state held by the operators, to ensure the correct recovery of data flow processing after a
failure.
If an operator contains any form of state, that state must be part of the snapshot.
Operator States
There are two kinds of operator state: system state and user-defined state.
System state
When an operator performs calculation and processing, it needs to buffer data; the state of this data buffer is part of
the system state. The Flink system collects or aggregates incoming records and puts them in the buffer until the data in
the buffer is processed.
User-defined state
It can be a simple variable such as a Java object in a function, or a key/value state related to the function.
THANK YOU
J. Ruby Dinakar
Department of Computer Science and Engineering
Amazon Kinesis
In the Cloud
Spark’s expressive programming models and advanced analytics capabilities can be used on the
cloud, including the offerings of the major players: Amazon, Microsoft, and Google.
In this section, we provide a brief tour of the ways in which the streaming capabilities of Spark
can be used on the cloud infrastructure and with native cloud functions, and, if relevant, how
they compare with the cloud providers’ own proprietary stream-processing system.
Amazon Kinesis is the streaming delivery platform of Amazon Web Services (AWS). It comes with
rich semantics for defining producers and consumers of streaming data, along with connectors for
the pipelines created with those stream endpoints. We touched on Kinesis in Chapter 19, in which
we described the connector between Kinesis and Spark Streaming.
There is a connector between Kinesis and Structured Streaming, as well, which is available
in two flavors:
• One offered natively to users of the Databricks edition of Spark, itself available on the
AWS and Microsoft Azure clouds
• The open source connector tracked under JIRA SPARK-18165, which offers a way to stream data
out of Kinesis easily
Those connectors are necessary because Kinesis, by design, does not come with a
comprehensive stream-processing paradigm besides a language of continuous queries on
AWS analytics, which cover simpler SQL-based queries. Therefore, the value of Kinesis is to
let clients implement their own processing from robust sources and sinks produced with
the battle-tested clients of the Kinesis SDK.
With Kinesis, it is possible to use the monitoring and throttling tools of the AWS platform,
getting a production ready stream delivery “out of the box” on the AWS cloud.
The open source connector between Amazon Kinesis and Structured Streaming is a
contribution from Qubole engineers. This library allows Kinesis to be a full citizen of the
Spark ecosystem, letting Spark users define analytic processing of arbitrary complexity.
Finally, note that the Kinesis connector for Spark Streaming was based on the older receiver
model, which comes with some performance issues.
This Structured Streaming client is much more modern in its implementation, but it has not
yet migrated to the version 2 of the data source APIs, introduced in Spark 2.3.0.
Kinesis is a region of the Spark ecosystem which would welcome easy contributions updating
the quality of its implementations.
https://www.amazonaws.cn/en/products/?nc1=f_dr#analytics