DBT Unit4 PDF

This document provides an overview of data stream management and stream processing technologies. It defines streamed data as data produced continuously from multiple sources. It discusses the key differences between data at rest (batch processing) and data in motion (stream processing), including how stream processing handles unbounded data in real-time. The document also covers stream processing concepts like event time and processing time, as well as distributed stream processing and challenges maintaining state. Finally, it introduces architectures like Lambda and Kappa for stream processing and provides an overview of Apache Spark.


DATABASE

TECHNOLOGIES
Unit - 4: Data-Stream Management

Raghu B. A.
Department of Computer Science and Engineering
DATABASE TECHNOLOGIES
Introduction
● What is Streamed Data?
○ Data that is produced from multiple sources continuously.
○ E.g.:
■ Social Media Feeds
■ Sensors
■ Cameras
■ E-commerce purchases
■ Log files
■ Server Activity
■ Geo location
● Traditionally, data was captured, stored, and then pushed out for analysis. With the
introduction of streaming analytics, organizations can gain insights and make decisions
based on the current state of the data.
● Unbounded Data: A type of dataset that is infinite in size (at least theoretically)
● Bounded Data: A dataset of known size
DATABASE TECHNOLOGIES
Data-Stream Management

Types of Data (The Notion of Time in Stream Processing):

Data at Rest (Static Data):
● Analysis of the data is done after data collection.
● Batch processing.
● The data size determines the time and space required.
● May not be real-time; data was collected in the past.
● Must wait for the entire dataset before computation.
● E.g.: Feedback analysis

Data in Motion (Dynamic Data):
● Data is analyzed as and when it is generated.
● Stream processing.
● Unbounded size, but finite time and space per computation.
● Real-time data (present).
● Computations are relatively fast and simple.
● E.g.: Transaction data
DATABASE TECHNOLOGIES
Data-Stream Management

What is Stream Processing?

Layman’s explanation:
Stream processing is the practice of taking action on a series of data
at the time the data is created.

Textbook explanation:
Stream processing is the discipline and related set of techniques used
to extract information from unbounded data.
DATABASE TECHNOLOGIES
Data-Stream Management

Stream Processing
DATABASE TECHNOLOGIES
Data-Stream Management

Stream Processing
In stream-processing, there are different timelines as to when the events are produced and
when the events are handled by the system.

Event Time:
● The time when the event was created.
● The time information is provided by the local clock of the device generating the event.

Processing Time:
● The time when the event is handled by the stream-processing system.
● This is the clock of the server running the processing logic.
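The two timelines can be made concrete with a small sketch (plain Python, with hypothetical `Event` and `handle` names; not tied to any particular framework):

```python
import time
from dataclasses import dataclass

@dataclass
class Event:
    payload: str
    event_time: float          # stamped by the producing device's local clock
    processing_time: float = 0.0

def handle(event: Event) -> Event:
    # The stream-processing system stamps its own server clock on arrival.
    event.processing_time = time.time()
    return event

# An event created 2.5 s ago arrives now: the two timelines differ.
e = handle(Event(payload="click", event_time=time.time() - 2.5))
delivery_delay = e.processing_time - e.event_time
```

The gap between the two timestamps is exactly the delivery delay discussed later in this unit.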

The Factor of Uncertainty


 Dealing with uncertainty is an important aspect of stream processing.
 Stream processing lets us extract information from infinite data streams observed as
events delivered over time.
 As we receive and process (streaming) data, we need to deal with the additional
complexity of event-time and the uncertainty introduced by an unbounded input.
DATABASE TECHNOLOGIES
Data-Stream Management
Stream Processing
Pros:
● Processing speed is fast
● Processing is simple

Cons:
● Size and frequency are unpredictable
● Algorithms must be relatively fast to process the data

Examples/Use Cases of Stream Processing


● Device Monitoring
● Fault Detection
● Billing Modernization
● Fleet Management
● Media Recommendations
● Faster Loan Processing
A common thread among these use cases is the need of the business to process the data
and create actionable insights in a short period of time from when the data was received.
In all cases, data is better when consumed as fresh as possible.
DATABASE TECHNOLOGIES
Data-Stream Management

Stream Processing

Differences between Stream Processing and Batch Processing


Stream Processing:
● Processes a continuous stream of data immediately as it is produced.
● Processing happens in real time, taking a few seconds or milliseconds.
● Data size is unknown and infinite.
● Response is immediate.
● Used in social media, the stock market, etc.

Batch Processing:
● Processes a high volume of data in batches within a specific time span.
● Waits for the entire lot before processing, which takes much longer.
● Data size is known and finite.
● Response is provided after job completion.
● Used in billing systems, payroll systems, etc.
DATABASE TECHNOLOGIES
Data-Stream Management

Distributed Stream Processing


● In stream processing, only a small portion of the data is available at a time.
● In distributed stream processing, the input data is split into partitions and the processing
load is distributed among a series of executors. The goal is to provide an abstraction that
hides this complexity from the user and lets us reason about the stream as a whole.

Stateful Stream Processing in a Distributed System:


● Consider the scenario of presidential elections. We can count the votes per candidate as
each vote is cast. At any moment, we have a partial count by participant that lets us see
the current standing as well as the voting trend. We can probably anticipate a result.
● To accomplish this, the stream processor needs to keep an internal register of the votes
seen so far. To ensure a consistent count, this register must recover from any partial failure
as we can’t ask the citizens to vote again.
● This shows the challenges of stateful stream processing running in a distributed
environment.
● The main problems of Stateful Stream Processing in a Distributed System are:
○ State has to be preserved over time
○ Data consistency has to be guaranteed even during partial system failures.
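A minimal sketch of this pattern, in plain Python with a hypothetical `VoteCounter` class (a real system would write checkpoints to durable storage and replay unprocessed events from the source):

```python
import copy

class VoteCounter:
    """Keeps a running count per candidate and snapshots its state so it can
    recover a consistent count after a partial failure."""
    def __init__(self):
        self.counts = {}       # internal state: votes seen so far
        self._snapshot = {}    # last checkpoint (durable storage in a real system)

    def on_vote(self, candidate: str) -> None:
        self.counts[candidate] = self.counts.get(candidate, 0) + 1

    def checkpoint(self) -> None:
        self._snapshot = copy.deepcopy(self.counts)

    def recover(self) -> None:
        # We cannot ask citizens to vote again: restore the last checkpoint.
        self.counts = copy.deepcopy(self._snapshot)

vc = VoteCounter()
for ballot in ["A", "B", "A"]:
    vc.on_vote(ballot)
vc.checkpoint()
vc.on_vote("B")    # this vote arrives after the checkpoint
vc.recover()       # simulate a crash and restart
```

Note that the vote cast after the last checkpoint is lost on recovery unless the source can replay it, which is why real systems pair state snapshots with replayable inputs.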
DATABASE TECHNOLOGIES
Data-Stream Management

Lambda Architecture
● Lambda architecture is a way of processing large quantities of data (i.e. “Big Data”) that
provides access to batch-processing and stream-processing methods with a hybrid
approach.
DATABASE TECHNOLOGIES
Data-Stream Management
Lambda Architecture (Cont’d…)
● Batch Layer
○ New data comes continuously, as a feed to the data system. It gets fed to the
batch layer and the speed layer simultaneously.
○ It looks at all the data at once and eventually corrects the data in the stream layer.
○ The batch layer has two very important functions:
■ To manage the master dataset
■ To pre-compute the batch views
● Speed Layer (Stream Layer)
○ This layer handles the data that are not already delivered in the batch view due to
the latency of the batch layer.
○ In addition, it only deals with recent data in order to provide a complete view of the
data to the user by creating real-time views.
● Serving Layer
○ The outputs from the batch layer in the form of batch views and those coming from
the speed layer in the form of near real-time views get forwarded to serving layer.
○ This layer indexes the batch views so that they can be queried in low-latency on
an ad-hoc basis.
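As an illustrative sketch (the page-count views and key names are hypothetical, not tied to any specific technology), the serving layer can be pictured as merging a batch view with a real-time view at query time:

```python
# Batch views: precomputed from the master dataset (complete but stale).
batch_view = {"page_a": 1000, "page_b": 250}

# Real-time views: built by the speed layer from events the batch layer
# has not yet absorbed.
realtime_view = {"page_a": 7, "page_c": 3}

def serve(key: str) -> int:
    """The serving layer answers queries by combining both views."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)
```

Once the batch layer catches up and absorbs the recent events into a new batch view, the corresponding entries are dropped from the real-time view.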
DATABASE TECHNOLOGIES
Data-Stream Management

Lambda Architecture (Cont’d…)


Advantages of Lambda Architecture
● No Server Management – you do not have to install, maintain, or administer any software.
● Flexible Scaling – your application can be either automatically scaled or scaled by the adjustment
of its capacity
● Automated High Availability – serverless applications have built-in availability and fault
tolerance, guaranteeing that every request receives a response indicating whether it was
successful or not.
● Business Agility – React in real-time to changing business/market scenarios
Challenges with Lambda Architectures
● This architecture is complex and redundant, and it must be configured appropriately for
each specific use case.
● In order to run both batch and speed layer synchronously, more computational time and effort are
needed.
● Because of its distinct and distributed nature maintaining and supporting both layers is difficult
and cumbersome.
● Three types of layers (batch, speed, and serving) and different technologies for running multiple
software, make this architecture challenging to implement.
DATABASE TECHNOLOGIES
Data-Stream Management

Kappa Architecture
● The Kappa architecture is an event-based software architecture able to handle data at
any scale in real time, for both transactional AND analytical workloads.
DATABASE TECHNOLOGIES
Data-Stream Management

Kappa Architecture (Cont’d…)


Unlike the Lambda Architecture, in this approach, you only do re-processing when your processing
code changes, and you need to recompute your results.

Advantages of Kappa
● Handle all the use cases with a single architecture
● One codebase that is always in sync
● One set of infrastructure and technology
● The heart of the infrastructure is real-time, scalable, and reliable
● Improved data quality with guaranteed ordering and no mismatches
● No need to re-architect for new use cases

Disadvantages of Kappa
● This architecture cannot manage computation-intensive applications such as large-scale
machine learning use cases.
● Because there is no batch layer, batch-processing tasks tend to suffer.
DATABASE TECHNOLOGIES
Data-Stream Management

Introducing Apache Spark


Apache Spark is a fast, reliable, and fault-tolerant distributed computing framework for
large-scale data processing.

Evolution of Spark
1. The First Wave: Functional APIs
● In the earlier days, Spark paired in-memory computation with an expressive functional API.
● Spark's memory model uses RAM to cache data as it is being processed, resulting in up to
100 times faster processing than Hadoop MapReduce.
● Resilient Distributed Dataset (RDD) brought a rich functional programming model that
abstracted out the complexities of distributed computing on a cluster.
● It introduced the concepts of transformations and actions that offered a more expressive
programming model than the map and reduce stages.
DATABASE TECHNOLOGIES
Data-Stream Management

Evolution of Spark (Cont’d…)


2. The Second Wave: SQL
● With the introduction of Spark SQL and DataFrames, Spark SQL adds SQL support to any
dataset that has a schema. It makes it possible to query a comma-separated values (CSV),
Parquet, or JSON dataset.
● From a performance point of view, Spark SQL brought a query optimizer and a physical
execution engine to Spark, making it even faster while using fewer resources.

3. A Unified Engine
● Nowadays, Spark is a unified analytics engine offering batch and streaming capabilities that
is compatible with a polyglot approach to data analytics, offering APIs in Scala, Java,
Python, and the R language.
DATABASE TECHNOLOGIES
Data-Stream Management

Spark Components:
Abstraction Layers offered by Spark
1. Spark Core
2. Spark SQL

Libraries offered by Spark
1. Spark Streaming
2. Structured Streaming
DATABASE TECHNOLOGIES
Data-Stream Management

Spark Components: Abstraction Layers offered by Spark


1. Spark Core
2. Spark SQL

Spark Core
● Contains the execution engine and a set of low-level functional APIs used to distribute computations to a cluster
of computing resources, called executors.
● Its cluster abstraction allows it to submit workloads to YARN, Mesos, and Kubernetes.
● Uses its own standalone cluster mode, in which Spark runs as a dedicated service in a cluster of machines.
● Its data source abstraction enables the integration of many different data providers, such as files, block
stores, databases, and event brokers.

Spark SQL
● Implements the higher-level Dataset and DataFrame APIs of Spark and adds SQL support on top of
arbitrary data sources.
● It also introduces a series of performance improvements through the Catalyst query engine, and code
generation and memory management from project Tungsten.
DATABASE TECHNOLOGIES
Data-Stream Management

Spark Components: Libraries offered by Spark


1. Spark Streaming
2. Structured Streaming
Spark Streaming
● It was the first stream-processing framework built on top of the distributed processing capabilities of the core
Spark engine.
● It is built on the microbatch model.
● It uses the same functional programming paradigm as the Spark Core, but it introduces a new abstraction,
the Discretized Stream (DStream), which exposes a programming model to operate on the underlying data
in the stream.
Structured Streaming
● It is a stream processor built on top of the Spark SQL abstraction.
● It extends the Dataset and DataFrame APIs with streaming capabilities.
● It adopts the schema-oriented transformation model, which confers the structured part of its name, and
inherits all the optimizations implemented in Spark SQL.
● It supports event-time, streaming joins, and separation from the underlying runtime.
● It delivers a unified model that brings stream processing to the same level of batch-oriented applications.
DATABASE TECHNOLOGIES
Data-Stream Management

References:
● R3 - Gerard Maas, Francois Garillot, "Stream Processing with Apache Spark: Mastering Structured
Streaming and Spark Streaming", O'Reilly Media (2019)

● https://databricks.com/glossary/lambda-architecture

● https://www.javatpoint.com/apache-kafka-architecture

● https://www.kai-waehner.de/blog/2021/09/23/real-time-kappa-architecture-mainstream-replacing-batch-lambda/

● https://medium.com/cloud-believers/some-key-points-on-lambda-architecture-kappa-architecture-and-their-advantages-disadvantages-91b51cbc7b0a
THANK YOU

Raghu B. A.
Department of Computer Science and Engineering
[email protected]
DATABASE
TECHNOLOGIES
Unit - 4: Data-Stream Management

Raghu B. A.
Department of Computer Science and Engineering
DATABASE TECHNOLOGIES
Data-Stream Management

Stream-Processing Model

Sources and Sinks


● Data streams can be made accessible in the streaming framework of Apache Spark using
the concept of streaming data sources.
● In the context of stream processing, accessing data from a stream is often referred to as
consuming the stream.
● We call the abstraction used to write a data stream outside of Apache Spark’s control a
streaming sink.
● Simplified Streaming Model:
DATABASE TECHNOLOGIES
Data-Stream Management

Data Stream Management System


(DSMS)

The diagram represents a DSMS. The system accepts data


streams as input, and also accepts queries. These queries may
be of two kinds:
● Conventional ad-hoc queries:
○ An ad-hoc query is a loosely typed command/query
whose result depends on the value of some variable.
Each time the query is executed, the result may differ
depending on that value; it cannot be predetermined
and, in SQL, usually falls under dynamic SQL.
● Standing queries:
○ A standing query is like any other query except that it is
periodically executed on a collection to which new
documents are incrementally added over time.
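A standing query can be sketched in plain Python as a predicate evaluated incrementally against each newly added document (the `StandingQuery` class and document values are hypothetical):

```python
class StandingQuery:
    """Evaluated incrementally against every document added to the collection."""
    def __init__(self, predicate):
        self.predicate = predicate
        self.matches = []

    def on_new_document(self, doc: str) -> None:
        # Re-apply the query only to the newly arrived document.
        if self.predicate(doc):
            self.matches.append(doc)

alerts = StandingQuery(lambda d: "error" in d)
for doc in ["boot ok", "disk error", "error: timeout"]:
    alerts.on_new_document(doc)
```

Unlike an ad-hoc query, which runs once over the current collection, the standing query stays registered and its result set grows as the stream delivers new documents.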
DATABASE TECHNOLOGIES
Data-Stream Management

Stream-Processing Model

Transformations and Aggregations


● Transformations:
○ They are computations that express themselves in the same way for every element in the stream.
○ For example, creating a derived stream that doubles every element of its input stream
corresponds to a transformation.
● Aggregations:
○ They produce results that depend on many elements and potentially every element of the stream
observed until now.
○ For example, collecting the top five largest numbers of an input stream corresponds to an
aggregation.
○ Computing the average value of some reading every 10 minutes is also an example of an
aggregation.
● Transformations have narrow dependencies (to produce one element of the output, you need only one
of the elements of the input), whereas aggregations have wide dependencies (to produce one element
of the output you would need to observe many elements of the input stream encountered so far).
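A small plain-Python illustration of the two kinds of dependency (the stream values and functions are illustrative):

```python
import heapq

stream = [5, 1, 9, 3, 7, 8, 2, 6]

# Transformation (narrow dependency): each output element is computed
# from exactly one input element.
doubled = [x * 2 for x in stream]

# Aggregations (wide dependency): each output depends on many, potentially
# all, elements observed so far.
top_five = heapq.nlargest(5, stream)
average = sum(stream) / len(stream)
```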
DATABASE TECHNOLOGIES
Data-Stream Management

Stream-Processing Model

Window Aggregations
● Stream-processing systems often feed themselves on actions that occur in real time: social
media messages, clicks on web pages, e-commerce transactions, financial events, and
sensor readings are frequently encountered examples of such events.
● Even though seeing every transaction individually might not be useful or even practical, we
might be interested in seeing the properties of events seen over a recent period of time; for
example, the last 15 minutes or the last hour, or maybe even both.
● As these events keep coming in, the older ones usually become less and less relevant to
whichever processing you are trying to accomplish.
● These regular and recurrent time-based aggregations are called windows.

Types of Window Aggregation


1. Tumbling Windows
2. Sliding Windows
DATABASE TECHNOLOGIES
Data-Stream Management

Tumbling Windows:

● Tumbling windows are the norm when we need to


produce aggregates of our data over regular periods of
time, with each period independent from previous
periods.
● For instance, “the maximum and minimum ambient
temperature each hour” or “the total energy
consumption (kW) each 15 minutes” are examples of
window aggregations.
● Time periods are inherently consecutive and non-
overlapping. This grouping of a fixed time period, in
which each group follows the previous and does not
overlap, is called tumbling windows.
● It usually reads like “a grouping function each ‘x’ period
of time.”
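The "total energy consumption (kW) each 15 minutes" example above can be sketched in plain Python, assuming events carry a timestamp in seconds (the readings are illustrative):

```python
from collections import defaultdict

# (timestamp in seconds, kW reading); window width = 15 minutes = 900 s
readings = [(10, 1.0), (870, 2.0), (910, 4.0), (1795, 3.0), (1800, 5.0)]

def tumbling_sum(events, width):
    """Assign each event to exactly one consecutive, non-overlapping window."""
    windows = defaultdict(float)
    for ts, value in events:
        window_start = (ts // width) * width
        windows[window_start] += value
    return dict(windows)

totals = tumbling_sum(readings, width=900)   # one total per 15-minute window
```

Because the integer division maps each timestamp to exactly one window start, the windows are consecutive and never overlap, which is the defining property of a tumbling window.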
DATABASE TECHNOLOGIES
Data-Stream Management

Sliding Windows:

● Sliding windows are aggregates over a period of time that are reported at a higher
frequency than the aggregation period itself. As such, sliding windows refer to an
aggregation with two time specifications:
○ the window length, and
○ the reporting frequency.
● For example, “the average share price over the last
day reported hourly.” This combination of a sliding
window with the average function is the most widely
known form of a sliding window, commonly known
as a moving average.
● It usually reads like “a grouping function over a time
interval x reported each y frequency.”
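The moving-average example can be sketched in plain Python (for simplicity, the window length and reporting frequency are counted in data points rather than wall-clock time; the prices are illustrative):

```python
from collections import deque

def moving_average(values, window_len, report_every):
    """Sliding window: aggregate over the last `window_len` points,
    reported once every `report_every` points."""
    window, reports = deque(maxlen=window_len), []
    for i, v in enumerate(values, start=1):
        window.append(v)
        if i % report_every == 0:
            reports.append(sum(window) / len(window))
    return reports

prices = [10, 12, 11, 13, 14, 12]
reports = moving_average(prices, window_len=4, report_every=2)
```

The bounded `deque` drops the oldest reading as each new one arrives, so consecutive reports overlap; this overlap is what distinguishes a sliding window from a tumbling one.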
DATABASE TECHNOLOGIES
Data-Stream Management

Types of Window Aggregation


Tumbling Windows
Sliding Windows
DATABASE TECHNOLOGIES
Data-Stream Management

Stateful Stream Processing

● Stateful stream processing is the discipline by which we compute something


out of the new elements of data observed in our input data stream and refresh
internal data that helps us perform this computation.
● For example, if we are trying to do anomaly detection, the internal state that
we want to update with every new stream element would be a machine
learning model, whereas the computation we want to perform is to say
whether an input element should be classified as an anomaly or not.
● This pattern of computation is supported by a distributed streaming system
such as Apache Spark because it can take advantage of a large amount of
computing power and is an exciting new way of reacting to real-time data.
DATABASE TECHNOLOGIES
Data-Stream Management

Effect of Time on Streaming


Computing on Time-stamped Events
● Timestamping is an operation that consists of adding a register of time at the moment of the
generation of the message, which will become a part of the data stream.
● It is a ubiquitous practice, present in everything from the humblest embedded devices to the
most complex logs of financial transaction systems.

Timestamps as the Provider of the Notion of Time


● The importance of time stamping is that it allows users to reason on their data considering the
moment at which it was generated.
● So, because event logs form a large proportion of the data streams being analyzed today, those
timestamps help make sense of what happened to a particular system at a given time.
● But it is an operation prone to different forms of failure in which some events could be delayed,
reordered, or lost.
● Often, users of a framework such as Apache Spark want to compensate for those hazards without
having to compromise on the reactivity of their system. Out of this desire was born a discipline for
producing the following:
○ Clearly marked correct and reordered results
○ Intermediary prospective results
DATABASE TECHNOLOGIES
Data-Stream Management - Effect of Time on Streaming (Cont’d…)

Event Time
● Event time refers to the timeline when the events were originally generated.
● Typically, a clock available at the generating device places a timestamp in the event itself, meaning
that all events from the same source could be chronologically ordered even in the case of
transmission delays.
Processing Time
● Processing time is the time when the event is handled by the stream-processing system. This time is
relevant only at the technical or implementation level.
● For example, it could be used to add a processing timestamp to the results and in that way,
differentiate duplicates, as being the same output values with different processing times.
Watermark
● The watermark is the oldest timestamp that we will accept on the data stream.
● Any events that are older than this expectation are not taken into the results of the stream processing.
● The streaming engine can choose to process them in an alternative way, such as reporting
them in a late-arrivals channel.
● However, to account for possibly delayed events, this watermark is usually much larger than the
average delay we expect in the delivery of the events.
● Note also that this watermark is a fluid value that monotonically increases over time, sliding a window
of delay-tolerance as the time observed in the data-stream progresses.
● Watermarks are non-decreasing by nature.
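A simplified plain-Python model of this behavior (the `delay_tolerance` and event values are illustrative; real engines track watermarks per partition and handle late data more subtly):

```python
def run_with_watermark(events, delay_tolerance):
    """events: (event_time, value) pairs in arrival order. The watermark
    trails the highest event time seen by `delay_tolerance` seconds and
    only ever moves forward (it is non-decreasing)."""
    watermark = float("-inf")
    accepted, late = [], []
    for ts, value in events:
        if ts < watermark:
            late.append((ts, value))   # too old: routed to a late-arrivals channel
        else:
            accepted.append((ts, value))
            watermark = max(watermark, ts - delay_tolerance)
    return accepted, late, watermark

# The event stamped 00:36 arrives only after 01:10 has been seen, so it is late.
events = [(30, "a"), (39, "b"), (33, "c"), (70, "d"), (36, "e")]
accepted, late, wm = run_with_watermark(events, delay_tolerance=10)
```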
DATABASE TECHNOLOGIES
Data-Stream Management - Effect of Time on Streaming (Cont’d…)
Consider the following series of events produced and
processed over time.

● In an ideal world, using a network with zero delay,


events are processed immediately as they are
created.
● Note that there can be no processing events below
that line, because it would mean that events are
processed before they are created.
● The vertical distance between the diagonal and the
processing time is the delivery delay: the time
elapsed between the production of the event and its
eventual consumption.
DATABASE TECHNOLOGIES
Data-Stream Management - Effect of Time on Streaming (Cont’d…)

Now consider a 10-second window aggregation:

● The window corresponding to the time interval


00:30-00:40 is highlighted.
● It contains two events with event time 00:33
and 00:39.
● The window boundaries are well defined, as
we can see in the highlighted area.
● This means that the window has a defined
start and end.
● We know what’s in and what’s out by the time
the window closes.
● Its contents are arbitrary. They are unrelated
to when the events were generated.
● For example, although we would assume that
a 00:30-00:40 window would contain the event
00:36, we can see that it has fallen out of the
resulting set because it was late.
DATABASE TECHNOLOGIES
Data-Stream Management - Effect of Time on Streaming (Cont’d…)

Now consider the same 10-second window defined on


event time.

● In this case, the window 00:30-00:40 contains all the


events that were created in that period of time.
● Note this window has no natural upper boundary that
defines when the window ends.
● The event created at 00:36 was late for more than 20
seconds. As a consequence, to report the results of
the window 00:30-00:40, we need to wait at least until
01:00.
● What if an event is dropped by the network and never
arrives? How long do we wait?
● To resolve this problem, we introduce an arbitrary
deadline called a watermark to deal with the
consequences of this open boundary, like lateness,
ordering, and de-duplication.
DATABASE TECHNOLOGIES
Data-Stream Management - Effect of Time on Streaming (Cont’d…)

Now, let us apply this concept of watermark to our event-time diagram.


● Watermark closes the open boundary left by the definition of event-
time window, providing criteria to decide what events belong to the
window, and what events are too late to be considered for
processing.
● After the watermark is defined, the stream processor can operate in
2 modes with relation to that specific stream:
○ either it is producing output relative to events that are all older
than the watermark, in which case the output is final because all
of those elements have been observed so far, and no further
event older than that will ever be considered, or
○ It is producing an output relative to the data that is before the
watermark and a new, delayed element newer than the
watermark could arrive on the stream at any moment and can
change the outcome.
● In this latter case, we can consider the output as provisional
because newer data can still change the final outcome, whereas in
the former case, the result is final and no new data will be able to
change it.
DATABASE TECHNOLOGIES
Data-Stream Management - Effect of Time on Streaming (Cont’d…)

NOTE:
● Event-time processing is another form of stateful computation and is subject to the same
limitation: to handle watermarks, the stream processor needs to store a lot of intermediate
data and, as such, consumes a significant amount of memory that roughly corresponds to:
the length of the watermark × the rate of arrival × the message size.
● Since we need to wait for the watermark to expire to be sure that we have all elements that
comprise an interval, stream processes that use a watermark and want to have a unique, final
result for each interval, must delay their output for at least the length of the watermark.
● A too small watermark will lead to dropping too many events and produce severely incomplete
results.
● A too large watermark will delay the output of results deemed complete for too long and
increase the resource needs of the stream processing system to preserve all intermediate
events.
● It is left to the users to ensure that they choose a watermark suitable both for the
event-time processing they require and for the computing resources they have available.
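Plugging illustrative numbers into the sizing formula above (all figures are assumptions, not measurements):

```python
# Illustrative sizing: a 10-minute watermark, 1,000 events/second,
# and 200-byte messages (assumed figures, not measurements).
watermark_seconds = 10 * 60
arrival_rate = 1_000      # events per second
message_size = 200        # bytes per message

state_bytes = watermark_seconds * arrival_rate * message_size
state_mb = state_bytes / 1_000_000   # intermediate state the processor must hold
```

Even these modest assumed rates imply roughly 120 MB of intermediate events held purely to honor the watermark, which is why overly long watermarks inflate resource needs.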
DATABASE TECHNOLOGIES
Data-Stream Management

References:
● R3 - Gerard Maas, Francois Garillot, "Stream Processing with Apache Spark: Mastering
Structured Streaming and Spark Streaming", O'Reilly Media (2019)

● https://www.techopedia.com/definition/30581/ad-hoc-query-sql-programming

● https://nlp.stanford.edu/IR-book/html/htmledition/text-classification-and-naive-bayes-1.html
THANK YOU

Raghu B. A.
Department of Computer Science and Engineering
[email protected]
DATABASE
TECHNOLOGIES
Unit - 4: Data-Stream Management

Raghu B. A.
Department of Computer Science and Engineering
DATABASE TECHNOLOGIES
Data-Stream Management

Streaming Architecture

What is Streaming Architecture?


● A streaming data architecture is a framework that is capable of ingesting
and processing massive amounts of streaming data from a variety
of different sources.
● Traditional data solutions focused on processing data in batches.
● A streaming data architecture consumes data immediately as it is
generated, stores it and performs data manipulation, analytics etc.
DATABASE TECHNOLOGIES
Data-Stream Management

Streaming Architecture
Why do we need Stream Data Architecture?
● Can manage never-ending streams of events
● Real-time processing
● Detects patterns in time-series data
● Data is easily scalable

Components of a Stream Architecture:


● Message Broker (Stream Processor)
○ Apache Kafka
○ RabbitMQ
● Batch processing and Real-time ETL (extract, transform and load) Tools
● Streaming Data Storage
● Data Analytics or Server-less Query Engine
DATABASE TECHNOLOGIES
Data-Stream Management Data Platform

The following components constitute a data platform:


1) The Hardware Level
➢ On-premises hardware installations, data centers, or potentially
virtualized in homogeneous cloud solutions with a base operating
system installed.
2) The Persistence Level
➢ On top of that baseline infrastructure, you would find distributed
storage solutions like the Hadoop Distributed File System (HDFS)
—among many other distributed storage systems.
➢ On the cloud, this persistence layer is provided by a dedicated
service such as Amazon Simple Storage Service (Amazon S3) or
Google Cloud Storage.
3) The Resource Manager
➢ Most cluster architectures offer a single point of negotiation to
submit jobs to be executed on the cluster. This is the task of the
resource manager, like YARN and Mesos, and the more evolved
schedulers of the cloud-native era, like Kubernetes.
DATABASE TECHNOLOGIES
Data-Stream Management Data Platform
DATABASE TECHNOLOGIES
Data-Stream Management Data Platform (Cont’d…)

4) The Execution Engine


➢ At an even higher level, there is the execution engine, which is tasked with
executing the actual computation.
➢ It holds the interface with the programmer’s input and describes the data
manipulation.
➢ Apache Spark, Apache Flink, or MapReduce would be examples of this.
5) A Data Ingestion Component
➢ Data ingestion server that could be plugged directly into that engine.
➢ The realm of messaging systems or log processing engines such as
Apache Kafka is set at this level.
6) A Processed Data Sink
7) A Visualization Layer
DATABASE TECHNOLOGIES
Data-Stream Management Data Platform (Cont’d…)

4) The Execution Engine


5) A Data Ingestion Component
6) A Processed Data Sink
➢ On the output side of an execution engine, you will frequently find a high-level
data sink, which might be either another analytics system (in the case of an
execution engine tasked with an Extract, Transform and Load [ETL] job), a
NoSQL database, or some other service.
7) A Visualization Layer
➢ Results of data-processing are useful only if they are integrated in a larger
framework. These results are often plugged into a visualization.
➢ Since the data being analyzed evolves quickly, that visualization has moved
away from the old static report toward more real-time visual interfaces, often
using some web-based technology.
DATABASE TECHNOLOGIES
The Use of a Batch-Processing Component
Data-Stream Management in a Streaming Application

The following are the three levels of interaction between the batch and stream-
processing components:
1) Code Reuse
(a) Often born out of a reference batch implementation, this approach seeks to reemploy as
much of that implementation as possible, so as not to duplicate efforts.
(b) This is an area in which Spark shines, since it is particularly easy to call functions that
transform Resilient Distributed Datasets (RDDs) and DataFrames—they share most of
the same APIs, and only the setup of the data input and output is distinct.
2) Data Reuse
(a) A streaming application feeds itself from a feature or data source prepared, at regular
intervals, from a batch processing job.
(b) For example, some international applications must handle time conversions, and a
frequent pitfall is that daylight saving time (DST) rules change more often than
expected. In this case, it is good to think of this data as a new dependent
source that our streaming application consumes.
DATABASE TECHNOLOGIES
Data-Stream Management
The Use of a Batch-Processing Component in a Streaming Application

The following are the three levels of interaction between the batch and the
stream-processing components:
1) Code Reuse
2) Data Reuse
3) Mixed Processing
(a) The application itself is understood to have both a batch and a
streaming component during its lifetime.
(b) This pattern does happen relatively frequently, out of a will to
manage both the precision of insights provided by an application,
and as a way to deal with the versioning and the evolution of the
application itself.
DATABASE TECHNOLOGIES
Data-Stream Management

Streaming Vs Batch Algorithms


There are two important considerations that we need to take into account when
selecting a general architectural model for our streaming application:
1) Streaming algorithms are sometimes completely different in nature
2) Streaming algorithms can’t be guaranteed to measure well against batch
algorithms

1) Streaming Algorithms Are Sometimes Completely Different in Nature
Consider the following example: We decide to go skiing. We can buy skis for $500 or
rent them for $50. Should we rent or buy?
● Our intuitive strategy is to first rent, to see if we like skiing. But suppose we do: in
this case, we will eventually realize we will have spent more money than we would
have if we had bought the skis in the first place.
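This rent-or-buy dilemma is the classic ski-rental problem: the break-even online strategy (rent until the money spent on rentals would equal the purchase price, then buy) never pays more than about twice what the optimal offline choice would have cost. A small sketch using the prices from the example:

```python
def online_ski_cost(days_skied, buy_price=500, rent_price=50):
    """Break-even online strategy: rent until total rent would reach
    the purchase price, then buy. Competitive ratio is at most ~2."""
    break_even = buy_price // rent_price              # 10 days at these prices
    if days_skied < break_even:
        return days_skied * rent_price                # we only ever rented
    return (break_even - 1) * rent_price + buy_price  # rented, then bought

def offline_ski_cost(days_skied, buy_price=500, rent_price=50):
    """Batch/offline optimum: knows days_skied in advance."""
    return min(days_skied * rent_price, buy_price)

# The online cost never exceeds twice the offline optimum:
for d in (1, 5, 10, 40):
    assert online_ski_cost(d) <= 2 * offline_ski_cost(d)
```

This bounded worst-case ratio between the online strategy and the offline optimum is what is meant by a competitive ratio.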
DATABASE TECHNOLOGIES
Data-Stream Management

1) Streaming Algorithms Are Sometimes Completely Different in Nature (Cont’d…)

NOTE:

Analysis shows that in certain situations the streaming algorithm performs
measurably worse on some inputs.

With the worst input condition, the batch algorithm, which proceeds in hindsight
with strictly more information, is always expected to perform better than a
streaming algorithm.
DATABASE TECHNOLOGIES
Data-Stream Management
2) Streaming Algorithms Can’t Be Guaranteed to Measure Well Against Batch Algorithms

Consider the bin-packing problem:


In the bin-packing problem, an input of a set of objects of different sizes or
different weights must be fitted into a number of bins or containers, each of them
having a set volume or set capacity in terms of weight or size.

The challenge is to find an assignment of objects into bins that minimizes the
number of containers used.

The online algorithm processes the items in arbitrary order and then places each
item in the first bin that can accommodate it;

If no such bin exists, it opens a new bin and puts the item within that new bin.

This greedy approximation algorithm always allows placing the input objects into
a set number of bins that is, at worst, sub-optimal; meaning we might use more
bins than necessary.
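The greedy first-fit rule just described can be sketched in a few lines (the bin capacity and item sizes here are made up for illustration):

```python
def first_fit(items, capacity=10):
    """Online first-fit: place each arriving item into the first bin
    with enough remaining space; open a new bin otherwise."""
    bins = []                                # remaining capacity per bin
    for item in items:
        for i, free in enumerate(bins):
            if item <= free:
                bins[i] = free - item        # item fits an existing bin
                break
        else:
            bins.append(capacity - item)     # no bin fits: open a new one
    return len(bins)

first_fit([7, 5, 3, 4, 6, 2])  # -> 3 bins
```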
DATABASE TECHNOLOGIES
Data-Stream Management
2) Streaming Algorithms Can’t Be Guaranteed to Measure Well Against Batch Algorithms

A better algorithm is the first-fit decreasing strategy:

 It operates by first sorting the items to be inserted in decreasing order of their sizes,
and then inserting each item into the first bin in the list with sufficient remaining space.

 Because it sorts the whole input before packing begins, it requires the complete input
up front, which is exactly what a streaming algorithm does not have.

The larger issue is that there is no guarantee that a streaming algorithm will perform
better than a batch algorithm, because those algorithms must function without
foresight.
Consider the example:
✔ A worker that receives the data as a batch, as if it were all in a storage room from the
beginning, and
✔ another worker receiving the data in a streaming fashion, as if it were on a conveyor belt,
✔ then no matter how clever our streaming worker is, there is always a way to place
items on the conveyor belt in such a pathological way that he will finish his task with
an arbitrarily worse result than the batch worker.
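The contrast can be made concrete: first-fit decreasing gets to sort the whole input first, which only the batch worker can do, while a conveyor-belt ordering chosen against the online worker forces it to open extra bins (capacity and sizes are made up):

```python
def first_fit(items, capacity=10):
    """Online first-fit over the items in arrival order."""
    bins = []                                # remaining capacity per bin
    for item in items:
        for i, free in enumerate(bins):
            if item <= free:
                bins[i] = free - item
                break
        else:
            bins.append(capacity - item)
    return len(bins)

def first_fit_decreasing(items, capacity=10):
    """Batch variant: sort items largest-first, then pack."""
    return first_fit(sorted(items, reverse=True), capacity)

# A pathological arrival order for the online packer:
stream = [3, 3, 3, 7, 7, 7]
first_fit(stream)             # -> 4 bins (small items scattered first)
first_fit_decreasing(stream)  # -> 3 bins (each 7 pairs with a 3)
```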
DATABASE TECHNOLOGIES
Data-Stream Management

2) Streaming Algorithms Can’t Be Guaranteed to Measure Well Against Batch


Algorithms (Cont’d…)
The takeaway message from this discussion is:
a) Streaming systems are indeed “lighter”. Their semantics can express a lot of low-
latency analytics in expressive terms.
b) Streaming APIs invite us to implement analytics using streaming or online
algorithms whose guarantees, as we’ve seen earlier, are sometimes limited.

To Summarize:
A) If there is a known competitive ratio for the streaming algorithm at hand, and the
resulting performance is acceptable, running just the stream processing might be
enough.
B) If there is no known competitive ratio between the implemented stream processing
and a batch version, running a batch computation on a regular basis is a valuable
benchmark to which to hold one’s application.
DATABASE TECHNOLOGIES
Data-Stream Management

References:
● R3 - Gerard Maas, Francois Garillot - Stream Processing with Apache
Spark_Mastering Structured Streaming and Spark Streaming-O'Reilly Media (2019)

● https://www.softkraft.co/streaming-data-architecture/

● https://www.upsolver.com/blog/streaming-data-architecture-key-components
THANK YOU

Raghu B. A.
Department of Computer Science and Engineering
[email protected]
DATABASE
TECHNOLOGIES
Unit - 4: Data-Stream Management

Raghu B. A.
Department of Computer Science and Engineering
DATABASE TECHNOLOGIES
Data-Stream Management

Apache Spark as a Stream-Processing Engine


What is Apache Spark Streaming?
● Spark Streaming is a library provided in Apache Spark for processing live data streams that is
scalable, high-throughput, and fault-tolerant.
● Spark Streaming can ingest data from multiple sources such as Kafka, Flume, Kinesis or
TCP sockets; and process this data using complex algorithms provided in the Spark API
including algorithms provided in the Spark MLlib and GraphX libraries.
● Processed data can be pushed to live dashboards, file systems and databases.
DATABASE TECHNOLOGIES
Data-Stream Management

Apache Spark as a Stream-Processing Engine


Spark Streaming Architecture

● Spark Streaming discretizes streaming data into tiny, sub-second
micro-batches instead of treating it as a single record at a time.
● The Receivers of Spark Streaming accept data concurrently and
buffer it in the memory of Spark workers.
● The latency-optimized Spark engine processes the batches in a
matter of milliseconds and outputs the results to other systems.
● The Spark tasks are dynamically assigned to the workers based on
the data location and available resources.
● This ensures better load distribution and faster fault recovery using
this technique.
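The discretization step can be pictured as chopping a timestamped stream into fixed-interval micro-batches. A toy model (not Spark’s API; the events are made up):

```python
def micro_batches(events, batch_interval=1.0):
    """Group (timestamp, value) events into fixed-interval micro-batches,
    the way Spark Streaming discretizes a live stream (simplified)."""
    batches = {}
    for ts, value in events:
        batch_id = int(ts // batch_interval)   # interval the event falls in
        batches.setdefault(batch_id, []).append(value)
    return [batches[k] for k in sorted(batches)]

events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.7, "d")]
micro_batches(events)  # -> [['a', 'b'], ['c'], ['d']]
```

Each inner list would then be handed to the Spark engine as one small batch job.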
DATABASE TECHNOLOGIES
Data-Stream Management

Apache Spark as a Stream-Processing Engine


Benefits of Discretized Stream Processing:
● Faster recovery from failure
● Unification of batch, streaming and interactive analytics
● Integration of advanced analytics like machine learning and interactive SQL is possible
● Increased performance
○ High throughput
○ Low latency

Apache Spark presents itself as a unified engine, offering developers a consistent environment whenever
they want to develop a batch or a streaming application. The design choices that provide Spark the
stream-processing capabilities are:

● Memory Usage
○ Failure Recovery
○ Lazy Evaluation
○ Cache Hints
● Latency
● Throughput-Oriented Processing
DATABASE TECHNOLOGIES
Data-Stream Management

Apache Spark as a Stream-Processing Engine


Stream-processing APIs
Spark offers two different stream-processing APIs:
● Spark Streaming
● Structured Streaming

Spark Streaming
This is an API and a set of connectors, in which a Spark program is served small batches of data
collected from a stream in the form of microbatches spaced at fixed time intervals, performs a given
computation, and eventually returns a result at every interval.

Structured Streaming
This is an API and a set of connectors, built on the substrate of a SQL query optimizer, Catalyst. It offers
an API based on DataFrames and the notion of continuous queries over an unbounded table that is
constantly updated with fresh records from the stream.
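The “continuous query over an unbounded table” model can be imitated without Spark: rows keep arriving, and the standing query result is incrementally refreshed after each arrival (a sketch; the count-per-key query is a made-up example):

```python
class UnboundedTable:
    """Toy model of Structured Streaming's input table: the table only
    ever grows, and a standing query (here, a count per key) is
    re-evaluated incrementally as fresh records arrive."""
    def __init__(self):
        self.counts = {}                       # running query result

    def append(self, rows):
        for key in rows:                       # fresh records from the stream
            self.counts[key] = self.counts.get(key, 0) + 1
        return dict(self.counts)               # updated result table

t = UnboundedTable()
t.append(["a", "b", "a"])   # -> {'a': 2, 'b': 1}
t.append(["b"])             # -> {'a': 2, 'b': 2}
```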
DATABASE TECHNOLOGIES
Data-Stream Management

Apache Spark as a Stream Processing Engine


Memory Usage
● Spark offers in-memory storage of slices of a dataset.
● This must be initially loaded from a data source.
○ The data source can be a distributed file system or a storage medium.
● This technique is analogous to the operation of caching data.

1. Failure Recovery
● Spark knows exactly which data source was used to ingest the data and also knows all
the operations that were performed on it. So it can reconstitute the segment of lost data
that was on a crashed executor, from scratch if needed.
● So, Spark offers a replication mechanism, similar to distributed file systems.
● Since memory is a limited commodity, Spark makes the cache short lived.
DATABASE TECHNOLOGIES
Data-Stream Management

Apache Spark as a Stream Processing Engine


Memory Usage (Cont’d…)
2. Lazy Evaluation
● If a program consists of a series of linear operations, with the previous one feeding into the
next, the intermediate results disappear right after the next step has consumed its input.
3. Cache Hints

● If several operations are to be done on a single intermediate result, Spark lets users specify
that an intermediate value is important and how its contents should be safeguarded for later.
● This avoids multiple computations of the same result.
● In case of high temporary loads, Spark offers the opportunity to spill the cache to secondary
storage to preserve the functional aspects of a data process.
DATABASE TECHNOLOGIES
Data-Stream Management

Apache Spark as a Stream Processing Engine

The alongside figure represents the


data flow of operations on cached
values.
DATABASE TECHNOLOGIES
Data-Stream Management

Apache Spark as a Stream Processing Engine


Latency
● Although microbatching is still the dominating internal execution mode of stream processing in Apache Spark, it
has its drawbacks.
○ Any microbatch delays the processing of any particular element of a batch by at least the time of the batch interval.
○ Microbatches create a baseline latency.
○ For many applications, a latency in the space of a few minutes is sufficient; for example:
■ Having a dashboard that refreshes you on key performance indicators of your website over the last few minutes
■ Extracting the most recent trending topics in a social network
■ Computing the energy consumption trends of a group of households
■ Introducing new media in a recommendation system
● Spark is an equal-opportunity processor and delays all data elements for at most one batch before acting on them.
● On average, this setup provides fast processing.
● If response time is essential for specific elements, alternative stream processors like Apache Flink or Apache
Storm might be a better fit. But if you’re just interested in fast processing on average, such as when monitoring a
system, Spark makes an interesting proposition.
DATABASE TECHNOLOGIES
Data-Stream Management

Apache Spark as a Stream Processing Engine


Throughput-Oriented Processing
● The Spark core engine is optimized for distributed batch processing.
● Its application ensures that large amounts of data can be processed per unit of time.
● Spark transfers the overhead of distributed task scheduling by having many elements to process
at once.
● It utilizes in-memory techniques, query optimizations, caching, and even code generation to
speed up the transformational process of a dataset.
● When using Spark in an end-to-end application, an important constraint is that downstream
systems receiving the processed data must also be able to accept the full output provided by the
streaming process. Otherwise, we risk creating application bottlenecks that might cause
cascading failures when faced with sudden load peaks.
DATABASE TECHNOLOGIES
Data-Stream Management

Apache Spark as a Stream Processing Engine


Spark’s Polyglot API
● Spark, far from being just a library for distributing computation, has become a polyglot
framework that the user can interface with using Scala, Java, Python, or the R language.
● The development language is Scala, but this versatile interface has let programmers of various
levels and backgrounds implement their own data analytics needs.

Fast Implementation of Data Analysis


● Spark’s advantage in developing a streaming data analytics pipeline is that it offers a concise,
high-level API in Scala and compatible APIs in Java and Python
● It also offers the simple model of Spark as a practical shortcut throughout the development
process.
● Component reuse with Spark is a valuable asset, as is access to the Java ecosystem of
libraries for machine learning and many other fields.
● It lets you quickly prototype your streaming data pipeline solution, getting first results quickly
enough to choose the right components at every step of the pipeline development.
● Finally, stream processing with Spark lets you benefit from its model of fault tolerance, with
the confidence that faulty machines are not going to affect the streaming application.
DATABASE TECHNOLOGIES
Data-Stream Management

Apache Spark as a Stream Processing Engine


To summarise,

Spark is a framework that, while making trade-offs in latency, optimizes for


building a data analytics pipeline with agility: fast prototyping in a rich environment
and stable runtime performance under adverse conditions are problems it
recognizes and tackles head-on, offering users significant advantages.
DATABASE TECHNOLOGIES
Data-Stream Management

References:

● R3 - Gerard Maas, Francois Garillot - Stream Processing with Apache


Spark_Mastering Structured Streaming and Spark Streaming-O'Reilly Media (2019)
DATABASE TECHNOLOGIES
Unit - 4: Data-Stream Management

Apache Kafka

Department of Computer Science and Engineering


Spark Streaming-Kafka
Obvious reasons to learn Kafka
Spark Streaming-Kafka
Why Kafka?- Evolution from Developer’s View
Spark Streaming-Kafka
Demands of this era’s applications

• Stream processing requires processing of events
• Think of an event as something that happens at a point in time
• For example
  • “Student with ID 23489 entered building”
Spark Streaming-Kafka
What is an Event Streaming Platform?

• Allows applications to produce and consume streams of events
• Stores streams durably
• Processes streams
“ Apache Kafka”
Spark Streaming-Kafka
Is Kafka an enterprise messaging system?
Spark Streaming
Why Kafka?

• Can we have an intermediary that connects
  • Different sources
  • Multiple backends
• Decouple the pipeline so that producers and consumers do not
  need to know about each other
Spark Streaming-Kafka
Kafka Terminologies
Spark Streaming-Kafka
Partitions
Spark Streaming-Kafka
Topic based routing

• Publishers send messages with topic labels

• Subscribers subscribe to topics


• And will receive all messages on that topic

• Example
• Subscribe to all fire sensors in b block
Spark Streaming-Kafka
Content based routing

• Subscribers define matching criteria

• And will receive all messages that match the criteria

• Example
• Subscribe to all fire sensors in b block
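Both routing modes can be sketched with a tiny in-memory broker (all names are hypothetical; this illustrates the two matching rules, not Kafka’s API, and Kafka itself routes by topic):

```python
class Broker:
    """Minimal pub/sub sketch: topic subscribers receive every message
    published on their topic; content subscribers receive every message
    matching their predicate, regardless of topic."""
    def __init__(self):
        self.topic_subs = {}      # topic -> list of subscriber inboxes
        self.content_subs = []    # (predicate, inbox) pairs

    def subscribe_topic(self, topic):
        inbox = []
        self.topic_subs.setdefault(topic, []).append(inbox)
        return inbox

    def subscribe_content(self, predicate):
        inbox = []
        self.content_subs.append((predicate, inbox))
        return inbox

    def publish(self, topic, message):
        for inbox in self.topic_subs.get(topic, []):
            inbox.append(message)              # topic-based delivery
        for predicate, inbox in self.content_subs:
            if predicate(message):
                inbox.append(message)          # content-based delivery

b = Broker()
by_topic = b.subscribe_topic("fire/b-block")
by_content = b.subscribe_content(lambda m: m["temp"] > 50)
b.publish("fire/b-block", {"sensor": 7, "temp": 62})
# both inboxes now hold the message; the publisher never saw the subscribers
```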
Spark Streaming-Kafka
Communication advantages/disadvantages

Pros Cons

• No hard-wired connections • Design and maintenance of


between publishers and topics
subscribers • Performance overhead due
• Flexible: Easy to add and to communication
remove publishers or infrastructure
subscribers
DATABASE TECHNOLOGIES
Unit - 4: Data-Stream Management

Apache Kafka Architecture

Department of Computer Science and Engineering


Spark Streaming-Kafka
Let’s recall

1. Do you know the process diagram of Kafka with components?


Spark Streaming-Kafka
Let’s Recall…

2. What is Zookeeper in Kafka? Can we use kafka without zookeeper?


1) ZooKeeper is an open-source system which helps in managing the cluster
environment of Kafka.
2) It is an integral part of Kafka architecture.
3) One can not use Kafka without ZooKeeper.
4) The client servicing becomes unavailable once ZooKeeper is down.
5) ZooKeeper facilitates communication between different nodes of the cluster.
6) In a cluster environment, only one node works as a leader while others are
followers only. It is a ZooKeeper responsibility to choose a leader in a cluster node.
7) Whenever any server added or removed in a cluster, topic added or removed,
ZooKeeper sends that information to each node.
8) Choosing leadership node, better synchronization between different nodes,
configuration management are the key roles of ZooKeeper.
Spark Streaming-Kafka
Let’s recall…

3. Why do we need Kafka rather than other messaging service?


Consider a modern source of data: transactional data such as orders, inventory, and
shopping carts is being augmented with things such as clicks, likes, recommendations,
and searches on a web page.
All this data is deeply important for analyzing consumer behavior, and it can
feed a set of predictive analytics engines that can be the differentiator for
companies.
• Support low latency message delivery.
• Handling real time traffic.
• Assurance for fault tolerance.
• Easy to integrate with Spark applications to process high volume of messaging data.
• Ability to create a cluster of messaging containers which monitored and supervised by a
coordination server like Zookeeper.
So, when we need to handle this kind of volume of data, we need Kafka to solve
this problem.
Spark Streaming-Kafka
Let’s recall….

4. What is Kafka Cluster? What is the key benefit of creating Kafka cluster?

• A Kafka cluster is a group of more than one broker.
• A Kafka cluster has zero downtime when we expand the cluster.
• The cluster manages the persistence and replication of message data.
• The cluster offers strong durability due to its cluster-centric design.
• In the Kafka cluster, one of the brokers serves as the controller, which is
responsible for managing the states of partitions and replicas and for
performing administrative tasks like reassigning partitions.
Spark Streaming-Kafka
Let’s recall….

5. What is producer and consumer in Kafka?

● A producer is a client that sends or publishes records. Producer applications write data
to topics and consumer applications read from topics.

● A consumer is a subscriber that consumes the messages stored in a
partition. A consumer is a separate process and can be a separate application
altogether, running on an individual machine.
Spark Streaming-Kafka
Topics in Kafka

• Particular Stream of data


• Similar to table in DB
• Identified by Name
Spark Streaming-Kafka
KAFKA

Which Messages get Delivered to Each Subscriber?

• Topic based system

• Content-based system
Spark Streaming-Kafka
Topic based Routing

• Publishers send messages with topic labels

• Subscribers subscribe to topics


• And will receive all messages on that topic

• Example
• Subscribe to all fire sensors in BE block
Spark Streaming-Kafka
Content based routing

• Subscribers define matching criteria

• And will receive all messages that match the criteria

• Example
• Subscribe to all fire sensors in b block
Spark Streaming-Kafka
Communication advantages/disadvantages

Pros Cons

• No hard-wired connections • Design and maintenance of


between publishers and topics
subscribers • Performance overhead due
• Flexible: Easy to add and to communication
remove publishers or infrastructure
subscribers
Spark Streaming-Kafka
Exercise 2 (5 minutes)

• Consider a bookstore portal with various activities such as
• Login
• List books
• Get book details
• Buy book
• Check status of order
• Return book
• Logout
• Assume we have 3 backend modules
• Security
• Order processing
• Book information
• Would you use a topic-based or content-based system? What would be the
topics / content?
Spark Streaming-Kafka
Exercise 2 (Solution)

• Would you use a topic-based or content-based system? What would be the
topics / content?
• Probably topic-based, since each
message type can be a topic

APACHE STORM

A scalable distributed & fault tolerant real time


computation system
( Free & Open Source )

Shyam Rajendran 17-Feb-15


Agenda
•  History & the whys
•  Concept & Architecture
•  Features
•  Demo!
History
•  Before the Storm: real-time data was analyzed with hand-built networks of
queues and workers (figure: “Analyzing Real Time Data (old)”)

http://www.slideshare.net/nathanmarz/storm-distributed-and-
faulttolerant-realtime-computation
History
•  Problems?
•  Cumbersome to build applications (manual
+ tedious + serialize/deserialize message)
•  Brittle ( No fault tolerance )
•  Pain to scale - same application logic
spread across many workers, deployed
separately

•  Hadoop ?
•  For parallel batch processing : No Hacks for realtime
•  Map/Reduce is built to leverage data localization on HDFS to
distribute computational jobs.
•  Works on big data.

Why not as one self-contained application?


http://nathanmarz.com.
Enter the Storm!

•  BackType ( Acquired by Twitter )


Nathan Marz* + Clojure

•  Storm !
–  Stream process data in realtime with no
latency!
–  Generates big data!
Features
•  Simple programming model
•  Topology - Spouts – Bolts

•  Programming language agnostic


•  ( Clojure, Java, Ruby, Python default )
•  Fault-tolerant
•  Horizontally scalable
•  Ex: 1,000,000 messages per second on
a 10 node cluster
•  Guaranteed message processing
•  Fast : Uses zeromq message queue
•  Local Mode : Easy unit testing
Concepts – Streams and Spouts
•  Stream
•  Unbounded sequence of tuples ( storm data model )
•  <key, value(s)> pair ex. <“UIUC”, 5>

•  Spouts
•  Source of streams : Twitterhose API
•  Stream of tweets or some crawler
Concept - Bolts
•  Bolts
•  Process one or more input streams and produce new streams

•  Functions
•  Filter, Join, Apply/Transform etc
•  Parallelize to make it fast! – multiple processes constitute a
bolt
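The spout-to-bolt chaining described above can be mimicked with plain Python generators (a toy model of the dataflow, not Storm’s API; the word-count pipeline is a made-up example):

```python
def tweet_spout(tweets):
    """Spout: the source of a stream of tuples."""
    for tweet in tweets:
        yield tweet

def split_bolt(stream):
    """Bolt: transforms one stream into another (tweets -> words)."""
    for tweet in stream:
        yield from tweet.split()

def count_bolt(stream):
    """Bolt: aggregates the word stream into counts."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

count_bolt(split_bolt(tweet_spout(["storm is fast", "storm scales"])))
# -> {'storm': 2, 'is': 1, 'fast': 1, 'scales': 1}
```

In Storm, each of these stages would execute as many parallel tasks across the cluster rather than as a single in-process generator.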
Concepts – Topology & Grouping
•  Topology
•  Graph of computation – can
have cycles
•  Network of Spouts and Bolts
•  Spouts and bolts execute as
many tasks across the cluster

•  Grouping
•  How to send tuples between the
components / tasks?
Concepts – Grouping
•  Shuffle Grouping
•  Distribute streams “randomly” to
bolt’s tasks

•  Fields Grouping
•  Group a stream by a subset of its fields

•  All Grouping
•  All tasks of bolt receive all input tuples
•  Useful for joins

•  Global Grouping
•  Pick the task with the lowest id
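Fields grouping is typically realized by hashing the grouped field modulo the number of tasks, so tuples sharing a key always reach the same task, while shuffle grouping spreads tuples at random. A sketch (not Storm’s API):

```python
import random

def fields_grouping(tuples, key_index, n_tasks):
    """Tuples with the same value in the grouped field always land
    on the same task (hash of the field modulo the task count)."""
    assignment = {}
    for t in tuples:
        task = hash(t[key_index]) % n_tasks
        assignment.setdefault(task, []).append(t)
    return assignment

def shuffle_grouping(tuples, n_tasks):
    """Tuples are distributed across tasks at random."""
    assignment = {}
    for t in tuples:
        assignment.setdefault(random.randrange(n_tasks), []).append(t)
    return assignment

tuples = [("UIUC", 5), ("MIT", 3), ("UIUC", 2)]
grouped = fields_grouping(tuples, key_index=0, n_tasks=4)
# both ("UIUC", ...) tuples are guaranteed to sit in the same task's list
```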
Zookeeper
•  Open source server for highly reliable distributed
coordination.
•  As a replicated synchronization service with eventual
consistency.
•  Features
•  Robust
•  Persistent data replicated across multiple nodes
•  Master node for writes
•  Concurrent reads
•  Comprises a tree of znodes: entities roughly representing file
system nodes.
•  Use only for saving small configuration data.
Cluster
Features
•  Simple programming model
•  Topology - Spouts – Bolts

•  Programming language agnostic


•  ( Clojure, Java, Ruby, Python default )

•  Guaranteed message processing


•  Fault-tolerant
•  Horizontally scalable
•  Ex: 1,000,000 messages per second on a 10 node cluster
•  Fast : Uses zeromq message queue
•  Local Mode : Easy unit testing
Guaranteed Message Processing
•  When is a message “fully processed”?
•  A spout tuple is “fully processed” when the tuple tree has been exhausted
and every message in the tree has been processed.
•  A tuple is considered failed when its tree of messages fails to be fully
processed within a specified timeout.

•  Storms’s reliability API ?


•  Tell storm whenever you create a new link in the tree of tuples
•  Tell storm when you have finished processing individual tuple
Fault Tolerance APIs
•  Emit(tuple, output)
•  Emits an output tuple, perhaps anchored on an input tuple (first
argument)
•  Ack(tuple)
•  Acknowledge that you (bolt) finished processing a tuple
•  Fail(tuple)
•  Immediately fail the spout tuple at the root of tuple topology if
there is an exception from the database, etc.
•  Must remember to ack/fail each tuple
•  Each tuple consumes memory. Failure to do so results in
memory leaks.
Fault-tolerant
•  Anchoring
•  Specify link in the tuple tree.
( anchor an output to one or
more input tuples.)
•  At the time of emitting new
tuple
•  Replay one or more tuples.

•  Every individual tuple must be acked; if not, the task will run out of memory!
•  Filter bolts ack at the end of execution.
•  Join/aggregation bolts use multi-ack.

How? "acker" tasks
•  Track the DAG of tuples for every spout.
•  Every tuple (spout/bolt) is given a random 64-bit id.
•  Every tuple knows the ids of all spout tuples for which it exists.
What’s the catch?
Failure Handling
•  A tuple isn't acked because the task died:
•  Spout tuple ids at the root of the trees for the failed tuple will time
out and be replayed.
•  Acker task dies:
•  All the spout tuples the acker was tracking will time out and be
replayed.
•  Spout task dies:
•  The source that the spout talks to is responsible for replaying the
messages.
•  For example, queues like Kestrel and RabbitMQ will place all pending
messages back on the queue when a client disconnects.
Storm Genius
•  Major breakthrough : Tracking algorithm
•  Storm uses mod hashing to map a spout tuple id to an
acker task.
•  Acker task:
•  Stores a map from a spout tuple id to a pair of values.
•  Task id that created the spout tuple
•  Second value is 64bit number : Ack Val
•  XOR all tuple ids that have been created/acked in the tree.
•  Tuple tree completed when Ack Val = 0

•  Configuring Reliability
•  Config.TOPOLOGY_ACKERS to 0.
•  you can emit them as unanchored tuples
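The ack-val bookkeeping can be sketched directly: XOR each 64-bit tuple id into the running value once when the tuple is created and once when it is acked; since x ^ x = 0, the value returns to zero exactly when every id has appeared twice:

```python
import random

class Acker:
    """Sketch of Storm's ack-val tracking for one spout tuple tree."""
    def __init__(self):
        self.ack_val = 0

    def created(self, tuple_id):
        self.ack_val ^= tuple_id      # new link added to the tuple tree

    def acked(self, tuple_id):
        self.ack_val ^= tuple_id      # tuple finished processing

    def tree_complete(self):
        return self.ack_val == 0

acker = Acker()
ids = [random.getrandbits(64) for _ in range(3)]
for i in ids:
    acker.created(i)
for i in ids:
    acker.acked(i)
acker.tree_complete()  # -> True: every id was XORed exactly twice
```

A false “complete” would require an unmatched set of ids to XOR to zero, which at 64 bits is vanishingly unlikely.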
Exactly Once Semantics ?
•  Trident
•  High level abstraction for realtime computing on top of storm
•  Stateful stream processing with low-latency distributed querying
•  Provides exactly-once semantics ( avoid over counting )

How can we do it?
Store the transaction id with the count
in the database as an atomic value

https://storm.apache.org/documentation/Trident-state
Exactly Once Mechanism
Let’s take a scenario

•  Count aggregation of your stream


•  Store running count in database. Increment count after processing tuple.
•  Failure!

Design

•  Tuples are processed as small batches.


•  Each batch of tuples is given a unique id called the "transaction id" (txid).
•  If the batch is replayed, it is given the exact same txid.
•  State updates are ordered among batches.
Exactly Once Mechanism (contd.)
Design
•  Processing txid = 3, batch = ["man"], ["man"], ["dog"]
•  Database state before the batch:
man => [count=3, txid=1]
dog => [count=4, txid=3]
apple => [count=10, txid=2]
•  If the stored txid and the batch txid are the same: SKIP
(strong ordering guarantees the batch was already applied)
•  If they're different, you increment the count and store the new txid:
man => [count=5, txid=3]
dog => [count=4, txid=3]
apple => [count=10, txid=2]

https://storm.apache.org/documentation/Trident-state
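The skip-on-same-txid rule can be sketched as follows (a toy model of the transactional-state idea, not Trident’s API; the batch contents are hypothetical):

```python
class TxState:
    """Counts stored atomically with the txid of the batch that last
    updated them, so a replayed batch (same txid) is never re-applied."""
    def __init__(self):
        self.db = {}    # key -> (count, txid), stored as one atomic value

    def apply_batch(self, txid, increments):
        for key, n in increments.items():
            count, last_txid = self.db.get(key, (0, None))
            if last_txid == txid:
                continue                      # replay detected: skip
            self.db[key] = (count + n, txid)  # apply and record the txid

s = TxState()
s.apply_batch(3, {"man": 2, "dog": 1})
s.apply_batch(3, {"man": 2, "dog": 1})  # the whole batch is replayed
s.db["man"]  # -> (2, 3): the count was not doubled
```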
Improvements and Future Work
•  Lax security policies
•  Performance and scalability improvements
•  Presently, with just 20 nodes, SLAs that require processing more
than a million records per second are achieved.
•  High Availability (HA) Nimbus
•  Though Nimbus is presently not a single point of failure, its loss does degrade
functionality.
•  Enhanced tooling and language support
DEMO
Topology

Tweet Spout → Parse Tweet Bolt → Count Bolt →
Intermediate Ranker Bolt → Total Ranker Bolt → Report Bolt
Downloads
•  Download the binaries, install, and configure ZooKeeper.
•  Download the code, build, and install zeromq and jzmq.
•  Download the binaries, install, and configure Storm.
References
•  https://storm.apache.org/
•  http://www.slideshare.net/nathanmarz/storm-distributed-and-
faulttolerant-realtime-computation
•  http://hortonworks.com/blog/the-future-of-apache-storm/
•  http://zeromq.org/intro:read-the-manual
•  http://www.thecloudavenue.com/2013/11/
InstallingAndConfiguringStormOnUbuntu.html
•  https://storm.apache.org/documentation/Setting-up-a-Storm-
cluster.html
DATABASE TECHNOLOGIES

UE19CS344

Prof. J. Ruby Dinakar


Department of Computer Science and Engineering
DATABASE TECHNOLOGIES

Unit - 4: Data-Stream Management


Apache Flink

Prof. J. Ruby Dinakar


Department of Computer Science and Engineering
DATABASE TECHNOLOGIES
Apache Flink

 Apache Flink is a framework and distributed processing engine for stateful computations
over unbounded and bounded data streams.
 Flink has been designed to run in all common cluster environments, perform computations
at in-memory speed and at any scale.
 Apache Flink is a big data stream processing and batch processing framework that is
developed by the Apache Software Foundation.
 Flink can be easily deployed with Hadoop as well as other frameworks and provides better
communication, data distribution over a stream of data.
 The Apache Flink engine is developed in Java and Scala and provides high throughput,
event management, and low latency.
Database Technologies
Features of Apache Flink

High performance
Flink is designed to achieve high performance and low latency. Flink's pipelined data processing
gives better performance.
Exactly-once stateful computation
Flink's distributed checkpoint processing helps to guarantee processing each record exactly once.
Flexible streaming windows
Flink supports data-driven windows. This means we can design a window based on time, counts, or
sessions. A window can also be customized which allows us to detect specific patterns in event
streams.
Fault tolerance
Flink's distributed, lightweight snapshot mechanism helps in achieving a great degree of fault
tolerance. It allows Flink to provide high-throughput performance with guaranteed delivery.
Database Technologies
Features of Apache Flink

Memory management
It efficiently does memory management by using hashing, indexing, caching, and sorting.
Optimizer
Flink's batch data processing API is optimized in order to avoid memory-consuming operations such
as shuffle, sort, and so on. It also makes sure that caching is used in order to avoid heavy disk IO
operations.
Stream and batch in one platform
Flink provides APIs for both batch and stream data processing.
Libraries
Flink has a rich set of libraries to do machine learning, graph processing, relational data processing,
and so on. Because of its architecture, it is very easy to perform complex event processing and alerting.
Event time semantics
Flink supports event time semantics. This helps in processing streams where events arrive out of
order. Sometimes events may come delayed.
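Event-time semantics can be illustrated with a toy tumbling-window assignment keyed on each event’s own timestamp, so out-of-order arrival does not change which window an event belongs to (a sketch, not Flink’s API):

```python
def event_time_windows(events, window_size=10):
    """Assign (event_time, value) pairs to tumbling windows by the
    event's own timestamp; arrival order is irrelevant."""
    windows = {}
    for event_time, value in events:
        start = (event_time // window_size) * window_size
        windows.setdefault(start, []).append(value)
    return {w: sorted(vals) for w, vals in windows.items()}

# Events arrive out of order, yet land in the right windows:
events = [(12, "b"), (3, "a"), (18, "c"), (7, "d")]
event_time_windows(events)  # window 0 holds 'a','d'; window 10 holds 'b','c'
```

A real engine additionally needs watermarks to decide when a window can be closed despite possibly late events.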
Database Technologies
Difference between Apache Flink and Apache Spark
Apache Flink uses Kappa-architecture, the architecture which uses the only stream (of data) for processing,
whereas,
Hadoop and Spark use Lambda architecture, which uses batches (of data) and micro-batches (of streamed
data) for processing.

Apache Flink                                   | Apache Spark
-----------------------------------------------|-----------------------------------------------
Offers native streaming                        | Uses micro-batches
Processes each event in real time              | Processes events in near real time
Offers a wide range of windowing techniques    | Offers basic windowing strategies
Has Gelly for graph processing                 | Has GraphX for graph processing
Provides an optimizer that optimizes jobs      | Jobs need to be optimized manually by
before execution on the streaming engine       | developers
Basic concepts

Stream & Transformation:


A Flink program is composed of two basic building blocks:
stream and transformation.
A stream is an intermediate result, while a transformation is an operation.
A transformation computes and processes one or more input streams and outputs one or more result streams.
When a Flink program is executed, it is mapped to a streaming dataflow.
A streaming dataflow is composed of a group of streams and transformation operators.
It is similar to a DAG diagram.
It starts from one or more source operators and ends with one or more sink operators.
Apache Flink
Fig. schematic diagram of mapping from a Flink program to a streaming dataflow:
FlinkKafkaConsumer is a source operator; map, keyBy, timeWindow, and apply are transformation operators; and RollingSink is a sink operator.
Parallel Dataflow
In Flink, programs are inherently parallel and distributed:
A stream can be divided into multiple stream partitions, and an operator can be divided into multiple operator subtasks.
Each operator subtask is executed independently in different threads.
The parallelism of an operator is equal to the number of operator subtasks.
The parallelism of a stream is always equal to the parallelism of the operator that generates it.
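As an illustration only (plain Python, not the Flink API), routing records to operator subtasks by key can be sketched as follows; the hash function and the (key, value) record shape are assumptions made for the example:

```python
import hashlib
from collections import defaultdict

def route_by_key(records, parallelism):
    """Assign each (key, value) record to one of `parallelism` operator
    subtasks by hashing its key, so all records with the same key are
    processed by the same subtask (the idea behind keyBy)."""
    subtasks = defaultdict(list)
    for key, value in records:
        # MD5 gives a deterministic hash across runs, unlike Python's hash()
        index = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % parallelism
        subtasks[index].append((key, value))
    return dict(subtasks)

records = [("sensor-1", 10), ("sensor-2", 7), ("sensor-1", 12)]
parts = route_by_key(records, parallelism=2)
# all "sensor-1" records land in the same subtask
```

Because the subtask index is a pure function of the key, each subtask can maintain per-key state independently of the others.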
Apache Flink – Layered Architecture
Flink architecture consists of various components such as deploy, core processing, and APIs. Flink has a layered architecture where each component is part of a specific layer. Each layer is built on top of the others for clear abstraction.
Flink is designed to run on local machines, in a YARN cluster, or on the cloud.
Runtime is Flink's core data processing engine that receives the program through APIs in the
form of JobGraph.
It is a simple parallel data flow with a set of tasks that produce and consume data streams.
The DataStream and DataSet APIs are the interfaces programmers can use for defining the
Job.
JobGraphs are generated by these APIs when the programs are compiled.
Once compiled, the DataSet API allows the optimizer to generate the optimal execution plan, while the DataStream API uses a stream builder for efficient execution plans.
The optimized JobGraph is then submitted to the executors according to the deployment
model.
You can choose a local, remote, or YARN mode of deployment. If you have a Hadoop cluster
already running, it is always better to use a YARN mode of deployment.
Distributed execution
Apache Flink is a distributed stream processing system that works in a Master-Slave fashion.
The Master serves as the manager node, whereas the Slaves serve as worker nodes.
The Master receives requests from the client and assigns them to Slave nodes for further processing.
The Slave nodes work on the assigned tasks and report back to the Master node.

Apache Flink has the following two daemons running on the Master and Slave nodes.

1. Master Node (Job Manager)

The Job Manager daemon runs on the Master node of the cluster and works as a coordinator in the Flink system. It receives the program code from the client system and assigns tasks to the Slave nodes for further processing.

2. Slave Nodes (Task Manager)

The Task Manager daemons run on the Slave nodes of the cluster and perform the actual operations in the Flink system. They receive commands from the Job Manager and perform the required actions.
Apache Flink Execution Flow
Apache Flink executes an application program in the following steps.

1. Program
It is the application program that the client system will submit for execution.

2. Parse and Optimize
In this phase, the code is parsed to check for syntax errors, type extraction is performed, and optimization is done.

3. DataFlow Graph
In this phase, the application job is converted into a dataflow graph for further execution.

4. Job Manager
In this phase, the Job Manager daemon of Apache Flink schedules the tasks and sends them to the Task Managers for execution. The Job Manager also monitors the intermediate results.

5. Task Manager
In this phase, the Task Managers perform the execution of the tasks assigned by the Job Manager.
Windowing

An infinite DataStream is divided into finite slices called windows based on the timestamps of
elements or other criteria.

This slicing helps in processing data in chunks by applying transformations.

To do windowing on a stream, we need to assign a key on which the distribution can be made, and a function which describes what transformations to perform on the windowed stream.

To slice streams into windows, we can use pre-implemented Flink window assigners such as tumbling windows, sliding windows, global windows, and session windows.
Window Assigners
Global windows
Global windows are never-ending windows unless terminated by a trigger. Each element is assigned to one single per-key global window. If we don't specify a custom trigger, no computation will ever get triggered.
Tumbling windows
Tumbling windows are created based on certain times. They are fixed-length, non-overlapping windows. They are useful when you need to compute over the elements in a specific time span. For example, a tumbling window of 10 minutes can be used to compute over the group of events occurring in each 10-minute interval.
Sliding windows
Sliding windows are like tumbling windows, but they overlap. They are fixed-length windows overlapping the previous ones by a user-given window-slide parameter. This type of windowing is useful when you want to compute something out of a group of events occurring in a certain time frame.
Session windows
Session windows are useful when window boundaries need to be decided based on the input data. Session windows allow flexibility in window start time and window size. We can also provide a session-gap configuration parameter, which indicates how long to wait before considering the session closed.
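As a rough sketch (plain Python, not Flink's window assigner API), tumbling and sliding window assignment amounts to computing which window(s) a timestamp falls into:

```python
def tumbling_windows(ts, size):
    """A timestamp belongs to exactly one tumbling window of the given size."""
    start = (ts // size) * size
    return [(start, start + size)]

def sliding_windows(ts, size, slide):
    """A timestamp belongs to every window of length `size` that starts at a
    multiple of `slide` and still covers it (windows overlap when slide < size)."""
    windows = []
    start = (ts // slide) * slide
    while start > ts - size:
        windows.append((start, start + size))
        start -= slide
    return windows

# an event at t=7 with 10-unit tumbling windows falls in [0, 10) only
print(tumbling_windows(7, 10))       # [(0, 10)]
# with size=10, slide=5, the same event falls in [5, 15) and [0, 10)
print(sliding_windows(7, 10, 5))     # [(5, 15), (0, 10)]
```

A session-window assigner, by contrast, cannot precompute boundaries this way: it must extend an open window while events keep arriving within the session gap.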
Window Assigners

Fig. tumbling, sliding, and session windows.
Physical partitioning

Flink allows us to perform physical partitioning of the stream data. The different types of partitioning are custom partitioning, random partitioning, rebalancing partitioning, and rescaling.
Custom partitioning
We can provide a custom implementation of a partitioner.
Random partitioning
Random partitioning randomly partitions data streams in an even manner.
Rebalancing partitioning
It uses a round-robin method to distribute the data evenly.
Rescaling
Rescaling is used to distribute the data across operations, perform transformations on subsets of data, and combine them together. This rescaling happens over a single node only, hence it does not require any data transfer across networks.
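The round-robin idea behind the rebalancing strategy can be sketched in a few lines of plain Python (illustrative only, not the Flink API):

```python
from itertools import cycle

def rebalance(records, parallelism):
    """Distribute records across downstream channels round-robin, so channel
    sizes differ by at most one (the idea behind rebalancing partitioning)."""
    channels = [[] for _ in range(parallelism)]
    for channel, record in zip(cycle(channels), records):
        channel.append(record)
    return channels

print(rebalance(list(range(7)), 3))   # [[0, 3, 6], [1, 4], [2, 5]]
```

Note that, unlike keyed partitioning, this gives an even load but no guarantee that related records end up on the same channel.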
Event time and watermarks
Flink supports different concepts of time for its streaming API.
Event time
The time at which an event occurred on its producing device. For example, in an IoT project, the time at which a sensor captures a reading. Generally these event times need to be embedded in the record before it enters Flink. At processing time, these timestamps are extracted and considered for windowing. Event-time processing can handle out-of-order events.
Processing time
Processing time is the time on the machine executing the stream processing. Processing-time windowing considers only the timestamp at which the event is being processed. Processing time is the simplest mode of stream processing, as it does not require any synchronization between processing machines and producing machines. In a distributed, asynchronous environment, processing time does not provide determinism, as it depends on the speed at which records flow through the system.
Ingestion time
This is the time at which a particular event enters Flink. All time-based operations refer to this timestamp. Ingestion time is a more expensive operation than processing time, but it gives more predictable results. Ingestion-time programs cannot handle any out-of-order events, as the timestamp is assigned only after the event has entered the Flink system.
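To make the distinction concrete, here is a small plain-Python sketch (illustrative names, not the Flink API, and a deliberately simplified watermark rule) that groups out-of-order events into event-time tumbling windows using their embedded timestamps, emitting a window once the watermark passes its end:

```python
from collections import defaultdict

def event_time_groups(events, size, allowed_lateness=0):
    """Group (event_time, value) pairs into tumbling event-time windows.
    A simple watermark = max_seen_time - allowed_lateness decides when a
    window [start, start + size) is complete and can be emitted."""
    windows = defaultdict(list)
    watermark = float("-inf")
    emitted = {}
    for ts, value in events:
        windows[(ts // size) * size].append(value)
        watermark = max(watermark, ts - allowed_lateness)
        for start in list(windows):
            if start + size <= watermark:      # window fully below watermark
                emitted[start] = windows.pop(start)
    for start, values in windows.items():      # flush at end of stream
        emitted[start] = values
    return emitted

# events arrive out of order; they are still grouped by their event time
events = [(1, "a"), (3, "b"), (2, "c"), (12, "d"), (11, "e")]
print(event_time_groups(events, size=10))  # {0: ['a', 'b', 'c'], 10: ['d', 'e']}
```

A processing-time version would ignore the embedded timestamps entirely and group records by arrival time, which is cheaper but non-deterministic under replay.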
Fault tolerance
Flink's fault tolerance is based on snapshots of the stream: when stream processing fails, the data stream processing can be resumed from a snapshot. To understand Flink's fault tolerance mechanism, we first need to understand the concept of a barrier.

A stream barrier is the core element of Flink's distributed snapshots. It is treated as a record in the data flow: inserted into the data stream, it groups the records in the data flow and is pushed forward along the direction of data flow.

Each barrier carries a snapshot ID, and records belonging to that snapshot are pushed ahead of the barrier. Because barriers are very lightweight, they do not interrupt the data flow.

Fig. data flow with barrier


Snapshot

The records that appear before a barrier belong to the corresponding snapshot of that barrier, and the records that appear after the barrier belong to the next snapshot.
Multiple barriers from different snapshots may appear in the data stream at the same time, i.e., multiple snapshots may be generated simultaneously.
When an intermediate operator receives a barrier, it emits the barrier into all of its outgoing data streams for the snapshot the barrier belongs to.
When a sink operator receives the barrier, it confirms the snapshot to the checkpoint coordinator.
Only once all sink operators have confirmed the snapshot is the snapshot considered complete.
A snapshot also includes the state held by the operators, to ensure correct recovery of data flow processing when the stream processing system fails.
If an operator contains any form of state, that state must be part of the snapshot.
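The barrier mechanism can be sketched in plain Python (illustrative names, not the Flink API): barriers travel inside the record stream, and each one cuts off the records that belong to its snapshot:

```python
def cut_snapshots(stream):
    """Split a record stream at ('barrier', id) markers: records seen before
    a barrier belong to that barrier's snapshot; records after it belong to
    the next snapshot."""
    snapshots = {}
    pending = []
    for item in stream:
        if isinstance(item, tuple) and item[0] == "barrier":
            snapshots[item[1]] = pending   # the barrier closes this snapshot
            pending = []
        else:
            pending.append(item)
    return snapshots, pending              # `pending` awaits the next barrier

stream = ["r1", "r2", ("barrier", 1), "r3", ("barrier", 2), "r4"]
done, in_flight = cut_snapshots(stream)
print(done)       # {1: ['r1', 'r2'], 2: ['r3']}
print(in_flight)  # ['r4']
```

Because the barrier is just another record, the cut happens without pausing the stream, which is why the mechanism is described as lightweight.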
Operator States

There are two kinds of operator state: system state and user-defined state.

System state

When an operator performs calculation and processing, it needs to buffer data; the state of this data buffer is associated with the operator. The Flink system collects or aggregates the recorded data and puts it in the buffer until the data in the buffer has been processed.

User-defined state

It can be a simple variable such as a Java object in a function, or a key/value state related to the function.
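For illustration only (plain Python standing in for a keyed value state; the class name is made up), a user-defined per-key running count could look like:

```python
class RunningCount:
    """User-defined key/value state: one counter per key, updated per record.
    In Flink, this state would be included in each snapshot so that it can be
    restored after a failure."""
    def __init__(self):
        self.counts = {}                 # key -> count: the user-defined state

    def process(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        return key, self.counts[key]

op = RunningCount()
for k in ["a", "b", "a"]:
    print(op.process(k))                 # ('a', 1), ('b', 1), ('a', 2)
```

The `counts` dictionary is exactly the kind of state that must be part of a snapshot: replaying the stream after a failure without restoring it would produce wrong counts.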
THANK YOU
J. Ruby Dinakar
Department of Computer Science and Engineering
[email protected]
DATABASE TECHNOLOGIES
Unit - 4: Data-Stream Management

Amazon Kinesis

Prof. Raghu B. A.
Department of Computer Science and Engineering
Amazon Kinesis

In the Cloud
Spark’s expressive programming models and advanced analytics capabilities can be used on the
cloud, including the offerings of the major players: Amazon, Microsoft, and Google.

In this section, we provide a brief tour of the ways in which the streaming capabilities of Spark
can be used on the cloud infrastructure and with native cloud functions, and, if relevant, how
they compare with the cloud providers’ own proprietary stream-processing system.

Amazon Kinesis on AWS


Amazon Kinesis is the streaming delivery platform of Amazon Web Services (AWS).

It comes with rich semantics for defining producers and consumers of streaming data, along
with connectors for the pipelines created with those stream endpoints. We touched on
Kinesis in Chapter 19, in which we described the connector between Kinesis and Spark
Streaming.

There is a connector between Kinesis and Structured Streaming, as well, which is available
in two flavors:

• One offered natively to users of the Databricks edition of Spark, itself available on the
AWS and Microsoft Azure clouds
• The open source connector under JIRA SPARK-18165, which offers a way to stream data out of Kinesis easily

Those connectors are necessary because Kinesis, by design, does not come with a comprehensive stream-processing paradigm besides a language of continuous queries on AWS analytics, which covers simpler SQL-based queries. Therefore, the value of Kinesis is to let clients implement their own processing from robust sources and sinks produced with the battle-tested clients of the Kinesis SDK.

With Kinesis, it is possible to use the monitoring and throttling tools of the AWS platform,
getting a production ready stream delivery “out of the box” on the AWS cloud.
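As a delivery mechanism, Kinesis routes each record to a shard by MD5-hashing its partition key into a 128-bit hash-key space. A simplified sketch of that routing in plain Python (not the AWS SDK; real shards carry explicit, possibly uneven hash-key ranges, which are assumed evenly split here):

```python
import hashlib

def shard_for(partition_key, num_shards):
    """Pick a shard by MD5-hashing the partition key into the 128-bit
    hash-key space and locating the evenly split range it falls in."""
    key_space = 2 ** 128
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    # clamp in case the hash lands in the last range's rounding slack
    return min(h // (key_space // num_shards), num_shards - 1)

# the same partition key always routes to the same shard, which is how
# Kinesis preserves per-key ordering within a shard
assert shard_for("device-42", 4) == shard_for("device-42", 4)
```

This per-key shard affinity is what consumers such as Spark rely on when they read each shard as an ordered sub-stream.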

The open source connector between Amazon Kinesis and Structured Streaming is a contribution from Qubole engineers. This library allows Kinesis to be a full citizen of the Spark ecosystem, letting Spark users define analytic processing of arbitrary complexity.

Finally, note that the Kinesis connector for Spark Streaming was based on the older receiver
model, which comes with some performance issues.

This Structured Streaming client is much more modern in its implementation, but it has not yet migrated to version 2 of the data source APIs, introduced in Spark 2.3.0.

Kinesis is a region of the Spark ecosystem that would welcome contributions improving the quality of its implementations.

As a summary, Kinesis in AWS is a stream-delivery mechanism that introduces producing and consuming streams and connects them to particular endpoints. However, it comes with limited built-in analytics capabilities, which makes it complementary to a streaming analytics engine, such as Spark's streaming modules.

https://www.amazonaws.cn/en/products/?nc1=f_dr#analytics
THANK YOU
Raghu B. A.
Department of Computer Science and Engineering
