Lecture 9 - Real-time Analytics
Lecture Outline
• Real-time Analysis frameworks
• Apache Storm
• Apache Spark
Review
• Batch Analytics
• NoSQL
Real-time Analysis frameworks
• Real-time data analytics is the process of collecting, analyzing, and
acting on data in real-time as it is generated by various sources such
as sensors, financial markets, or customer interactions.
• The goal of real-time data analytics is to gain insight into the data
and provide business intelligence to make decisions quickly.
• Real-time analytics can be used to detect fraud, monitor customer
behavior, adjust prices and promotions, or manage inventory levels
in real-time.
Real-time Analysis frameworks
Stream Processing
• Stream processing is a type of data processing that involves taking
action on data as it is received in real-time, rather than waiting until
the data is stored in a database.
• Stream processing is typically used when time-sensitive decisions
must be made quickly, such as fraud detection or emergency
responses.
• The data is processed in small chunks over a continuous period of
time, allowing for faster decision-making.
Stream Processing
Apache Storm
• Apache Storm is a framework for distributed and fault-tolerant real-time
computation.
• Storm can be used for real-time processing of streams of data.
• Storm can ingest data from a variety of sources such as publish-subscribe
messaging frameworks, messaging queues and other custom connectors.
• Storm is a scalable and distributed framework, and offers reliable
processing of messages.
• Storm has been designed to run indefinitely and process streams of data
in real-time.
• The processing latencies with Storm are on the order of milliseconds.
Apache Storm
Concepts
• Topology:
• A computation job on the Storm cluster, called a “topology”, is a
graph of computation.
• A Storm topology comprises multiple worker processes that are
distributed on the cluster.
• Each worker process runs a subset of the topology.
• A topology is composed of two types of nodes:
• Spouts and Bolts.
• Figure shows some examples of these Storm topologies.
• The nodes in a topology are connected by directed edges.
• Each node receives a stream of data from other nodes and produces a new
stream.
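As an illustration, here is a minimal sketch of such a computation graph built with the Storm Java API (assuming Storm 2.x, package org.apache.storm). SentenceSpout, SplitBolt and CountBolt are hypothetical components; the first two are sketched in the Spout and Bolt sections below, CountBolt is only named for illustration.

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // Spout node: the source of the stream (hypothetical SentenceSpout,
        // sketched in the Spout section below).
        builder.setSpout("sentences", new SentenceSpout());

        // Bolt nodes: each *Grouping call adds a directed edge to the graph.
        builder.setBolt("split", new SplitBolt())
               .shuffleGrouping("sentences");                  // sentences -> split
        builder.setBolt("count", new CountBolt())
               .fieldsGrouping("split", new Fields("word"));   // split -> count

        // builder.createTopology() produces the topology graph that is later
        // submitted to the cluster (see the Architecture section).
    }
}
```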
Apache Storm
Concepts
• Tuples:
• The nodes in a topology consume data which is in the form of tuples.
• Each node receives data tuples from the previous node and produces tuples
which are processed further by the downstream nodes.
• A tuple is an ordered list of values.
• Tuples can contain values of primitive data types.
• Stream:
• Stream is an unbounded sequence of tuples.
• The nodes in a topology receive streams, process them and produce new
streams.
• The output streams can be consumed and processed by any downstream nodes
in the topology.
• In complex topologies, as shown in Figure (b), a node can produce or ingest
multiple streams.
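A small sketch of how a component describes and creates tuples with the Storm 2.x Java API; the field names "word" and "count" are illustrative only.

```java
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class TupleSchemaExample {
    // A spout or bolt declares the schema (field names) of the tuples it emits.
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }

    // A concrete tuple is an ordered list of values matching that schema;
    // in a spout or bolt it would be passed to collector.emit(...).
    public Values exampleTuple() {
        return new Values("storm", 1);
    }
}
```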
Stream Processing
Apache Storm
[Figure: example Storm topologies (a) and (b) - spouts and bolts connected by directed edges]
Apache Storm
Concepts
• Spout:
• Spout is a type of a node in a topology, which is a source of streams.
• Spouts receive data from external sources and emit it into the topology as
streams of tuples.
• Spouts do not process the tuples; they simply produce the tuples which are
consumed by the bolts in the topology.
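A minimal spout sketch, assuming the Storm 2.x Java API. The hypothetical SentenceSpout below emits hard-coded random sentences; a real spout would read from an external source such as a messaging queue.

```java
import java.util.Map;
import java.util.Random;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final String[] sentences = {
            "the cow jumped over the moon",
            "an apple a day keeps the doctor away"
    };
    private final Random random = new Random();

    @Override
    public void open(Map<String, Object> conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        this.collector = collector;   // keep the collector for emitting tuples
    }

    @Override
    public void nextTuple() {
        // Called repeatedly by Storm: emit one tuple into the topology.
        Utils.sleep(100);             // throttle the hypothetical source
        String sentence = sentences[random.nextInt(sentences.length)];
        collector.emit(new Values(sentence));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Schema of the emitted tuples: a single field named "sentence".
        declarer.declare(new Fields("sentence"));
    }
}
```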
Apache Storm
Concepts
• Bolt:
• Bolt is a type of a node in a topology that processes tuples.
• Bolts receive streams of tuples, process them and produce output streams.
• Bolts can receive streams either from spouts or other bolts.
• Bolts can perform various types of data processing operations such as filtering,
aggregation, joins, custom functions, etc.
• Storm topologies are designed such that each bolt performs simple
transformations on the data stream.
• Complex transformations are broken down into simpler transformations, which
are performed by multiple bolts.
• Since the different bolts process data in parallel, Storm can achieve low latencies
for data processing.
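A minimal bolt sketch, assuming the Storm 2.x Java API. The hypothetical SplitBolt below performs one simple transformation: splitting each sentence tuple into word tuples for downstream bolts.

```java
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class SplitBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> topoConf, TopologyContext context,
                        OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        String sentence = tuple.getStringByField("sentence");
        for (String word : sentence.split(" ")) {
            collector.emit(new Values(word));   // produce the output stream
        }
        collector.ack(tuple);                   // acknowledge the input tuple
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```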
Apache Storm
Concepts
• Workers:
• Spouts and bolts have multiple worker processes.
• Each worker process itself has multiple threads of execution (called tasks).
• These tasks process the data in parallel.
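A sketch of how worker processes and parallelism can be configured, assuming the Storm 2.x Java API and the SentenceSpout/SplitBolt sketched above. In Storm's terminology, the parallelism hint sets the number of executors (threads) and setNumTasks sets the number of tasks run within those executors.

```java
import org.apache.storm.Config;
import org.apache.storm.topology.TopologyBuilder;

public class ParallelismExample {
    public static void main(String[] args) {
        Config conf = new Config();
        conf.setNumWorkers(2);   // two worker processes for this topology

        TopologyBuilder builder = new TopologyBuilder();
        // Parallelism hint = number of executors (threads) for the component;
        // setNumTasks = number of tasks spread over those executors.
        builder.setSpout("sentences", new SentenceSpout(), 2);
        builder.setBolt("split", new SplitBolt(), 4).setNumTasks(8)
               .shuffleGrouping("sentences");
    }
}
```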
Apache Storm
Stream Groupings
• Since the bolts in a topology can have multiple tasks
(threads of execution), some mechanism is required to
define how the streams should be partitioned among the
tasks.
• This partitioning is defined in terms of stream groupings.
• Stream groupings define how the tuples produced by a
spout or bolt are distributed among the tasks of a
downstream bolt.
Storm supports the following types of stream groupings (a combined wiring sketch follows the list):
• Shuffle Grouping:
• In shuffle grouping, tuples are randomly distributed across the
tasks such that each task gets an equal number of tuples.
Apache Storm
Stream Groupings
• Field Grouping:
• In field grouping, a grouping field is
specified by which the tuples in a
stream are grouped.
• Tuples with the same value of the
grouping field are always sent to the
same task.
• All Grouping:
• In all grouping, the stream is broadcast
to all the tasks in the bolt.
• This type of grouping is used where the
stream is to be replicated to all tasks in
the destination bolt.
Apache Storm
Stream Groupings
• Global Grouping:
• In global grouping, the entire stream is
sent to a particular task of the
destination bolt (task with the lowest
ID).
• Direct Grouping:
• In direct grouping, the sender node
(spout or bolt) decides which task in the
destination bolt should receive the
stream.
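The wiring sketch referenced above, assuming the Storm 2.x Java API. SentenceSpout and SplitBolt are the components sketched earlier; CountBolt, MetricsBolt, ReportBolt and RouterBolt are hypothetical bolts named only to illustrate each grouping call.

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class GroupingExamples {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 2);

        // Shuffle grouping: tuples distributed randomly and evenly across tasks.
        builder.setBolt("split", new SplitBolt(), 4)
               .shuffleGrouping("sentences");

        // Field grouping: tuples with the same "word" value go to the same task.
        builder.setBolt("count", new CountBolt(), 4)
               .fieldsGrouping("split", new Fields("word"));

        // All grouping: the stream is replicated to every task of the bolt.
        builder.setBolt("metrics", new MetricsBolt(), 2)
               .allGrouping("split");

        // Global grouping: the entire stream goes to one task (lowest task id).
        builder.setBolt("report", new ReportBolt())
               .globalGrouping("count");

        // Direct grouping: the emitting component picks the target task
        // (the emitter must use collector.emitDirect(taskId, ...)).
        builder.setBolt("router", new RouterBolt(), 2)
               .directGrouping("split");
    }
}
```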
Apache Storm
Architecture
• Figure shows the components of a Storm cluster.
• A Storm cluster consists of the
• Nimbus, Supervisor and Zookeeper components.
• Nimbus is responsible for:
• distributing topology code and tasks around the cluster,
• launching workers across the cluster,
• and monitoring the execution of topologies.
• Nimbus sends signals to supervisors to start or stop processes.
Apache Storm
Architecture
• Supervisor nodes
• communicate with Nimbus through Zookeeper.
• A Storm cluster has one or more Supervisor nodes on which the worker
processes run.
• Zookeeper
• A high performance distributed coordination service for maintaining
configuration information, naming, providing distributed synchronization
and group services.
• Required for coordination of the Storm cluster.
• Zookeeper maintains the operational state of the cluster.
Apache Storm
Architecture
[Figure: components of a Storm cluster - Nimbus, Supervisor nodes and Zookeeper]
Apache Storm
Architecture
• Storm topologies include implementations of spouts and bolts and the
topology definitions.
• Topologies are packaged as JAR files and submitted to the Nimbus node
for execution.
• The Nimbus uploads the topology to all supervisors and signals the
supervisors to launch worker processes.
• The spout and bolt tasks (threads of execution) are assigned to the
worker processes on the supervisor nodes.
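A submission sketch, assuming the Storm 2.x Java API and the components sketched earlier. In practice this class would be packaged into a JAR and run with the storm jar command, which invokes the main method below and submits the topology to Nimbus.

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class SubmitExample {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout());
        builder.setBolt("split", new SplitBolt()).shuffleGrouping("sentences");

        Config conf = new Config();
        conf.setNumWorkers(2);

        // Nimbus distributes the topology code to the supervisors and
        // schedules the worker processes.
        // (For local testing, org.apache.storm.LocalCluster can be used instead.)
        StormSubmitter.submitTopology("word-count", conf, builder.createTopology());
    }
}
```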
Apache Storm
Architecture
• The topologies are monitored by the Nimbus node.
• If a worker on a supervisor fails, the supervisor restarts it.
• If a supervisor fails the Nimbus re-assigns the tasks to other
supervisors.
• If the Nimbus dies, the worker processes are not affected as the state
information is maintained by Zookeeper.
• The Nimbus and Supervisor daemons are run under supervision (using
tools such as monit, supervisord), so that they can be restarted if they
die.
Apache Storm
Reliable Processing
• Storm provides reliable processing of tuples.
• Storm guarantees that each tuple
produced by a spout is processed.
• Within a topology, a tuple which is
emitted by a spout is processed by the
bolts resulting in the creation of
multiple tuples which are based on the
original tuple.
• This results in a tuple tree as shown in
Figure.
Apache Storm
Reliable Processing
• Bolts in a topology acknowledge the processing of tuples to the
upstream bolts or spouts.
• If all bolts in a tuple tree acknowledge that a tuple has been successfully
processed, the spout marks the tuple processing to be completed,
performs cleanup and sends an acknowledgment to the external data
source.
• If any bolt in the tuple tree indicates that tuple processing failed (or
timed out), the spout marks the tuple processing as failed.
• When tuple processing fails, the spout re-emits the tuple.
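A sketch of anchoring and acknowledgment inside a bolt, assuming the Storm 2.x Java API; the field name "word" is illustrative.

```java
import org.apache.storm.task.OutputCollector;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class ReliableProcessingSketch {
    // Logic as it would appear in a bolt's execute() method: anchor output
    // tuples to the input tuple so they join its tuple tree, then ack (or fail).
    void process(Tuple input, OutputCollector collector) {
        try {
            String word = input.getStringByField("word");
            collector.emit(input, new Values(word, 1)); // anchored emit
            collector.ack(input);                       // processing succeeded
        } catch (Exception e) {
            collector.fail(input);                      // spout will re-emit the tuple
        }
    }
}
```

On the spout side, emitting with a message id (collector.emit(values, messageId)) and overriding the spout's ack() and fail() methods lets the spout re-emit tuples whose processing failed.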
Real-time Analysis frameworks
In-Memory Processing
• In-memory processing is a data processing technology where data is
held in the computer's volatile main memory (RAM) instead of a disk-
based storage system.
• This allows for faster processing of data since data is accessed directly
from the main memory, instead of having to be retrieved from slower
disk-based storage systems.
• In-memory processing can be used for a variety of applications, such
as analytics, real-time decision-making, and data integration.
In-Memory Processing
Apache Spark
• This section describes the Spark Streaming component for analysis of
streaming data such as sensor data, clickstream data, web server logs,
etc.
• The streaming data is ingested and analyzed in micro-batches.
• Spark Streaming enables scalable, high throughput and fault-tolerant
stream processing.
• Spark Streaming provides a high-level abstraction called DStream
(discretized stream).
• Spark can ingest data from various types of data sources such as
• publish-subscribe messaging frameworks, messaging queues, distributed file
systems and custom connectors.
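A minimal Spark Streaming setup sketch, assuming the Spark 2.x Java API; the local master and the TCP text source on localhost:9999 are illustrative assumptions.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingSetup {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf()
                .setMaster("local[2]")          // at least 2 threads: receiver + processing
                .setAppName("StreamingSetup");

        // Micro-batch interval: incoming data is grouped into 1-second batches.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Ingest a text stream from a TCP source as a DStream.
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        lines.print();              // output operation: print a few elements per batch
        jssc.start();               // start receiving and processing
        jssc.awaitTermination();    // block until stopped
    }
}
```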
In-Memory Processing
Apache Spark
• The data ingested is converted into DStreams. Figure shows the Spark
Streaming components.
In-Memory Processing
Apache Spark
• Spark provides operations for DStreams.
• Figure shows a DStream, which is composed of RDDs, where each RDD
contains data from a certain time interval.
• The DStream operations are translated into operations on the
underlying RDDs.
• DStream transformations such as map, flatMap, filter and reduceByKey are
stateless, as the transformations are applied to the RDDs in the DStream
separately.
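A sketch of stateless DStream transformations (flatMap, filter, mapToPair, reduceByKey), assuming the Spark 2.x Java API and the same illustrative socket source.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StatelessOps {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("StatelessOps");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // Stateless transformations: applied independently to each RDD (batch).
        JavaDStream<String> words = lines.flatMap(
                line -> Arrays.asList(line.split(" ")).iterator());
        JavaDStream<String> longWords = words.filter(w -> w.length() > 3);
        JavaPairDStream<String, Integer> counts = longWords
                .mapToPair(w -> new Tuple2<>(w, 1))
                .reduceByKey((a, b) -> a + b);

        counts.print();
        jssc.start();
        jssc.awaitTermination();
    }
}
```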
In-Memory Processing
Apache Spark
[Figure: a DStream as a sequence of RDDs, each containing data from a time interval]
In-Memory Processing
Apache Spark
• Spark also supports stateful operations such as windowed operations
and updateStateByKey operation.
• Stateful operations require checkpointing for fault tolerance purposes.
• For stateful operations, a checkpoint directory is provided to which
RDDs are checkpointed periodically.
• Figure shows an example of a window operation.
• Window operations allow the computations to be done over a sliding
window of data.
• For window operations, a window length and a slide interval are specified.
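A sketch of a window operation with checkpointing, assuming the Spark 2.x Java API; the checkpoint path and socket source are illustrative assumptions.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class WindowedOps {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("WindowedOps");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Checkpoint directory for stateful operations (hypothetical path).
        jssc.checkpoint("/tmp/spark-checkpoint");

        JavaDStream<String> words = jssc.socketTextStream("localhost", 9999)
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator());

        // Window length = 30 s, slide interval = 10 s: every 10 seconds,
        // operate on the last 30 seconds of data (the last 3 RDDs).
        JavaDStream<String> windowed = words.window(
                Durations.seconds(30), Durations.seconds(10));

        windowed.count().print();
        jssc.start();
        jssc.awaitTermination();
    }
}
```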
In-Memory Processing
Apache Spark
[Figure: a window operation over a DStream, showing the window length and slide interval]
Apache Spark
Operations
• window
• The window operation returns a new DStream from a sliding window over the
source DStream.
• countByWindow
• The countByWindow operation counts the number of elements in a window
over the DStream.
• reduceByWindow
• The reduceByWindow operation aggregates the elements in a sliding window
over a stream using the specified function.
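A sketch of countByWindow and reduceByWindow, under the same assumptions (Spark 2.x Java API, illustrative socket source and checkpoint path).

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class CountReduceByWindow {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("CountReduceByWindow");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
        jssc.checkpoint("/tmp/spark-checkpoint");   // hypothetical path

        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // countByWindow: number of elements in each 30 s window, every 10 s.
        JavaDStream<Long> counts = lines.countByWindow(
                Durations.seconds(30), Durations.seconds(10));

        // reduceByWindow: aggregate the elements of the window with a function
        // (here, concatenating lines; any associative function can be used).
        JavaDStream<String> joined = lines.reduceByWindow(
                (a, b) -> a + "\n" + b, Durations.seconds(30), Durations.seconds(10));

        counts.print();
        joined.print();
        jssc.start();
        jssc.awaitTermination();
    }
}
```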
Apache Spark
Operations
• reduceByKeyAndWindow
• The reduceByKeyAndWindow operation, when applied on a DStream containing
key-value pairs, aggregates the values of each key in a sliding window over the
stream using the specified function.
• The reduceByKeyAndWindow operation has two forms.
• In one form the reduced value over a new window is calculated by
applying the specified function over the whole window.
• In the other form, the reduced value over a new window is calculated
by applying the function to the new values which entered the window
and an inverse function over the values which left the window.
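A sketch showing both forms of reduceByKeyAndWindow, under the same assumptions; the second (incremental) form requires checkpointing.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class ReduceByKeyAndWindowForms {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("ReduceByKeyAndWindow");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
        jssc.checkpoint("/tmp/spark-checkpoint");   // needed for the inverse form

        JavaPairDStream<String, Integer> pairs = jssc.socketTextStream("localhost", 9999)
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(w -> new Tuple2<>(w, 1));

        // Form 1: recompute the reduced value over the whole window.
        JavaPairDStream<String, Integer> counts1 = pairs.reduceByKeyAndWindow(
                (a, b) -> a + b,
                Durations.seconds(30), Durations.seconds(10));

        // Form 2: incrementally add values entering the window and apply the
        // inverse function to values leaving it.
        JavaPairDStream<String, Integer> counts2 = pairs.reduceByKeyAndWindow(
                (a, b) -> a + b,          // reduce function
                (a, b) -> a - b,          // inverse reduce function
                Durations.seconds(30), Durations.seconds(10));

        counts1.print();
        counts2.print();
        jssc.start();
        jssc.awaitTermination();
    }
}
```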
Apache Spark
Operations
• countByValueAndWindow
• The countByValueAndWindow operation, when applied on a DStream, returns a new
DStream of key-value pairs where the key is a distinct element of the source
stream and the value is its count (number of occurrences) in the sliding window.
• updateStateByKey
• Another type of stateful operation is the updateStateByKey operation which
maintains and tracks the state for each key in a dataset.
• The updateStateByKey operation requires a state to be defined and an update
function for updating the state using the previous state and the new values.
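A sketch of countByValueAndWindow and updateStateByKey, under the same assumptions; the update function below keeps a running count per word.

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.Optional;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StatefulOps {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("StatefulOps");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
        jssc.checkpoint("/tmp/spark-checkpoint");   // hypothetical path

        JavaDStream<String> words = jssc.socketTextStream("localhost", 9999)
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator());

        // countByValueAndWindow: per-element counts over a sliding window.
        JavaPairDStream<String, Long> windowCounts = words.countByValueAndWindow(
                Durations.seconds(30), Durations.seconds(10));

        // updateStateByKey: running count per word, updated from the previous
        // state and the new values in each batch.
        JavaPairDStream<String, Integer> runningCounts = words
                .mapToPair(w -> new Tuple2<>(w, 1))
                .updateStateByKey((List<Integer> newValues, Optional<Integer> state) -> {
                    int sum = state.orElse(0);
                    for (Integer v : newValues) sum += v;
                    return Optional.of(sum);
                });

        windowCounts.print();
        runningCounts.print();
        jssc.start();
        jssc.awaitTermination();
    }
}
```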
Next lecture
• Interactive Analytics