Lecture 9 - Real-time Analytics
Lecture Outline
• Real-time Analysis frameworks
• Apache Storm
• Apache Spark
Review
• Batch Analytics
• NoSQL
Real-time Analysis frameworks
• Real-time data analytics is the process of collecting, analyzing, and
acting on data in real-time as it is generated by various sources such
as sensors, financial markets, or customer interactions.
• The goal of real-time data analytics is to gain insight into the data
and provide business intelligence to make decisions quickly.
• Real-time analytics can be used to detect fraud, monitor customer
behavior, adjust prices and promotions, or manage inventory levels
in real-time.
Real-time Analysis frameworks
Stream Processing
• Stream processing is a type of data processing that involves taking
action on data as it is received in real-time, rather than waiting until
the data is stored in a database.
• Stream processing is typically used when time-sensitive decisions
must be made quickly, such as fraud detection or emergency
responses.
• The data is processed in small chunks over a continuous period of
time, allowing for faster decision-making.
Stream Processing
Apache Storm
• Apache Storm is a framework for distributed and fault-tolerant real-time
computation.
• Storm can be used for real-time processing of streams of data.
• Storm can ingest data from a variety of sources such as publish-subscribe
messaging frameworks, messaging queues and other custom connectors.
• Storm is a scalable and distributed framework, and offers reliable
processing of messages.
• Storm has been designed to run indefinitely and process streams of data
in real-time.
• The processing latencies with Storm are on the order of milliseconds.
Apache Storm
Concepts
• Topology:
• A computation job on the Storm cluster, called a “topology”, is a
graph of computation.
• A Storm topology comprises multiple worker processes that are
distributed on the cluster.
• Each worker process runs a subset of the topology.
• A topology is composed of two types of nodes:
• Spouts and Bolts.
• Figure shows some examples of these Storm topologies.
• The nodes in a topology are connected by directed edges.
• Each node receives a stream of data from other nodes and produces a new
stream.
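As an illustration, here is a minimal sketch of such a computation graph built with the Storm Java API (assuming Storm 2.x, package org.apache.storm). SentenceSpout, SplitBolt and CountBolt are hypothetical components; the first two are sketched in the Spout and Bolt sections below, CountBolt is only named for illustration.

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // Spout node: the source of the stream (hypothetical SentenceSpout,
        // sketched in the Spout section below).
        builder.setSpout("sentences", new SentenceSpout());

        // Bolt nodes: each *Grouping call adds a directed edge to the graph.
        builder.setBolt("split", new SplitBolt())
               .shuffleGrouping("sentences");                  // sentences -> split
        builder.setBolt("count", new CountBolt())
               .fieldsGrouping("split", new Fields("word"));   // split -> count

        // builder.createTopology() produces the topology graph that is later
        // submitted to the cluster (see the Architecture section).
    }
}
```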
Apache Storm
Concepts
• Tuples:
• The nodes in a topology consume data which is in the form of tuples.
• Each node receives data tuples from the previous node and produces tuples
which are processed further by the downstream nodes.
• A tuple is an ordered list of values.
• Tuples can contain values of primitive data types.
• Stream:
• Stream is an unbounded sequence of tuples.
• The nodes in a topology receive streams, process them and produce new
streams.
• The output streams can be consumed and processed by any downstream nodes
in the topology.
• In complex topologies, as shown in Figure (b), a node can produce or ingest
multiple streams.
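A small sketch of how a component describes and creates tuples with the Storm 2.x Java API; the field names "word" and "count" are illustrative only.

```java
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class TupleSchemaExample {
    // A spout or bolt declares the schema (field names) of the tuples it emits.
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }

    // A concrete tuple is an ordered list of values matching that schema;
    // in a spout or bolt it would be passed to collector.emit(...).
    public Values exampleTuple() {
        return new Values("storm", 1);
    }
}
```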
Stream Processing
Apache Storm
[Figure: example Storm topologies (a) and (b) - spouts and bolts connected by directed edges]
Apache Storm
Concepts
• Spout:
• Spout is a type of a node in a topology, which is a source of streams.
• Spouts receive data from external sources and emit it into the topology as
streams of tuples.
• Spouts do not process the tuples; they simply produce the tuples which are
consumed by the bolts in the topology.
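A minimal spout sketch, assuming the Storm 2.x Java API. The hypothetical SentenceSpout below emits hard-coded random sentences; a real spout would read from an external source such as a messaging queue.

```java
import java.util.Map;
import java.util.Random;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final String[] sentences = {
            "the cow jumped over the moon",
            "an apple a day keeps the doctor away"
    };
    private final Random random = new Random();

    @Override
    public void open(Map<String, Object> conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        this.collector = collector;   // keep the collector for emitting tuples
    }

    @Override
    public void nextTuple() {
        // Called repeatedly by Storm: emit one tuple into the topology.
        Utils.sleep(100);             // throttle the hypothetical source
        String sentence = sentences[random.nextInt(sentences.length)];
        collector.emit(new Values(sentence));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Schema of the emitted tuples: a single field named "sentence".
        declarer.declare(new Fields("sentence"));
    }
}
```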
Apache Storm
Concepts
• Bolt:
• Bolt is a type of a node in a topology that processes tuples.
• Bolts receive streams of tuples, process them and produce output streams.
• Bolts can receive streams either from spouts or other bolts.
• Bolts can perform various types of data processing operations such as filtering,
aggregation, joins, custom functions, etc.
• Storm topologies are designed such that each bolt performs simple
transformations on the data stream.
• Complex transformations are broken down into simpler transformations, which
are performed by multiple bolts.
• Since the different bolts process data in parallel, Storm can achieve low latencies
for data processing.
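A minimal bolt sketch, assuming the Storm 2.x Java API. The hypothetical SplitBolt below performs one simple transformation: splitting each sentence tuple into word tuples for downstream bolts.

```java
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class SplitBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> topoConf, TopologyContext context,
                        OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        String sentence = tuple.getStringByField("sentence");
        for (String word : sentence.split(" ")) {
            collector.emit(new Values(word));   // produce the output stream
        }
        collector.ack(tuple);                   // acknowledge the input tuple
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```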
Apache Storm
Concepts
• Workers:
• Spouts and bolts have multiple worker processes.
• Each worker process itself has multiple threads of execution (called tasks).
• These tasks process the data in parallel.
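A sketch of how worker processes and parallelism can be configured, assuming the Storm 2.x Java API and the SentenceSpout/SplitBolt sketched above. In Storm's terminology, the parallelism hint sets the number of executors (threads) and setNumTasks sets the number of tasks run within those executors.

```java
import org.apache.storm.Config;
import org.apache.storm.topology.TopologyBuilder;

public class ParallelismExample {
    public static void main(String[] args) {
        Config conf = new Config();
        conf.setNumWorkers(2);   // two worker processes for this topology

        TopologyBuilder builder = new TopologyBuilder();
        // Parallelism hint = number of executors (threads) for the component;
        // setNumTasks = number of tasks spread over those executors.
        builder.setSpout("sentences", new SentenceSpout(), 2);
        builder.setBolt("split", new SplitBolt(), 4).setNumTasks(8)
               .shuffleGrouping("sentences");
    }
}
```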
Apache Storm
Stream Groupings
• Since the bolts in a topology can have multiple tasks
(threads of execution), some mechanism is required to
define how the streams should be partitioned among the
tasks.
• This partitioning is defined in terms of stream groupings.
• Stream groupings define how the tuples produced by a
spout or bolt are distributed among the tasks of a
downstream bolt.
Storm supports the following types of stream groupings (a combined wiring sketch follows the list):
• Shuffle Grouping:
• In shuffle grouping, tuples are randomly distributed across the
tasks such that each task gets an equal number of tuples.
Apache Storm
Stream Groupings
• Field Grouping:
• In field grouping, a grouping field is
specified by which the tuples in a
stream are grouped.
• Tuples with the same value of the
grouping field are always sent to the
same task.
• All Grouping:
• In all grouping, the stream is broadcast
to all the tasks in the bolt.
• This type of grouping is used where the
stream is to be replicated to all tasks in
the destination bolt.
Apache Storm
Stream Groupings
• Global Grouping:
• In global grouping, the entire stream is
sent to a particular task of the
destination bolt (task with the lowest
ID).
• Direct Grouping:
• In direct grouping, the sender node
(spout or bolt) decides which task in the
destination bolt should receive the
stream.
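The wiring sketch referenced above, assuming the Storm 2.x Java API. SentenceSpout and SplitBolt are the components sketched earlier; CountBolt, MetricsBolt, ReportBolt and RouterBolt are hypothetical bolts named only to illustrate each grouping call.

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class GroupingExamples {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 2);

        // Shuffle grouping: tuples distributed randomly and evenly across tasks.
        builder.setBolt("split", new SplitBolt(), 4)
               .shuffleGrouping("sentences");

        // Field grouping: tuples with the same "word" value go to the same task.
        builder.setBolt("count", new CountBolt(), 4)
               .fieldsGrouping("split", new Fields("word"));

        // All grouping: the stream is replicated to every task of the bolt.
        builder.setBolt("metrics", new MetricsBolt(), 2)
               .allGrouping("split");

        // Global grouping: the entire stream goes to one task (lowest task id).
        builder.setBolt("report", new ReportBolt())
               .globalGrouping("count");

        // Direct grouping: the emitting component picks the target task
        // (the emitter must use collector.emitDirect(taskId, ...)).
        builder.setBolt("router", new RouterBolt(), 2)
               .directGrouping("split");
    }
}
```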
Apache Storm
Architecture
• Figure shows the components of a Storm cluster.
• A Storm cluster consists of the
• Nimbus, Supervisor and Zookeeper components.
• Nimbus is responsible for:
• distributing topology code and tasks around the cluster,
• launching workers across the cluster,
• and monitoring the execution of topologies.
• Nimbus sends signals to supervisors to start or stop processes.
Apache Storm
Architecture
• Supervisor nodes
• communicate with Nimbus through Zookeeper.
• A Storm cluster has one or more Supervisor nodes on which the worker
processes run.
• Zookeeper
• A high performance distributed coordination service for maintaining
configuration information, naming, providing distributed synchronization
and group services.
• Required for coordination of the Storm cluster.
• Zookeeper maintains the operational state of the cluster.
Apache Storm
Architecture
[Figure: components of a Storm cluster - Nimbus, Supervisor nodes and Zookeeper]
Apache Storm
Architecture
• Storm topologies include implementations of spouts and bolts and the
topology definitions.
• Topologies are packaged as JAR files and submitted to the Nimbus node
for execution.
• The Nimbus uploads the topology to all supervisors and signals the
supervisors to launch worker processes.
• The spout and bolt tasks (threads of execution) are assigned to the
worker processes on the supervisor nodes.
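A submission sketch, assuming the Storm 2.x Java API and the components sketched earlier. In practice this class would be packaged into a JAR and run with the storm jar command, which invokes the main method below and submits the topology to Nimbus.

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class SubmitExample {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout());
        builder.setBolt("split", new SplitBolt()).shuffleGrouping("sentences");

        Config conf = new Config();
        conf.setNumWorkers(2);

        // Nimbus distributes the topology code to the supervisors and
        // schedules the worker processes.
        // (For local testing, org.apache.storm.LocalCluster can be used instead.)
        StormSubmitter.submitTopology("word-count", conf, builder.createTopology());
    }
}
```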
Apache Storm
Architecture
• The topologies are monitored by the Nimbus node.
• If a worker on a supervisor fails, the supervisor restarts it.
• If a supervisor fails the Nimbus re-assigns the tasks to other
supervisors.
• If the Nimbus dies, the worker processes are not affected as the state
information is maintained by Zookeeper.
• The Nimbus and Supervisor daemons are run under supervision (using
tools such as monit, supervisord), so that they can be restarted if they
die.
Apache Storm
Reliable Processing
• Storm provides reliable processing of tuples.
• Storm guarantees that each tuple
produced by a spout is processed.
• Within a topology, a tuple which is
emitted by a spout is processed by the
bolts resulting in the creation of
multiple tuples which are based on the
original tuple.
• This results in a tuple tree as shown in
Figure.
Apache Storm
Reliable Processing
• Bolts in a topology acknowledge the processing of tuples to the
upstream bolts or spouts.
• If all bolts in a tuple tree acknowledge that a tuple has been successfully
processed, the spout marks the tuple processing to be completed,
performs cleanup and sends an acknowledgment to the external data
source.
• If any bolt in the tuple tree indicates that tuple processing failed (or
timed out), the spout marks the tuple processing as failed.
• When tuple processing fails, the spout re-emits the tuple.
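A sketch of anchoring and acknowledgment inside a bolt, assuming the Storm 2.x Java API; the field name "word" is illustrative.

```java
import org.apache.storm.task.OutputCollector;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class ReliableProcessingSketch {
    // Logic as it would appear in a bolt's execute() method: anchor output
    // tuples to the input tuple so they join its tuple tree, then ack (or fail).
    void process(Tuple input, OutputCollector collector) {
        try {
            String word = input.getStringByField("word");
            collector.emit(input, new Values(word, 1)); // anchored emit
            collector.ack(input);                       // processing succeeded
        } catch (Exception e) {
            collector.fail(input);                      // spout will re-emit the tuple
        }
    }
}
```

On the spout side, emitting with a message id (collector.emit(values, messageId)) and overriding the spout's ack() and fail() methods lets the spout re-emit tuples whose processing failed.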
Real-time Analysis frameworks
In-Memory Processing
• In-memory processing is a data processing technology where data is
held in the computer's volatile main memory (RAM) instead of a disk-
based storage system.
• This allows for faster processing of data since data is accessed directly
from the main memory, instead of having to be retrieved from slower
disk-based storage systems.
• In-memory processing can be used for a variety of applications, such
as analytics, real-time decision-making, and data integration.
In-Memory Processing
Apache Spark
• This section describes the Spark Streaming component for analysis of
streaming data such as sensor data, clickstream data, web server logs,
etc.
• The streaming data is ingested and analyzed in micro-batches.
• Spark Streaming enables scalable, high throughput and fault-tolerant
stream processing.
• Spark Streaming provides a high-level abstraction called DStream
(discretized stream).
• Spark can ingest data from various types of data sources such as
• publish-subscribe messaging frameworks, messaging queues, distributed file
systems and custom connectors.
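A minimal Spark Streaming setup sketch, assuming the Spark 2.x Java API; the local master and the TCP text source on localhost:9999 are illustrative assumptions.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingSetup {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf()
                .setMaster("local[2]")          // at least 2 threads: receiver + processing
                .setAppName("StreamingSetup");

        // Micro-batch interval: incoming data is grouped into 1-second batches.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Ingest a text stream from a TCP source as a DStream.
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        lines.print();              // output operation: print a few elements per batch
        jssc.start();               // start receiving and processing
        jssc.awaitTermination();    // block until stopped
    }
}
```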
In-Memory Processing
Apache Spark
• The data ingested is converted into DStreams. Figure shows the Spark
Streaming components.
In-Memory Processing
Apache Spark
• Spark provides operations for DStreams.
• Figure shows a DStream, which is composed of RDDs, where each RDD
contains data from a certain time interval.
• The DStream operations are translated into operations on the
underlying RDDs.
• DStream transformations such as map, flatMap, filter and reduceByKey are
stateless, as the transformations are applied to the RDDs in the DStream
separately.
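A sketch of stateless DStream transformations (flatMap, filter, mapToPair, reduceByKey), assuming the Spark 2.x Java API and the same illustrative socket source.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StatelessOps {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("StatelessOps");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // Stateless transformations: applied independently to each RDD (batch).
        JavaDStream<String> words = lines.flatMap(
                line -> Arrays.asList(line.split(" ")).iterator());
        JavaDStream<String> longWords = words.filter(w -> w.length() > 3);
        JavaPairDStream<String, Integer> counts = longWords
                .mapToPair(w -> new Tuple2<>(w, 1))
                .reduceByKey((a, b) -> a + b);

        counts.print();
        jssc.start();
        jssc.awaitTermination();
    }
}
```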
In-Memory Processing
Apache Spark
[Figure: a DStream as a sequence of RDDs, each containing data from a time interval]
In-Memory Processing
Apache Spark
• Spark also supports stateful operations such as windowed operations
and updateStateByKey operation.
• Stateful operations require checkpointing for fault tolerance purposes.
• For stateful operations, a checkpoint directory is provided to which
RDDs are checkpointed periodically.
• Figure shows an example of a window operation.
• Window operations allow the computations to be done over a sliding
window of data.
• For window operations, a window length and a slide interval are specified.
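A sketch of a window operation with checkpointing, assuming the Spark 2.x Java API; the checkpoint path and socket source are illustrative assumptions.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class WindowedOps {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("WindowedOps");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Checkpoint directory for stateful operations (hypothetical path).
        jssc.checkpoint("/tmp/spark-checkpoint");

        JavaDStream<String> words = jssc.socketTextStream("localhost", 9999)
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator());

        // Window length = 30 s, slide interval = 10 s: every 10 seconds,
        // operate on the last 30 seconds of data (the last 3 RDDs).
        JavaDStream<String> windowed = words.window(
                Durations.seconds(30), Durations.seconds(10));

        windowed.count().print();
        jssc.start();
        jssc.awaitTermination();
    }
}
```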
In-Memory Processing
Apache Spark
[Figure: a window operation over a DStream, showing the window length and slide interval]
Apache Spark
Operations
• window
• The window operation returns a new DStream from a sliding window over the
source DStream.
• countByWindow
• The countByWindow operation counts the number of elements in a window
over the DStream.
• reduceByWindow
• The reduceByWindow operation aggregates the elements in a sliding window
over a stream using the specified function.
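A sketch of countByWindow and reduceByWindow, under the same assumptions (Spark 2.x Java API, illustrative socket source and checkpoint path).

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class CountReduceByWindow {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("CountReduceByWindow");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
        jssc.checkpoint("/tmp/spark-checkpoint");   // hypothetical path

        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // countByWindow: number of elements in each 30 s window, every 10 s.
        JavaDStream<Long> counts = lines.countByWindow(
                Durations.seconds(30), Durations.seconds(10));

        // reduceByWindow: aggregate the elements of the window with a function
        // (here, concatenating lines; any associative function can be used).
        JavaDStream<String> joined = lines.reduceByWindow(
                (a, b) -> a + "\n" + b, Durations.seconds(30), Durations.seconds(10));

        counts.print();
        joined.print();
        jssc.start();
        jssc.awaitTermination();
    }
}
```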
Apache Spark
Operations
• reduceByKeyAndWindow
• The reduceByKeyAndWindow operation, when applied on a DStream containing
key-value pairs, aggregates the values of each key in a sliding window over the
stream using the specified function.
• The reduceByKeyAndWindow operation has two forms.
• In one form the reduced value over a new window is calculated by
applying the specified function over the whole window.
• In the other form, the reduced value over a new window is calculated
by applying the function to the new values which entered the window
and an inverse function over the values which left the window.
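A sketch showing both forms of reduceByKeyAndWindow, under the same assumptions; the second (incremental) form requires checkpointing.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class ReduceByKeyAndWindowForms {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("ReduceByKeyAndWindow");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
        jssc.checkpoint("/tmp/spark-checkpoint");   // needed for the inverse form

        JavaPairDStream<String, Integer> pairs = jssc.socketTextStream("localhost", 9999)
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(w -> new Tuple2<>(w, 1));

        // Form 1: recompute the reduced value over the whole window.
        JavaPairDStream<String, Integer> counts1 = pairs.reduceByKeyAndWindow(
                (a, b) -> a + b,
                Durations.seconds(30), Durations.seconds(10));

        // Form 2: incrementally add values entering the window and apply the
        // inverse function to values leaving it.
        JavaPairDStream<String, Integer> counts2 = pairs.reduceByKeyAndWindow(
                (a, b) -> a + b,          // reduce function
                (a, b) -> a - b,          // inverse reduce function
                Durations.seconds(30), Durations.seconds(10));

        counts1.print();
        counts2.print();
        jssc.start();
        jssc.awaitTermination();
    }
}
```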
Apache Spark
Operations
• countByValueAndWindow
• The countByValueAndWindow operation, when applied on a DStream, returns a new
DStream of key-value pairs where the key is a distinct element of the source
stream and the value is its count (number of occurrences) in the sliding window.
• updateStateByKey
• Another type of stateful operation is the updateStateByKey operation which
maintains and tracks the state for each key in a dataset.
• The updateStateByKey operation requires a state to be defined and an update
function for updating the state using the previous state and the new values.
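A sketch of countByValueAndWindow and updateStateByKey, under the same assumptions; the update function below keeps a running count per word.

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.Optional;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StatefulOps {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("StatefulOps");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
        jssc.checkpoint("/tmp/spark-checkpoint");   // hypothetical path

        JavaDStream<String> words = jssc.socketTextStream("localhost", 9999)
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator());

        // countByValueAndWindow: per-element counts over a sliding window.
        JavaPairDStream<String, Long> windowCounts = words.countByValueAndWindow(
                Durations.seconds(30), Durations.seconds(10));

        // updateStateByKey: running count per word, updated from the previous
        // state and the new values in each batch.
        JavaPairDStream<String, Integer> runningCounts = words
                .mapToPair(w -> new Tuple2<>(w, 1))
                .updateStateByKey((List<Integer> newValues, Optional<Integer> state) -> {
                    int sum = state.orElse(0);
                    for (Integer v : newValues) sum += v;
                    return Optional.of(sum);
                });

        windowCounts.print();
        runningCounts.print();
        jssc.start();
        jssc.awaitTermination();
    }
}
```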
Next lecture
• Interactive Analytics