Streaming Graph Processing (Unit 5)
Streaming graph processing is the continuous analysis and updating of graph data as it arrives,
in real time.
This approach is crucial for applications like social network analysis, fraud detection, and
recommendation systems, where graphs constantly evolve. Unlike traditional batch
processing, which handles data in chunks, streaming graph processing handles data as it
streams in, allowing for immediate analysis and action.
Example
Let’s consider an application that processes a stream of system log events in order to power
a dashboard giving immediate insight into what’s going on with the system. These are the
steps it will take:
1. Read the raw data from log files in a watched directory.
2. Parse and filter it, transforming it into event objects.
3. Enrich the events: using the machine ID field in the event object, look up the data we
have on that machine and attach it to the event.
4. Save the enriched, denormalized data to persistent storage such as Cassandra or
HDFS for deeper offline analysis (first data sink).
5. Perform windowed aggregation: using the attached data, calculate real-time statistics
over the events that happened within the last minute.
6. Push the results to an in-memory K-V store that backs the dashboard (second data
sink).
We can represent each of the above steps with a box in a diagram and connect the boxes
with arrows representing the data flow.
The boxes and arrows form a graph, specifically a Directed Acyclic Graph (a DAG). This
model is at the core of modern stream processing engines.
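As an illustration only, here is a minimal sketch of such a pipeline written against the Hazelcast Jet Pipeline API. The LogEvent class, the directory paths, and the map names ("machines", "dashboard-stats") are hypothetical placeholders, and the file sink merely stands in for a Cassandra or HDFS connector; treat it as a sketch of the DAG above, not a production job.

import com.hazelcast.jet.aggregate.AggregateOperations;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;
import com.hazelcast.jet.pipeline.StreamStage;
import com.hazelcast.jet.pipeline.WindowDefinition;
import java.io.Serializable;

public class LogDashboardPipeline {

    public static Pipeline build() {
        Pipeline p = Pipeline.create();

        // Steps 1-3: read raw lines from a watched directory, parse and filter them,
        // then enrich each event by looking up machine data in an IMap called "machines".
        StreamStage<LogEvent> enriched = p
                .readFrom(Sources.fileWatcher("/var/log/app"))     // step 1 (path is a placeholder)
                .withIngestionTimestamps()
                .map(LogEvent::parse)                              // step 2: raw line -> event object
                .filter(e -> e != null)                            // step 2: drop unparseable lines
                .mapUsingIMap("machines",                          // step 3: enrichment lookup
                        LogEvent::getMachineId,
                        (event, machineInfo) -> event.withMachine(machineInfo));

        // Step 4: first sink, persist the enriched, denormalized events
        // (a file sink stands in here for Cassandra or HDFS).
        enriched.writeTo(Sinks.files("/data/enriched-events"));

        // Steps 5-6: count events per machine over tumbling one-minute windows
        // and push the results to an in-memory map backing the dashboard.
        enriched
                .window(WindowDefinition.tumbling(60_000))
                .groupingKey(LogEvent::getMachineId)
                .aggregate(AggregateOperations.counting())
                .writeTo(Sinks.map("dashboard-stats"));

        return p;
    }

    // Hypothetical event type; a real application would model its own log schema.
    public static class LogEvent implements Serializable {
        final String machineId;
        final String message;
        final String machineInfo;

        LogEvent(String machineId, String message, String machineInfo) {
            this.machineId = machineId;
            this.message = message;
            this.machineInfo = machineInfo;
        }

        static LogEvent parse(String line) {
            String[] parts = line.split(" ", 2);
            return parts.length == 2 ? new LogEvent(parts[0], parts[1], null) : null;
        }

        String getMachineId() {
            return machineId;
        }

        LogEvent withMachine(Object info) {
            return new LogEvent(machineId, message, String.valueOf(info));
        }
    }
}

Submitting the built pipeline as a job to a running cluster is omitted here; the point is that each stage corresponds to one box of the DAG.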
Transformations
Transformations express the business logic of a streaming application. At a low level, a
processing task receives some stream items, performs arbitrary processing on them, and
emits some items. It may emit items without receiving anything (acting as a stream source),
or it may only receive items and emit nothing (acting as a sink).
Because the computation is distributed, you can’t just provide arbitrary imperative code
that processes the data; you must describe the processing declaratively.
This is why streaming applications share some principles with functional and dataflow
programming, which takes some getting used to when coming from imperative programming.
These are the main types of stateless transformation:
Map transforms one record into one record, e.g. changing the record’s format or enriching
it with some data.
Filter filters out the records that don’t satisfy the predicate.
FlatMap is the most general type of stateless transformation, outputting zero or
more records for each input record, e.g. tokenizing a record containing a sentence into
individual words.
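As a quick illustration in plain Java (using java.util.stream rather than any particular streaming engine, and with made-up log lines as input), the three stateless transformations look like this:

import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class StatelessTransformations {
    public static void main(String[] args) {
        List<String> lines = List.of("WARN disk almost full", "INFO user logged in", "ERROR disk failure");

        // Map: one record in, one record out (here: normalize the format of each record)
        List<String> upperCased = lines.stream()
                .map(line -> line.toUpperCase(Locale.ROOT))
                .collect(Collectors.toList());

        // Filter: drop records that do not satisfy the predicate
        List<String> problemsOnly = lines.stream()
                .filter(line -> !line.startsWith("INFO"))
                .collect(Collectors.toList());

        // FlatMap: zero or more records per input record (here: tokenize each line into words)
        List<String> words = lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.toList());

        System.out.println(upperCased);
        System.out.println(problemsOnly);
        System.out.println(words);
    }
}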
However, many types of computation involve more than one record. In this case, the
processor must maintain internal state across the records. When counting the records in the
stream, for example, you have to maintain the current count.
Stateful transformations:
Aggregation: Combines all the records to produce a single value,
e.g. min, max, sum, count, avg.
Group-and-aggregate: Extracts a grouping key from the record and computes a
separate aggregated value for each key.
Join: Joins same-keyed records from several streams.
Sort: Sorts the records observed in the stream.
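For instance, a group-and-aggregate that counts events per machine can be sketched in plain Java with the Collectors API; the comma-separated sample records are made up, and a streaming engine would hold the resulting map as internal state and update it record by record:

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupAndAggregate {
    public static void main(String[] args) {
        // Each record is "machineId,message"; hypothetical sample data
        List<String> events = List.of("m1,disk full", "m2,login", "m1,disk failure", "m3,reboot", "m1,login");

        // Group-and-aggregate: extract the machine id as the grouping key and count events per key
        Map<String, Long> countsPerMachine = events.stream()
                .collect(Collectors.groupingBy(e -> e.split(",", 2)[0], Collectors.counting()));

        System.out.println(countsPerMachine); // e.g. {m1=3, m2=1, m3=1}
    }
}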
Sources and Sinks
A stream processing application accesses data sources and sinks via its connectors. They are
a computation job’s point of contact with the outside world.
Although the connectors do their best to unify the various kinds of resources under the
same “data stream” paradigm, there are still many concerns that need your attention as
they limit what you can do within your stream processing application.
Is It Unbounded?
The first decision when building a computation job is whether it will deal with
bounded (finite) or unbounded (infinite) data.
Bounded data is handled in batch jobs and there are fewer concerns to deal with, as data
boundaries are within the dataset itself.
You don’t have to worry about windowing, late events, or event-time skew. Examples of
bounded, finite resources are plain files, files stored in the Hadoop Distributed File System
(HDFS), and the results of a database query.
Unbounded data streams allow for continuous processing; however, you have to use
windowing for operations that cannot effectively work on infinite input (such as sum, avg,
or sort).
In the unbounded category, the most popular choice is Kafka. Some databases can be
turned into unbounded data sources by exposing the journal, the stream of all data changes,
as an API for third parties.
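For example, a Kafka topic can be consumed as an unbounded stream with the standard Java client. The broker address, consumer group, and topic name below are placeholders:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class UnboundedKafkaSource {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("group.id", "log-dashboard");              // placeholder consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("system-logs"));       // placeholder topic name
            // The stream is unbounded: this loop never finishes on its own.
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}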
Is It Replayable?
A data source is replayable if the same data can be read again after it has been consumed.
This is generally necessary for fault tolerance: if anything goes wrong during the computation,
the data can be replayed and the computation can be restarted from the beginning.
Bounded data sources are mostly replayable (e.g. plain file, HDFS file). The replayability of
infinite data sources is limited by the disk space necessary to store the whole data stream —
an Apache Kafka source is replayable with this limitation. On the other hand, some sources
can be read only once (e.g. TCP socket source, JMS Queue source).
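With the Kafka Java client, for instance, a consumer can rewind to the earliest retained offset and re-read the stream; the broker address, group id, and topic below are placeholders:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReplayFromBeginning {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("group.id", "replay-demo");                // placeholder consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("system-logs"));       // placeholder topic
            consumer.poll(Duration.ofSeconds(1));             // poll once so partitions get assigned
            // Rewind to the earliest retained offset; the stream is replayable only as far
            // back as the topic's retention allows, which is the limitation noted above.
            consumer.seekToBeginning(consumer.assignment());
            // Subsequent poll() calls now re-read the stream from the beginning.
        }
    }
}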
Is It Distributed?
A distributed computation engine prefers to work with distributed data resources to
maximize performance. If the resource is not distributed, all members of the stream
processing engine (SPE) cluster will have to contend for access to a single data-source
endpoint. Kafka, HDFS, and Hazelcast IMap are all distributed; a plain file, on the other
hand, is not: it is stored on a single machine.
Data Locality?
If you’re looking to achieve record-breaking throughput for your application, you’ll have to
think carefully about how close you can deliver your data to the location where the stream
processing application will consume and process it. For example, if your source is HDFS, you
should align the topologies of the Hadoop and SPE clusters so that each machine that hosts
an HDFS member also hosts a node of the SPE cluster.
Kafka
A data pipeline built on AWS and Kafka uses Kafka, a distributed streaming platform, to
handle the flow of data, and AWS services for processing and storage. Kafka is often
used as a message broker, enabling real-time streaming data pipelines and
applications. AWS provides services such as Amazon MSK (Managed Streaming for Apache
Kafka), MSK Connect, and AWS Data Pipeline to build and manage these pipelines.
1. Kafka as the Backbone:
Kafka acts as a central hub for ingesting, storing, and distributing streaming data.
It facilitates real-time data processing by allowing producers to publish data to
topics and consumers to subscribe and process it.
Kafka can be used for various use cases, including tracking user activity, processing
IoT events, and building data pipelines.
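A producer publishing such events to a Kafka topic with the standard Java client looks roughly like this; the broker address, topic name, key, and value are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ActivityProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish a user-activity event to the (placeholder) "user-activity" topic
            producer.send(new ProducerRecord<>("user-activity", "user-42", "clicked:checkout"));
            producer.flush();
        }
    }
}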
2. AWS Services for Data Processing and Storage:
AWS Data Pipeline:
This service helps sequence, schedule, and manage data processing tasks, including ETL
activities.
AWS Glue:
Glue provides a platform for data cataloguing, data transformation, and ETL jobs.
Amazon MSK Connect:
This feature simplifies building data pipelines by allowing you to define connectors that
move data between Kafka and other AWS services like S3, Redshift, Elasticsearch, and
RDS.
Amazon S3:
A scalable object storage service that can be used for storing processed data from Kafka.
Amazon Redshift:
A data warehouse service that can be used for analyzing large datasets from Kafka.
Amazon EMR:
A managed Hadoop and Spark cluster environment that can be used for processing data
from Kafka.
Amazon Kinesis Data Streams:
A fully managed streaming data service that can be used as an alternative to Kafka.
3. Example Data Pipeline Workflow:
1. Data Producers: Applications or systems generate data and publish it to Kafka
topics.
2. Kafka as a Hub: Kafka receives and stores the data in its topics.
3. Consumers: Applications or AWS services (e.g., Spark, Glue, Lambda) consume data
from Kafka topics.
4. Processing and Transformation: Consumers perform data transformations,
aggregations, and other processing tasks.
5. Data Storage: Processed data is then stored in services like S3, Redshift, or other
destinations.
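Steps 3 through 5 of this workflow can be sketched with the Kafka Java client and the AWS SDK for Java v2; the bucket name, topic, transformation, and one-object-per-record layout are purely illustrative, and a production pipeline would more likely batch records or delegate this to MSK Connect or Glue:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class ConsumeTransformStore {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder; an MSK bootstrap string in practice
        props.put("group.id", "s3-archiver");                // placeholder consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             S3Client s3 = S3Client.create()) {
            consumer.subscribe(List.of("user-activity"));    // placeholder topic (step 3: consume)
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // Step 4: a trivial transformation (upper-case the payload)
                    String transformed = record.value().toUpperCase();
                    // Step 5: store the processed record in S3 (one object per record, for illustration only)
                    s3.putObject(PutObjectRequest.builder()
                                    .bucket("my-pipeline-bucket")                 // hypothetical bucket
                                    .key("events/" + record.partition() + "-" + record.offset())
                                    .build(),
                            RequestBody.fromString(transformed));
                }
            }
        }
    }
}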
4. Benefits of using Kafka and AWS for data pipelines:
Scalability: Kafka and AWS services offer scalability to handle large volumes of data.
Reliability: Kafka provides fault tolerance and data durability.
Real-time Processing: Kafka allows for real-time data processing and analysis.
Flexibility: AWS offers various services that can be combined with Kafka to create
custom data pipelines.
5. Setting up a Data Pipeline:
Choose Kafka (MSK): You can use Amazon Managed Streaming for Apache Kafka
(MSK) or a self-managed Kafka cluster.
Create Kafka Topics: Define topics to store different types of data streams.
Connect to AWS Services: Configure connectors using MSK Connect to move data
between Kafka and other AWS services.
Set up Data Processing: Define data processing tasks using AWS Glue or other
services.
Configure Data Storage: Choose the appropriate storage solution (e.g., S3, Redshift)
for your processed data.
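For the “Create Kafka Topics” step, topics can also be created programmatically with the Kafka AdminClient; the bootstrap address, topic name, partition count, and replication factor below are placeholder values to adjust for your MSK cluster:

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder; use the MSK bootstrap brokers string

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions and replication factor 3 are placeholder values
            NewTopic userActivity = new NewTopic("user-activity", 6, (short) 3);
            admin.createTopics(List.of(userActivity)).all().get();   // block until the broker confirms
        }
    }
}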