Streaming Graph Processing (Unit 5)
Streaming graph processing is the continuous analysis and updating of graph data as it arrives,
in real time.
This approach is crucial for applications like social network analysis, fraud detection, and
recommendation systems, where graphs constantly evolve. Unlike traditional batch
processing, which handles data in chunks, streaming graph processing handles data as it
streams in, allowing for immediate analysis and action.
Example
Let’s consider an application that processes a stream of system log events in order to power
a dashboard giving immediate insight into what’s going on with the system. These are the
steps it will take:
1. Read the raw data from log files in a watched directory.
2. Parse and filter it, transforming it into event objects.
3. Enrich the events: using the machine ID field in the event object, look up the data we
have on that machine and attach it to the event.
4. Save the enriched, denormalized data to persistent storage such as Cassandra or
HDFS for deeper offline analysis (first data sink).
5. Perform windowed aggregation: using the attached data, calculate real-time statistics
over the events that happened within the last minute.
6. Push the results to an in-memory K-V store that backs the dashboard (second data
sink).
We can represent each of the above steps with a box in a diagram and connect the boxes
with arrows representing the data flow.
The boxes and arrows form a graph, specifically a Directed Acyclic Graph (a DAG). This
model is at the core of modern stream processing engines.
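As an illustration only, here is a minimal sketch of such a pipeline written against the Hazelcast Jet Pipeline API. The LogEvent class, the directory paths, and the map names ("machines", "dashboard-stats") are hypothetical placeholders, and the file sink merely stands in for a Cassandra or HDFS connector; treat it as a sketch of the DAG above, not a production job.

import com.hazelcast.jet.aggregate.AggregateOperations;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;
import com.hazelcast.jet.pipeline.StreamStage;
import com.hazelcast.jet.pipeline.WindowDefinition;
import java.io.Serializable;

public class LogDashboardPipeline {

    public static Pipeline build() {
        Pipeline p = Pipeline.create();

        // Steps 1-3: read raw lines from a watched directory, parse and filter them,
        // then enrich each event by looking up machine data in an IMap called "machines".
        StreamStage<LogEvent> enriched = p
                .readFrom(Sources.fileWatcher("/var/log/app"))     // step 1 (path is a placeholder)
                .withIngestionTimestamps()
                .map(LogEvent::parse)                              // step 2: raw line -> event object
                .filter(e -> e != null)                            // step 2: drop unparseable lines
                .mapUsingIMap("machines",                          // step 3: enrichment lookup
                        LogEvent::getMachineId,
                        (event, machineInfo) -> event.withMachine(machineInfo));

        // Step 4: first sink, persist the enriched, denormalized events
        // (a file sink stands in here for Cassandra or HDFS).
        enriched.writeTo(Sinks.files("/data/enriched-events"));

        // Steps 5-6: count events per machine over tumbling one-minute windows
        // and push the results to an in-memory map backing the dashboard.
        enriched
                .window(WindowDefinition.tumbling(60_000))
                .groupingKey(LogEvent::getMachineId)
                .aggregate(AggregateOperations.counting())
                .writeTo(Sinks.map("dashboard-stats"));

        return p;
    }

    // Hypothetical event type; a real application would model its own log schema.
    public static class LogEvent implements Serializable {
        final String machineId;
        final String message;
        final String machineInfo;

        LogEvent(String machineId, String message, String machineInfo) {
            this.machineId = machineId;
            this.message = message;
            this.machineInfo = machineInfo;
        }

        static LogEvent parse(String line) {
            String[] parts = line.split(" ", 2);
            return parts.length == 2 ? new LogEvent(parts[0], parts[1], null) : null;
        }

        String getMachineId() {
            return machineId;
        }

        LogEvent withMachine(Object info) {
            return new LogEvent(machineId, message, String.valueOf(info));
        }
    }
}

Submitting the built pipeline as a job to a running cluster is omitted here; the point is that each stage corresponds to one box of the DAG.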
Transformations
Transformations express the business logic of a streaming application. At a low level, a
processing task receives some stream items, performs arbitrary processing on them, and
emits some items. It may emit items without receiving anything (acting as a stream source),
or it may only receive items and emit nothing (acting as a sink).
Because the computation is distributed, you can’t just provide arbitrary imperative code
that processes the data; you must describe the processing declaratively.
This is why streaming applications share some principles with functional and dataflow
programming, which takes some getting used to when coming from imperative programming.
These are the main types of stateless transformation:
Map transforms one record into one record, e.g. changing the record’s format or enriching
it with some data.
Filter filters out the records that don’t satisfy the predicate.
FlatMap is the most general type of stateless transformation, outputting zero or
more records for each input record, e.g. tokenizing a record containing a sentence into
individual words.
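As a quick illustration in plain Java (using java.util.stream rather than any particular streaming engine, and with made-up log lines as input), the three stateless transformations look like this:

import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class StatelessTransformations {
    public static void main(String[] args) {
        List<String> lines = List.of("WARN disk almost full", "INFO user logged in", "ERROR disk failure");

        // Map: one record in, one record out (here: normalize the format of each record)
        List<String> upperCased = lines.stream()
                .map(line -> line.toUpperCase(Locale.ROOT))
                .collect(Collectors.toList());

        // Filter: drop records that do not satisfy the predicate
        List<String> problemsOnly = lines.stream()
                .filter(line -> !line.startsWith("INFO"))
                .collect(Collectors.toList());

        // FlatMap: zero or more records per input record (here: tokenize each line into words)
        List<String> words = lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.toList());

        System.out.println(upperCased);
        System.out.println(problemsOnly);
        System.out.println(words);
    }
}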
However, many types of computation involve more than one record. In this case, the
processor must maintain internal state across the records. When counting the records in the
stream, for example, you have to maintain the current count.
Stateful transformations:
Aggregation: Combines all the records to produce a single value,
e.g. min, max, sum, count, avg.
Group-and-aggregate: Extracts a grouping key from the record and computes a
separate aggregated value for each key.
Join: Joins same-keyed records from several streams.
Sort: Sorts the records observed in the stream.
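For instance, a group-and-aggregate that counts events per machine can be sketched in plain Java with the Collectors API; the comma-separated sample records are made up, and a streaming engine would hold the resulting map as internal state and update it record by record:

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupAndAggregate {
    public static void main(String[] args) {
        // Each record is "machineId,message"; hypothetical sample data
        List<String> events = List.of("m1,disk full", "m2,login", "m1,disk failure", "m3,reboot", "m1,login");

        // Group-and-aggregate: extract the machine id as the grouping key and count events per key
        Map<String, Long> countsPerMachine = events.stream()
                .collect(Collectors.groupingBy(e -> e.split(",", 2)[0], Collectors.counting()));

        System.out.println(countsPerMachine); // e.g. {m1=3, m2=1, m3=1}
    }
}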
Sources and Sinks
A stream processing application accesses data sources and sinks via its connectors. They are
a computation job’s point of contact with the outside world.
Although the connectors do their best to unify the various kinds of resources under the
same “data stream” paradigm, there are still many concerns that need your attention as
they limit what you can do within your stream processing application.
Is It Unbounded?
The first decision when building a computation job is whether it will deal with
bounded (finite) or unbounded (infinite) data.
Bounded data is handled in batch jobs and there are fewer concerns to deal with, as data
boundaries are within the dataset itself.
You don’t have to worry about windowing, late events, or event-time skew. Examples of
bounded, finite resources are plain files, files stored in the Hadoop Distributed File System
(HDFS), and the results of a database query.
Unbounded data streams allow for continuous processing; however, you have to use
windowing for operations that cannot effectively work on infinite input (such as sum, avg,
or sort).
In the unbounded category, the most popular choice is Kafka. Some databases can be
turned into unbounded data sources by exposing the journal, the stream of all data changes,
as an API for third parties.
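For example, a Kafka topic can be consumed as an unbounded stream with the standard Java client. The broker address, consumer group, and topic name below are placeholders:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class UnboundedKafkaSource {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("group.id", "log-dashboard");              // placeholder consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("system-logs"));       // placeholder topic name
            // The stream is unbounded: this loop never finishes on its own.
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}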
Is It Replayable?
A data source is replayable if the same data can be read again after it has been consumed.
This is generally necessary for fault tolerance: if anything goes wrong during the computation,
the data can be replayed and the computation can be restarted from the beginning.
Bounded data sources are mostly replayable (e.g. plain file, HDFS file). The replayability of
infinite data sources is limited by the disk space necessary to store the whole data stream —
an Apache Kafka source is replayable with this limitation. On the other hand, some sources
can be read only once (e.g. TCP socket source, JMS Queue source).
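With the Kafka Java client, for instance, a consumer can rewind to the earliest retained offset and re-read the stream; the broker address, group id, and topic below are placeholders:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReplayFromBeginning {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("group.id", "replay-demo");                // placeholder consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("system-logs"));       // placeholder topic
            consumer.poll(Duration.ofSeconds(1));             // poll once so partitions get assigned
            // Rewind to the earliest retained offset; the stream is replayable only as far
            // back as the topic's retention allows, which is the limitation noted above.
            consumer.seekToBeginning(consumer.assignment());
            // Subsequent poll() calls now re-read the stream from the beginning.
        }
    }
}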
Is It Distributed?
A distributed computation engine prefers to work with distributed data resources to
maximize performance. If the resource is not distributed, all members of the stream
processing engine (SPE) cluster will have to contend for access to a single data-source
endpoint. Kafka, HDFS, and Hazelcast IMap are all distributed; a plain file, on the other
hand, is not: it is stored on a single machine.
Data Locality?
If you’re looking to achieve record-breaking throughput for your application, you’ll have to
think carefully about how close you can deliver your data to the location where the stream
processing application will consume and process it. For example, if your source is HDFS, you
should align the topologies of the Hadoop and SPE clusters so that each machine that hosts
an HDFS member also hosts a node of the SPE cluster.
Kafka
A data pipeline built on AWS and Kafka uses Kafka, a distributed streaming platform, to
handle the flow of data, and AWS services for processing and storage. Kafka is often
used as a message broker, enabling real-time streaming data pipelines and
applications. AWS provides services such as Amazon MSK (Managed Streaming for Apache
Kafka), MSK Connect, and AWS Data Pipeline to build and manage these pipelines.
1. Kafka as the Backbone:
Kafka acts as a central hub for ingesting, storing, and distributing streaming data.
It facilitates real-time data processing by allowing producers to publish data to
topics and consumers to subscribe and process it.
Kafka can be used for various use cases, including tracking user activity, processing
IoT events, and building data pipelines.
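A producer publishing such events to a Kafka topic with the standard Java client looks roughly like this; the broker address, topic name, key, and value are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ActivityProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish a user-activity event to the (placeholder) "user-activity" topic
            producer.send(new ProducerRecord<>("user-activity", "user-42", "clicked:checkout"));
            producer.flush();
        }
    }
}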
2. AWS Services for Data Processing and Storage:
AWS Data Pipeline:
This service helps sequence, schedule, and manage data processing tasks, including ETL
activities.
AWS Glue:
Glue provides a platform for data cataloguing, data transformation, and ETL jobs.
Amazon MSK Connect:
This feature simplifies building data pipelines by allowing you to define connectors that
move data between Kafka and other AWS services like S3, Redshift, Elasticsearch, and
RDS.
Amazon S3:
A scalable object storage service that can be used for storing processed data from Kafka.
Amazon Redshift:
A data warehouse service that can be used for analyzing large datasets from Kafka.
Amazon EMR:
A managed Hadoop and Spark cluster environment that can be used for processing data
from Kafka.
Amazon Kinesis Data Streams:
A fully managed streaming data service that can be used as an alternative to Kafka.
3. Example Data Pipeline Workflow:
1. Data Producers: Applications or systems generate data and publish it to Kafka
topics.
2. Kafka as a Hub: Kafka receives and stores the data in its topics.
3. Consumers: Applications or AWS services (e.g., Spark, Glue, Lambda) consume data
from Kafka topics.
4. Processing and Transformation: Consumers perform data transformations,
aggregations, and other processing tasks.
5. Data Storage: Processed data is then stored in services like S3, Redshift, or other
destinations.
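Steps 3 through 5 of this workflow can be sketched with the Kafka Java client and the AWS SDK for Java v2; the bucket name, topic, transformation, and one-object-per-record layout are purely illustrative, and a production pipeline would more likely batch records or delegate this to MSK Connect or Glue:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class ConsumeTransformStore {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder; an MSK bootstrap string in practice
        props.put("group.id", "s3-archiver");                // placeholder consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             S3Client s3 = S3Client.create()) {
            consumer.subscribe(List.of("user-activity"));    // placeholder topic (step 3: consume)
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // Step 4: a trivial transformation (upper-case the payload)
                    String transformed = record.value().toUpperCase();
                    // Step 5: store the processed record in S3 (one object per record, for illustration only)
                    s3.putObject(PutObjectRequest.builder()
                                    .bucket("my-pipeline-bucket")                 // hypothetical bucket
                                    .key("events/" + record.partition() + "-" + record.offset())
                                    .build(),
                            RequestBody.fromString(transformed));
                }
            }
        }
    }
}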
4. Benefits of using Kafka and AWS for data pipelines:
Scalability: Kafka and AWS services offer scalability to handle large volumes of data.
Reliability: Kafka provides fault tolerance and data durability.
Real-time Processing: Kafka allows for real-time data processing and analysis.
Flexibility: AWS offers various services that can be combined with Kafka to create
custom data pipelines.
5. Setting up a Data Pipeline:
Choose Kafka (MSK): You can use Amazon Managed Streaming for Apache Kafka
(MSK) or a self-managed Kafka cluster.
Create Kafka Topics: Define topics to store different types of data streams.
Connect to AWS Services: Configure connectors using MSK Connect to move data
between Kafka and other AWS services.
Set up Data Processing: Define data processing tasks using AWS Glue or other
services.
Configure Data Storage: Choose the appropriate storage solution (e.g., S3, Redshift)
for your processed data.
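For the “Create Kafka Topics” step, topics can also be created programmatically with the Kafka AdminClient; the bootstrap address, topic name, partition count, and replication factor below are placeholder values to adjust for your MSK cluster:

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder; use the MSK bootstrap brokers string

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions and replication factor 3 are placeholder values
            NewTopic userActivity = new NewTopic("user-activity", 6, (short) 3);
            admin.createTopics(List.of(userActivity)).all().get();   // block until the broker confirms
        }
    }
}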