
Unit 3
Big Data Streaming Platforms

Real-Time Processing for Big Data
• Big data concepts and techniques are used for architecting and
designing real-time systems and building analytics applications
over them.
• Many recent modeling and implementation approaches for
developing real-time applications are based on the concept of an
event.
• The term event refers to each data point in the system, and
stream refers to the ongoing delivery of those events.
• A series of events can also be referred to as streaming data or
data streams.
Real-Time Processing for Big Data
• Actions that are taken on those events include:
  – aggregations (e.g., calculations such as sum, mean, standard deviation),
  – analytics (e.g., predicting a future event based on patterns in the data),
  – transformations (e.g., changing a number into a date format),
  – enrichment (e.g., combining the data point with other data sources to create more context and meaning),
  – ingestion (e.g., inserting the data into a database).
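• As an illustration of the aggregation case, the following minimal sketch (a framework-agnostic, hypothetical example in plain Java) maintains a running count, sum, and mean, updating them one event at a time:

// Hypothetical sketch: a running aggregate updated as each event arrives.
public class RunningAggregate {
    private long count = 0;
    private double sum = 0.0;

    // Called once per incoming event (data point).
    public void onEvent(double value) {
        count++;
        sum += value;
    }

    public double mean() {
        return count == 0 ? 0.0 : sum / count;
    }

    public static void main(String[] args) {
        RunningAggregate agg = new RunningAggregate();
        for (double reading : new double[] {21.5, 22.0, 23.1}) {   // stand-in for a live stream
            agg.onEvent(reading);
            System.out.printf("count=%d mean=%.2f%n", agg.count, agg.mean());
        }
    }
}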
Data Stream Processing
• Many real-time applications involve a continuous flow of information that is transferred from one location to another over a specific period of time.
• This type of interaction is called a stream, and it implies the transfer of a sequence of related information.
• For example, video and audio data.
• The term is also used frequently to describe a sequence of data associated with real-world events, e.g. as emitted by sensors, devices, or other applications.
Data Stream Processing
• Processing events is called event stream processing or data stream processing.
• Stream processing is a technology that allows users to query a continuous data stream and quickly detect conditions within a small time period from the moment the data is received.
• Event stream processing works by handling a data set one data point at a time, instead of viewing the data as a whole set.
• Data stream applications focus on filtering and aggregation of stream data using SQL queries.
• For example, a user can query a data stream coming from a temperature sensor and receive an alert when the temperature exceeds a maximum threshold value.
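• To make the one-data-point-at-a-time idea concrete, here is a minimal, framework-agnostic sketch in plain Java (the threshold value and the sample readings are made up) that inspects each temperature reading as it arrives and raises an alert when the threshold is exceeded:

import java.util.List;

// Hypothetical sketch: event-at-a-time filtering of a temperature stream.
public class TemperatureAlert {
    private static final double THRESHOLD = 30.0;   // assumed maximum threshold

    public static void main(String[] args) {
        // Stand-in for readings arriving continuously from a sensor.
        List<Double> readings = List.of(24.0, 29.5, 31.2, 27.8);

        readings.stream()                                  // each reading is handled individually
                .filter(reading -> reading > THRESHOLD)    // the "query" condition
                .forEach(reading ->
                        System.out.println("ALERT: temperature " + reading + " exceeds " + THRESHOLD));
    }
}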
Need of Data Stream Processing
• Processing big data produces the core values (insights) that each application needs.
• These values are not all equal.
• Some values are highest shortly after the underlying event has happened, and that value diminishes very fast with time.
• Stream processing targets such scenarios.
• The key strength of stream processing is that it can provide updated values for data items faster, within milliseconds to seconds.
Need of Data Stream Processing
• Data stream processing can handle never-ending data streams gracefully and naturally.
• Users can detect patterns, inspect results, and work on multiple streams simultaneously.
• Stream processing naturally fits time-series data and detecting patterns over time.
• Batch processing lets the data build up and tries to process it all at once, while stream processing handles data as it comes in, spreading the processing over time.
• Stream processing requires less hardware than batch processing.
• When the amount of data is so huge that it is not even possible to store it, stream processing can handle it easily.
Data Stream Processing Platforms
• Many of the data streaming platforms are open source solutions.
• These platforms facilitate the construction of real-time applications, mainly message-oriented or event-driven applications.
• These applications support ingesting messages or events at a very high rate, handing them over to subsequent processing, and generating alerts.
• These platforms focus on supporting event-driven data flow through nodes in a distributed system or within a cloud infrastructure platform.
• They provide a basis for building an analytics layer on top of the big data stack.
Data Stream Processing Platforms
• The Hadoop ecosystem supports distributed computing and large-scale data processing infrastructure.
• It was mainly developed to support processing of large sets of structured, unstructured, and semi-structured data, but it was designed as a batch processing system.
• The main drawback of the Hadoop ecosystem is that it does not meet the performance requirements of fast data analytics.
• In order to support real-time processing, it can be linked with some other components.
Data Stream Processing Platforms
• In an event stream processing environment, there are two main classes of technologies:
• 1) The system that stores the events:- consists of components that handle data storage, storing the data based on a timestamp.
• 2) The technology that helps developers write applications that take action on the events.
• These are also known as stream processors or stream processing engines.
Data Stream Processing Platforms
• The components needed for a data streaming application in Hadoop are as follows:-
• MapReduce, a distributed data processing model and execution environment that runs on large clusters of commodity machines.
• Hadoop Distributed File System (HDFS), a distributed file system that runs on large clusters of commodity machines.
• ZooKeeper, a distributed, highly available coordination service, providing primitives that can be used for building distributed applications.
• Pig, a dataflow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
• Hive, a distributed data warehouse.
Spark
• Apache Spark is a more recent framework that provides an engine for distributing programs across clusters of machines.
• It provides a Read-Evaluate-Print Loop (REPL) approach for working with data interactively.
• Spark supports integration with a variety of tools in the Hadoop ecosystem.
• It can read and write data in all of the data formats supported by MapReduce.
• It can read from and write to NoSQL databases like HBase and Cassandra.
Spark
• It provides a stream processing library called Spark Streaming, which is an extension of the Spark core framework.
• It is well suited for real-time processing and analysis, supporting scalable, high-throughput, and fault-tolerant processing of live data streams.
• Spark Streaming represents a continuous stream of data as a Discretized Stream (DStream).
• Internally, a DStream is represented as a sequence of resilient distributed datasets (RDDs).
• RDDs are distributed collections that can be operated on in parallel by arbitrary functions and by transformations over the data (e.g., sliding window computations).
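• A minimal word-count sketch using the Spark 2.x-style Java API illustrates these ideas; it assumes the spark-streaming dependency and a text source on a local TCP socket (localhost:9999 is a placeholder):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount");
        // 1-second batch interval: each batch of the DStream is internally one RDD.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // DStream created from an input source (here a TCP socket; Kafka, Flume or HDFS are alternatives).
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        JavaPairDStream<String, Integer> counts =
                words.mapToPair(w -> new Tuple2<>(w, 1)).reduceByKey((a, b) -> a + b);

        counts.print();          // emit the result stream, one batch at a time
        jssc.start();
        jssc.awaitTermination();
    }
}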
Spark
• DStreams can be created either directly from input data stream sources, such as Apache Kafka, Flume, HDFS, or databases, or by transforming the RDDs output by other DStreams.
• Spark Streaming receives live input data streams through a receiver and divides the data into micro-batches.
• These batches are then processed by the Spark engine to generate the final stream of results in batches.
• The processing components used are referred to as window transformation operators.
Spark
• Spark Streaming uses a small, deterministic batch interval (in seconds) to divide the stream into processable units.
• The size of the interval determines the throughput and latency: the larger the interval, the higher the throughput and the latency.
• Data can be taken from sources like Kafka, Kinesis, or TCP sockets, and processed using complex algorithms expressed through high-level functions like map, reduce, join, and window.
• Processed data can also be pushed out to filesystems, databases, and live dashboards.
• Spark Streaming supports both batch and streaming applications.
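• Building on the previous sketch, a window transformation can be applied across micro-batches; the example below (same assumed socket source; the 30-second window and 10-second slide are arbitrary illustrative values) counts words over a sliding window rather than a single batch interval:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class WindowedWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("WindowedWordCount");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
        jssc.checkpoint("checkpoint-dir");   // checkpoint location (placeholder path)

        JavaPairDStream<String, Integer> pairs = jssc.socketTextStream("localhost", 9999)
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(w -> new Tuple2<>(w, 1));

        // Count words over the last 30 seconds of data, recomputed every 10 seconds.
        JavaPairDStream<String, Integer> windowedCounts =
                pairs.reduceByKeyAndWindow((a, b) -> a + b, Durations.seconds(30), Durations.seconds(10));

        windowedCounts.print();
        jssc.start();
        jssc.awaitTermination();
    }
}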
Spark
Storm
• Storm is a distributed real-time computation system used for processing large volumes of high-velocity data.
• It makes it easy to process unbounded streams of data and has a relatively simple processing model.
• It is designed to process large amounts of data in a fault-tolerant and horizontally scalable manner.
• Storm is easy to set up and operate, and it guarantees that every message will be processed through the topology at least once.
• Storm is stateless and manages the distributed environment and cluster state via ZooKeeper.
• Storm reads a raw stream of real-time data at one end and passes it through a sequence of small processing units.
Storm
• The internal components of Apache Storm are:-
• 1. Spout:- It is a source of a stream. Generally, Storm accepts input data from raw data sources like the Twitter Streaming API, an Apache Kafka queue, a Kestrel queue, etc.
• Users can also write their own spouts to read data from data sources.
• The core interface for implementing spouts is ISpout.
• Some of the specific interfaces are IRichSpout, BaseRichSpout, KafkaSpout, etc.
Storm
• 2. Bolts:- These are logical processing units.
• Spouts pass data to bolts, and bolts process it and produce a new output stream.
• A bolt can process any number of input streams and produce any number of new output streams.
• Bolts can perform operations such as filtering, aggregation, joining, and interacting with data sources and databases.
• The core interface for implementing bolts is IBolt.
• Some of the common interfaces are IRichBolt, IBasicBolt, etc.
Storm
• 3. Topology:- Spouts and bolts are connected together and form a topology.
• It is a DAG, where the vertices are computations and the edges are streams of data.
• A simple topology starts with spouts.
• A spout emits data to one or more bolts.
• A bolt represents a node in the topology, and the output of a bolt can be emitted into another bolt as input.
• Storm's main job is to run the topology, and it can run any number of topologies at a given time.
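• The sketch below ties the three components together using the Storm 1.x-style Java API (assuming the storm-core dependency; the random-word spout is a stand-in for a real source such as Kafka, and the bolt simply uppercases each word):

import java.util.Map;
import java.util.Random;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class WordTopology {

    // Spout: the source of the stream; here it emits random words instead of reading Kafka or Twitter.
    public static class WordSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final String[] words = {"storm", "stream", "event"};
        private final Random random = new Random();

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            collector.emit(new Values(words[random.nextInt(words.length)]));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    // Bolt: a logical processing unit; uppercases each word and emits a new stream.
    public static class UpperBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            collector.emit(new Values(input.getStringByField("word").toUpperCase()));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("upper_word"));
        }
    }

    public static void main(String[] args) throws Exception {
        // Topology: the DAG that wires the spout and bolt together.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("word-spout", new WordSpout());
        builder.setBolt("upper-bolt", new UpperBolt()).shuffleGrouping("word-spout");

        LocalCluster cluster = new LocalCluster();   // in-process cluster for local testing
        cluster.submitTopology("word-topology", new Config(), builder.createTopology());
    }
}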
Storm
Samza
• Apache Samza is a distributed stream processing framework.
• Its current stable version is 0.10.0.
• It is very similar to Storm in that it is a stream processor with a one-at-a-time processing model and at-least-once processing semantics.
• It uses Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.
• It was initially created at LinkedIn, submitted to the Apache Incubator in July 2013, and was granted top-level status in 2015.
Samza
• Samza was co-developed with the queueing system Kafka and has the same messaging semantics: streams are partitioned, and messages (i.e. data items) inside the same partition are ordered.
• By default, Samza employs a key-value format for storage of data.
• Samza supports JVM languages, particularly Java.
• Scalability is achieved by running a Samza job as several parallel tasks, each of which consumes a separate partition of the input stream.
Samza
Samza
• Samza processes messages in order and stores processing results durably after each step; it is able to prevent data loss by periodically checkpointing current progress and reprocessing all data from that point onwards in case of failure.
• Samza does not support a weaker guarantee than at-least-once processing.
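• A minimal Samza task in the classic (0.10-era) low-level API might look like the sketch below; the "kafka" system name, the "filtered-temperatures" output stream, and the threshold are assumptions, and the .properties job configuration that Samza also requires is omitted:

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

// Hypothetical Samza task: forwards readings above a threshold to an output stream.
public class TemperatureFilterTask implements StreamTask {
    private static final SystemStream OUTPUT = new SystemStream("kafka", "filtered-temperatures");
    private static final double THRESHOLD = 30.0;   // assumed threshold

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        // Messages within a partition are handed to this task one at a time, in order.
        double reading = Double.parseDouble((String) envelope.getMessage());
        if (reading > THRESHOLD) {
            collector.send(new OutgoingMessageEnvelope(OUTPUT, String.valueOf(reading)));
        }
    }
}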
FLUME
• Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
• It has a simple and flexible architecture based on streaming data flows.
• It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms.
• It uses a simple extensible data model that allows for online analytic applications.
• While Flume and Kafka can both act as the event backbone for real-time event processing, they have different characteristics.
FLUME
• Kafka is well suited for high-throughput publish-subscribe messaging applications that require scalability and availability.
• Flume is better suited when one needs to support data ingestion and simple event processing, but it is not suitable for complex event processing (CEP) applications.
• One of the benefits of Flume is that it supports many sources and sinks out of the box.
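• A single-node Flume agent is defined in a properties file that wires sources, channels, and sinks together; the sketch below is adapted from the standard getting-started example (a netcat source feeding a memory channel and a logger sink; the agent name a1 and the port are arbitrary):

# a1: agent with one netcat source, one memory channel, and one logger sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listens on a TCP port and turns each received line into an event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffers events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink: writes events to the log (an HDFS or Kafka sink would be used in a real pipeline)
a1.sinks.k1.type = logger

# Wire the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1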
Amazon Kinesis
• Amazon Kinesis is a cloud-based service for real-time data processing over large, distributed data streams.
• Amazon Kinesis can continuously capture and store terabytes of data per hour from hundreds of thousands of sources such as website clickstreams, financial transactions, social media feeds, IT logs, and location-tracking events.
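• On the capture side, a producer simply puts records into a named stream; the sketch below uses the AWS SDK for Java (v1), where the stream name "clickstream", the partition key, and the payload are placeholders, and credentials/region are assumed to come from the environment:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.PutRecordRequest;

public class KinesisPutExample {
    public static void main(String[] args) {
        AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();

        // One clickstream event, serialized as a simple string payload.
        PutRecordRequest request = new PutRecordRequest();
        request.setStreamName("clickstream");       // placeholder stream name
        request.setPartitionKey("user-42");         // determines the shard the record goes to
        request.setData(ByteBuffer.wrap("page=/home".getBytes(StandardCharsets.UTF_8)));

        kinesis.putRecord(request);
    }
}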
Amazon Kinesis
• Kinesis allows integration with Storm, as it provides a Kinesis
Storm Spout that fetches data from a Kinesis stream and
emits it as tuples.
• The inclusion of this Kinesis component into a Storm
topology provides a reliable and scalable stream capture,
storage, and replay service.
Big Data Pipelines for Real-Time Processing
• Real-time processing deals with streams of data that are captured in real time and processed with minimal latency to generate real-time reports or automated responses.
• It is defined as the processing of an unbounded stream of input data.
• The incoming data can be in unstructured or semi-structured format.
• It has the same processing requirements as batch processing, but with shorter turnaround times to support real-time consumption.
Big Data Pipelines for Real-Time Processing
• Processed data is often written to an analytical data store, which is optimized for analytics and visualization.
• The processed data can also be ingested directly into the analytics and reporting layer for analysis, business intelligence, and real-time dashboard visualization.
Big Data Pipelines for Real-Time Processing
• A real-time processing architecture has the following logical components.
• Real-time message ingestion:- The architecture must provide a way to capture and store real-time messages to be consumed by a stream processing consumer.
• This service could be implemented as a simple data store in which new messages are stored in a folder.
• It also requires a message broker, such as Azure Event Hubs, that acts as a buffer for the messages.
• The message broker should support scale-out processing and reliable delivery.
Big Data Pipelines for Real-Time Processing
• Stream processing:- After capturing the real-time messages, the solution processes them by filtering, aggregating, and preparing the data for analysis.
• Analytical data store:- Many big data solutions are designed to prepare data for analysis and then serve the processed data in a structured format that can be queried using analytical tools.
• Analysis and reporting:- The goal of most big data solutions is to provide insights into the data through analysis and reporting.
Streaming Ecosystem
• The different tools used in real-time streaming have their strengths and weaknesses; users need to select the correct tool and use it properly in the real-time data processing pipeline.
• A streaming framework is used to provide an optimized approach for real-time stream processing.
• In a typical streaming application, the basic job is to gather data from a large, geographically distributed set of users.
• To gather the data properly, a good distributed channel-like mechanism is required.
• The gathered data must be put into the channel and provided for processing to a set of servers on a commodity hardware platform.
Kafka
• Apache Kafka is developed by the Apache Software Foundation and written in Scala and Java.
• It is an open source stream processing platform that aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
• Kafka connects with external systems to import or export data through Kafka Connect, and it also provides Kafka Streams, a Java stream processing library.
• It is one of the preferred solutions for integrating real-time data from multiple stream-producing sources and making that data available to multiple stream-consuming systems concurrently, such as HDFS or HBase.
• A Kafka-Hadoop data pipeline supports real-time big data analytics.
Kafka
• Kafka is essentially a store for messages that come from processes (one or many) called producers.
• The data or messages are partitioned into different partitions within various topics.
• Within a topic's partition, the messages are indexed and stored together with a timestamp.
• On the other end, other processes called consumers can read messages from these partitions.
• Kafka sits between these producers and consumers, and runs on a cluster of one or more servers.
• The partitions can be distributed across cluster nodes.
• Apache Kafka efficiently processes real-time, streaming data when implemented along with Apache Storm, Apache HBase and Apache Spark.
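• A minimal Java producer (assuming the kafka-clients dependency; the broker address and the topic name "sensor-readings" are placeholders) publishes records that Kafka then partitions, indexes, and stores:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // The key ("sensor-1") determines which partition of the topic receives the record.
        producer.send(new ProducerRecord<>("sensor-readings", "sensor-1", "23.7"));
        producer.close();
    }
}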
Kafka
Kafka
• Kafka is deployed as a cluster on multiple servers, and it handles its entire publish-subscribe messaging system with the help of four APIs, i.e., the Producer API, Consumer API, Streams API and Connector API.
• Due to its ability to deliver massive streams of messages in a fault-tolerant manner, it is used as an alternative to some of the conventional messaging systems like JMS, AMQP, etc.
• The major terms of Kafka's architecture are topics, records, and brokers.
• Topics consist of streams of records holding different information.
• Brokers, on the other hand, are responsible for replicating the messages.
Kafka
• Producer API:
• It permits applications to publish streams of records.
• Consumer API:
• It permits applications to subscribe to topics and process the streams of records.
• Streams API:
• It converts input streams to output streams and produces the result.
• Connector API:
• It provides reusable producers and consumers that can link topics to existing applications.
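• A matching consumer built on the Consumer API (same placeholder broker and topic; the group id "analytics" is arbitrary) subscribes to the topic and processes records as they arrive:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");    // placeholder broker address
        props.put("group.id", "analytics");                  // consumer group for scale-out reading
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("sensor-readings"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}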
Kafka
• Kafka Real-Time Applications:
• Messaging:- It is used as a substitute message broker for a variety of applications.
• Website Activity Tracking:- It is able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds.
• Metrics:- It is used for operational monitoring of data.
• Log Aggregation:- It combines physical log files from servers and places them in a central processing location.
• Stream Processing
Real Time Analytics Stack
