
UNIT – I

DATA STREAMING TECHNIQUES

TOPICS:

Overview, towards continuous data processing: the requirements, stream processing foundations, stream processing, differences between real-time and streaming systems, the architectural blueprint, security for streaming systems.

Introducing streaming data:
1.1. Overview:
Data is flowing everywhere around us, through phones, credit cards, sensor-
equipped buildings, vending machines, thermostats, trains, buses, planes, posts
to social media, digital pictures and video—and the list goes on.

What is Streaming?
The term "streaming" describes continuous, never-ending data streams with no beginning or end that provide a constant feed of data which can be utilized and acted upon without needing to be downloaded first.

Similarly, data streams are generated by all types of sources, in various formats
and volumes. From applications, networking devices, and server log files, to
website activity, banking transactions, and location data, they can all be
aggregated to seamlessly gather real-time information and analytics from a
single source of truth.

1.2. Towards continuous data processing:


What is Data Streaming?

In the early years of the internet, connectivity wasn't always reliable, and bandwidth limitations often prevented streaming data from arriving at its destination in an unbroken sequence. Developers created buffers to allow data streams to catch up, but the resulting jitter caused such a poor user experience that most consumers preferred to download content rather than stream it.
Data streaming pulls together data from various sources for analysis and the
creation of outputs, such as reports and alerts.

Data Streaming is a technology that allows continuous transmission of data in real time from a source to a destination. Rather than waiting for the complete data set to be collected, you can receive and process data as soon as it is generated. A continuous flow of data, i.e. a data stream, is made up of a series of data elements ordered in time. The data in this stream denotes an event or change in the business that is useful to know about and analyze in real time. Data streaming is the continuous transfer of data from one or more sources at a steady, high speed for processing into specific outputs.
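To make this concrete, a single element of such a stream might look like the following record (a minimal Python sketch; the field names and values are invented for illustration):

    # One illustrative element of a data stream: a time-stamped event record.
    # Field names and values are made up for illustration purposes.
    event = {
        "event_id": "txn-10482",
        "timestamp": "2024-03-01T12:04:56Z",  # the ordering key within the stream
        "source": "pos-terminal-7",           # which device or system emitted it
        "type": "card_payment",               # what happened in the business
        "amount": 42.50,
        "currency": "USD",
    }

Each such element is appended to the stream in time order, and downstream consumers process the elements one by one as they arrive.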

Data streaming is the process of continuously collecting data as it's generated


and moving it to a destination. This data is usually handled by stream
processing software to analyze, store, and act on this information. Data
streaming combined with stream processing produces real-time intelligence.

Data Streaming

Also known as stream processing or event streaming, data streaming is the


continuous flow of data as it's generated, enabling real-time processing and
analysis for immediate insights. With every industry reliant on real-time data today, data streaming platforms like Confluent power everything from multiplayer games, real-time fraud detection, and social media feeds to stock trading platforms and GPS tracking.

As an example, the video that you see on YouTube is a Data Stream of the
video being played by your mobile device. As more and more devices connect
to the Internet, Streaming Data helps businesses access content immediately
rather than waiting for the whole entity to be downloaded.

With the advent of the Internet of Things (IoT), personal health monitoring and home security systems have also seen great demand in the market. For instance, multiple health sensors are available that continuously provide metrics such as heartbeat, blood pressure, or oxygen levels, allowing you to have a timely analysis of your health. Similarly, home security sensors can detect and report any unusual activity at your residence, or even save that data for identifying harder-to-detect patterns later.
In previous years, legacy infrastructure was much more structured because it
only had a handful of sources that generated data. The entire system could be
architected in a way to specify and unify the data and data structures. With the
advent of stream processing systems, the way we process data has changed
significantly to keep up with modern requirements.

Ex 1: How Streaming Data Works:


Let’s start with an analogy to help frame the concept before we dive into the
details. One way to think of streaming data is that it’s like when radio stations
constantly broadcast on particular frequencies. (These frequencies are like data
topics and you won’t consume them until you turn your processor on to them.)
When you tune your radio to a given frequency, your radio picks it up and
processes it to become audio you can understand. You want your radio to be
fast enough to keep up with the broadcast and if you want a copy of the music
you have to record it, because once it’s broadcast, it’s gone.
Two primary layers are needed to process streaming data when using
streaming systems like Apache Kafka, Confluent, Google Pub Sub, Amazon
Kinesis, and Azure Event Hubs:

1. Storage: This layer should enable low-cost, quick, replayable reads and writes of large data streams by supporting strong consistency and record ordering.
2. Processing: This layer consumes and runs computations on data from the storage layer. It also directs the storage layer to remove data no longer needed.
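To illustrate these two layers, here is a minimal sketch using the kafka-python client, with Kafka playing the storage role and a consumer loop playing the processing role. The broker address, topic name, and record shape are assumptions made for illustration:

    # A minimal sketch of the storage and processing layers, assuming a Kafka
    # broker at localhost:9092 and a topic named "events" (both hypothetical).
    # Requires the kafka-python package.
    import json
    from kafka import KafkaProducer, KafkaConsumer

    # Storage layer: Kafka durably records the stream with per-partition ordering.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("events", {"sensor_id": 7, "temperature_c": 21.4})
    producer.flush()

    # Processing layer: consume the stream and run a computation on each record.
    consumer = KafkaConsumer(
        "events",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    for record in consumer:
        event = record.value
        if event["temperature_c"] > 20.0:
            print(f"High reading from sensor {event['sensor_id']}")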
There’s a broader cloud architecture needed to execute streaming data to its
fullest potential. Stream processing systems like Apache Kafka can consume,
store, enrich, and analyze data in motion. And, a number of cloud service
companies offer the capability for you to build an “off-the-shelf” data stream.
However, these options may not meet your requirements or you may face
challenges working with your legacy databases or systems. The good news is
that there is a robust ecosystem of tools you can leverage, some of them open
source, to build your own “bespoke” data stream.

How to build your own data stream: here we describe how streaming data works and the data streaming technologies used at each of the four key steps to building your own data stream.

1. Aggregate all your data sources: using a change data capture (CDC) streaming tool, pull data from relational databases or transactional systems, which may be located on-premises or in the cloud. You will then connect these sources to a stream processor.
2. Build a stream processor: using a tool such as Apache Kafka or Amazon Kinesis. The data will typically be processed sequentially and incrementally on a record-by-record basis, but it can also be processed over sliding time windows (see the sketch after the list below).

Your stream processor should:

 Be scalable: because data volume can vary greatly over time.


 Be fast: because data can quickly lose its relevance and because data flows
continuously, so you can’t go back and get any missed data due to latency.
 Be fault tolerant: because data never stops flowing from many sources and
in a variety of formats.
 Be integrated: because the data should be immediately passed to
downstream applications for presentation or triggered actions.
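To illustrate the record-by-record and sliding-window processing described in step 2, here is a minimal sketch in plain Python rather than any particular framework's API; the window length and readings are invented for illustration:

    # Record-by-record processing over a sliding time window (illustrative data).
    from collections import deque

    WINDOW_SECONDS = 60
    window = deque()  # (timestamp, value) pairs currently inside the window

    def process(event_time, value):
        """Add one record, evict expired records, return the windowed average."""
        window.append((event_time, value))
        while window and window[0][0] < event_time - WINDOW_SECONDS:
            window.popleft()
        return sum(v for _, v in window) / len(window)

    # Records arrive one at a time, as they would in a stream processor.
    for t, v in [(0, 10.0), (20, 14.0), (45, 12.0), (90, 8.0)]:
        print(f"t={t}s rolling average: {process(t, v):.2f}")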
3. Query or store: the streaming data. Leading tools to do this include Google
BigQuery, Snowflake, Amazon Kinesis Data Analytics, and Dataflow. These
tools can perform a broad range of analytics such as filtering, aggregating,
correlating, and sampling.
There are two approaches to do this:

 Query the data stream itself as it’s streaming: using KSQL (now ksqlDB),
a streaming SQL engine for Apache Kafka. KSQL provides an interactive
SQL interface for you to process data in real time in Kafka without writing
code. It supports stream processing operations such as joins, aggregations,
sessionization, and windowing.
 Store your streamed data: In this more traditional approach, you store the
message in a database or data warehouse and query after you’ve received and
stored it. Most companies choose to keep all their data given that the cost of
storage is low. Leading options for storing streaming data include Amazon
S3, Amazon Redshift, and Google Storage.
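As a sketch of the second, store-then-query approach, the snippet below appends consumed records to a newline-delimited JSON file standing in for object storage such as Amazon S3; the path, record shape, and query are hypothetical:

    # Store streamed records, then query them later (illustrative names only).
    import json

    def store_records(records, path="stream_archive.jsonl"):
        """Append each consumed record as one JSON line."""
        with open(path, "a", encoding="utf-8") as sink:
            for record in records:
                sink.write(json.dumps(record) + "\n")

    def query_stored(path="stream_archive.jsonl", min_amount=100.0):
        """Read the stored data back and filter it, the traditional way."""
        with open(path, encoding="utf-8") as source:
            return [r for line in source
                    if (r := json.loads(line))["amount"] >= min_amount]

    store_records([{"order_id": 1, "amount": 250.0},
                   {"order_id": 2, "amount": 40.0}])
    print(query_stored())  # [{'order_id': 1, 'amount': 250.0}]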
4. Output for analysis, alerts, real-time applications, data science, and
machine learning or AutoML: Once the streaming data has passed through the
query or store phase, it can output for multiple use cases:
 The best BI and analytics tools support data stream integration for a variety
of streaming analytics use cases such as powering interactive data
visualizations and dashboards which alert you and help you respond to
changes in KPIs and metrics. These real-time alerts are especially helpful in
detecting fraud.
 Streaming data can also trigger events in an application and/or system such
as an automated trading system to process a stock trade based on predefined
rules.
 Data scientists can apply algorithms in-stream instead of waiting for data to
reside in a database. This allows you to query and create visualizations of
real-time data.
 Machine learning and AutoML (automated machine learning) models can
benefit from incremental learning Python libraries such as Creme to stream
over a dataset in sequential order and interleave prediction and learning steps.
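As a hedged sketch of that last point, the snippet below interleaves prediction and learning over a stream, assuming creme's predict_one/fit_one incremental interface (creme has since merged into the river project); the feature names and targets are invented:

    # Incremental (online) learning over a stream with the creme library.
    from creme import linear_model, metrics, preprocessing

    model = preprocessing.StandardScaler() | linear_model.LinearRegression()
    mae = metrics.MAE()

    stream = [({"clicks": 10}, 1.2), ({"clicks": 25}, 2.9), ({"clicks": 40}, 4.1)]
    for features, target in stream:
        prediction = model.predict_one(features)  # predict before seeing the label
        mae.update(target, prediction)            # score the prediction
        model = model.fit_one(features, target)   # then learn from this record
    print(mae)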

Ex 2: How data streaming works:


The advent of broadband internet, cloud computing and the internet of things
(IoT) have made data streaming easier. Today, businesses regularly use data
from IoT devices and other streaming sources to make data-driven
decisions and facilitate real-time analytics. Many companies have replaced
traditional batch processing with streaming data architectures that can also accommodate batch processing of high volumes of data.

In batch processing, new data elements are collected in a group and the entire
group is processed at some future time. In contrast, a streaming data architecture
or stream processor handles data in motion and an extract, load and transform
(ELT) batch is treated as an event in a continuous stream of events. Streams of
enterprise data are fed into data streaming software, which then routes the
streams into storage and processing, and produces outputs, such as reports and
analytics.


Overview of Stream Data Processing:

Today's data is generated by an almost limitless number of sources: IoT sensors, servers, security logs, applications, and internal or external systems. It's almost impossible to regulate structure and data integrity, or to control the volume and velocity of the data generated.

While traditional solutions are built to ingest, process, and structure data before
it can be acted upon, streaming data architecture adds the ability to consume,
persist to storage, enrich, and analyze data in motion.
1.3. Requirements:

As such, applications working with data streams will always require two main
functions: storage and processing. Storage must be able to record large streams
of data in a way that is sequential and consistent. Processing must be able to
interact with storage, consume, analyze and run computation on the data.

This also brings up additional challenges and considerations when working with
legacy databases or systems. Many platforms and tools are now available to
help companies build streaming data applications.

Data streams combine various sources and formats to create a comprehensive


view of operations. For instance, combining network, server, and application
data can monitor website health and quickly detect performance issues or
outages.



Examples of Streaming Data


The most common use cases for data streaming are streaming media, stock
trading, and real-time analytics. However, data stream processing is broadly
applied in nearly every industry today. This is due to the continuing rise of big
data, the Internet of Things (IoT), real-time applications, and customer
expectations for personalized recommendations and immediate response.
Streaming data is critical for any application that depends on in-the-moment information to support the following use cases:

 Streaming media
 Stock trading
 Real-time analytics
 Fraud detection
 IT monitoring
 Instant messaging
 Geolocation
 Inventory control
 Social media feeds
 Multiplayer video games
 Ride-sharing

Other examples of applying real-time data streaming include:

 Delivering a seamless, up-to-date customer experience across devices.


 Monitoring equipment and scheduling service or ordering new parts when
problems are detected.
 Optimizing online content and promotional offers based on each user’s
profile.
Examples of data streams
Data streaming use cases include the following:

 Weather data.
 Data from local or remote sensors.
 Transaction logs from financial systems.
 Data from health monitoring devices.
 Website activity logs.

Data comes in a steady, real-time stream, often with no beginning or end. Data
may be acted upon immediately, or later, depending on user requirements.
Streams are time-stamped because they're often time-sensitive and lose value
over time. The streamed data is also often unique and not likely repeatable; it
originates from various sources and might have different formats and structures.

For example, various production sensors on a manufacturing production


line capture different types of data and aggregate the data. Each sensor's data is
then combined with data from the other sensors to provide a detailed view of the
production system. A manufacturing resource planning system can use data
from the various sensors to further refine how the production systems may be
used, when they are scheduled, when maintenance is needed and other
important metrics.

Examples:
Some real-life examples of streaming data include use cases in every industry,
including real-time stock trades, up-to-the-minute retail inventory management,
social media feeds, multiplayer games, and ride-sharing apps.

For example, when a passenger requests a ride with Lyft, real-time streams of data join together to create a seamless user experience. Through this data, the application pieces together real-time location tracking, traffic stats, and pricing data to simultaneously match the rider with the best possible driver, calculate pricing, and estimate the time to destination based on both real-time and historical data.

Data Stream Examples

Data streams capture critical real-time data, such as location, stock prices, IT
system monitoring, fraud detection, retail inventory, sales, and customer
activity.

The following companies use some of these data types to power their business
activity.

1. Lyft:

Lyft requires real-time data to match riders with drivers accurately, displaying
current vehicle availability and prices based on distance, demand, and traffic
conditions. This data needs to be instantly available to set accurate user
expectations.

After the rider selects a service level, Lyft uses additional GPS and traffic data
to match the best driver to the rider based on vehicle availability, distance,
driver status, and expected time of arrival.

Lyft uses location data from the driver's phone to track their progress, match
them with other ride requests, and provide real-time updates on traffic
conditions. They have optimized their processors to handle and aggregate these
data streams for an enhanced customer experience.

2. YouTube:

YouTube processes and stores a massive amount of data every hour due to the
more than 500 hours of video uploaded every minute, according to Statista.

YouTube must ensure high availability to support creators' content and provide
real-time data to viewers, including view counts, comments, subscribers, and
other metrics. YouTube supports live videos with real-time interaction between
content creators and viewers, requiring critical instant data transfer for
uninterrupted conversations.


Streaming data is the first step for any data-driven organization, fueling big data
ingestion, integration, and real-time analytics.

Pros and cons of data streaming:


Data streaming comes with both advantages and drawbacks. Among the
advantages are the following:

 Real-time business insights. Streamed data can be particularly useful for


businesses that rely on real-time or near-real-time information for informed
decision-making. Streaming lets businesses quickly identify trends and
patterns and react fast to market changes.
 Multiple data flows. Data streaming is beneficial in situations where a
continuous flow of data from multiple data pipelines must be processed into
useful output. By bringing together data from various applications, streamed
data can provide a variety of outputs based on user requirements.
 System visibility. Data streaming helps IT organizations identify issues
quickly before they become problems.
 Scalability. Real-time data processing lets businesses handle large, complex
data sets. This can be important for businesses that are growing rapidly and
need to scale and optimize their data processing capabilities to keep up with
demand.

The following are some of the drawbacks of data streaming:

 Data overload. With so much data being processed in real time, it can be
difficult to identify the most relevant information. This can lead to
businesses becoming overwhelmed by the data volume and unable to make
meaningful decisions.
 Cost. Data streaming can be expensive, particularly if businesses must invest
in new hardware and software to support it.
 Data loss or corruption. With traditional data processing methods,
businesses may be able to recover lost data from backups or other sources.
However, with data streaming, there's a risk that data may be lost or
corrupted in real time, making it impossible to recover.
 Overhead. Data streaming requires storage and processing elements, such as
a data warehouse or data lake, to prepare data for later use. The added
overhead associated with data streaming must be analyzed in terms of its
return on investment.
1.4. Streaming Processing Foundations:
What Is Stream Processing?

Stream processing means collecting and processing data in real time as it is generated, or in near-real time when that is what a particular use case requires. It is vital in circumstances where any delay would lead to negative results, and it is one method of managing the ever-increasing volumes of data being generated nowadays.

The technology reads and processes a data stream continually from input
sources, writes the results to an output stream, and can use multiple threads to
enable parallelism. Stream processing can therefore support many applications
that require real-time data analysis and decision-making, such as generating
reports or triggering responses with minimal latency.

Some tasks that this complex event processing method is commonly used for
are loan risk analysis, anti-fraud detection, sensor data monitoring, and target
marketing.

A stream processing framework will ingest streaming data from input sources,
analyze it, and write the results to output streams. The processor will typically
have the following four main components:

 Input sources – where data is read from (examples include Kafka, Flume,
Social Media, and IoT Sensors).
 Output streams – where the processed data is written to (e.g., HDFS,
Cassandra, and HBase).
 Processing logic – defines how the data is processed (this can be done with
Java, Scala, or Python code).
 State management – allows the processor to keep track of its progress and
maintain state information, which can be further used for exactly-once
processing (i.e., when the same output is generated regardless of how many
times the input stream is read).
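The four components can be tied together in a small, self-contained sketch; the events, logic, and state store below are illustrative stand-ins rather than any framework's API:

    # A toy stream processor showing input source, processing logic, state
    # management, and an output stream. All names and data are illustrative.

    def input_source():
        """Stand-in for Kafka, Flume, or IoT sensors: yields events one at a time."""
        yield from [{"user": "a", "amount": 10}, {"user": "b", "amount": 99}]

    state = {}  # state management: a running total per user

    def processing_logic(events):
        """Enrich each event with the user's running total (stateful logic)."""
        for event in events:
            state[event["user"]] = state.get(event["user"], 0) + event["amount"]
            yield {**event, "running_total": state[event["user"]]}

    def output_stream(results):
        """Stand-in for HDFS, Cassandra, or HBase: here we just print results."""
        for result in results:
            print(result)

    output_stream(processing_logic(input_source()))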

Stream processing engine components

The stream processing engine organizes data from the input source into short batches and presents them as a continuous stream of output to other applications, simplifying the logic for developers who (re)combine data from different sources and time scales, which are all relative when it comes to real-time analysis.

The processing logic component is where most of the work is done, simplifying
the necessary tasks in data management for consistently and securely ingesting,
processing, and publishing data. This stage is where you define the
transformations that are applied to the data as it is consumed from a publish-
subscribe service before it is published back there or to other data storage.


Examples of processes here could be to analyze, filter, combine, transform, or


clean the data. For example, you might want to extract certain fields from the
data, perform some aggregations, or join different streams together.

State management is important in stream processing because, unlike batch


methods, data is processed as it arrives, meaning that the processing framework
needs to keep track of its progress. To provide exactly-once processing, the
framework needs to store state information somewhere – typically a key-value
store – where it can be restored from if necessary.

For example, if the stream processor crashes, it can be restarted from the last
checkpoint and will then pick up where it left off. Likewise, if the input stream
is replayed, the output stream will be generated correctly, even though the data
has already been processed once.
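A minimal sketch of this checkpointing idea, with a JSON file standing in for the key-value store: the input offset and the computed state are saved together, so a restarted processor resumes from the last checkpoint without double-counting. All names are illustrative:

    # Toy checkpointing: offset and state move together, so restarts resume
    # where processing left off. The JSON file stands in for a key-value store.
    import json, os

    CHECKPOINT = "checkpoint.json"

    def load_checkpoint():
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                return json.load(f)
        return {"offset": 0, "count": 0}

    def save_checkpoint(cp):
        tmp = CHECKPOINT + ".tmp"
        with open(tmp, "w") as f:
            json.dump(cp, f)
        os.replace(tmp, CHECKPOINT)  # atomic rename keeps offset and state consistent

    stream = ["e1", "e2", "e3", "e4"]
    cp = load_checkpoint()
    for offset in range(cp["offset"], len(stream)):
        cp["count"] += 1           # the "computation": count events
        cp["offset"] = offset + 1
        save_checkpoint(cp)        # checkpoint after each record (coarse but safe)
    print(f"processed {cp['count']} events up to offset {cp['offset']}")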

1.5. Stream processing:

Stream processing is a data management technique that involves ingesting a


continuous data stream to quickly analyze, filter, transform or enhance the data
in real time. Once processed, the data is passed off to an application, data store
or another stream processing engine.

Stream Processing:

Streaming the data is only half the battle. You also need to process that data to
derive insights.

Stream processing software is configured to ingest the continual data flow down
the pipeline and analyze that data for patterns and trends. Stream processing
may also include data visualization for dashboards and other interfaces so that
data personnel may also monitor these streams.
Data streams and stream processing are combined to produce real-time or near
real-time insights. To accomplish this, stream processors need to offer low
latency so that analysis happens as quickly as data is received. A drop in
performance by the stream processor can lead to a backlog or data points being
missed, threatening data integrity.
Stream processing software needs to scale and be highly available. It should
handle spikes in traffic and have redundancies to prevent software crashes.
Crashes reduce your data quality since the stream is not analyzed for however
long the outage persists.

Benefits of Data Streaming:

Traditional data pipelines extract, transform, and load data before it can be acted
upon. But given the wide variety of sources and the scale and velocity by which
the data is generated today, traditional data pipelines are not able to keep up for
near real-time or real-time processing.
If your organization deals with big data and produces a steady flow of real-time
data, a robust streaming data process will allow you to respond to situations
faster. Ultimately, this can help you:
 Increase your customer satisfaction

 Make your company more competitive


 Reduce your infrastructure expenses
 Reduce fraud and other losses
Below are the specific features and benefits which ladder up these higher-level
outcomes.
Competitiveness and customer satisfaction: Stream processing enables
applications such as modern BI tools to automatically produce reports, alarms
and other actions in response to the data, such as when KPIs hit thresholds.
These tools can also analyze the data, often using machine learning algorithms,
and provide interactive visualizations to deliver you real-time insights. These
in-the-moment insights can help you respond faster than your competitors to
market events and customer issues.
Reduce your infrastructure expenses: Traditional data processing typically involves storing massive volumes of data in data warehouses or data lakes. In event stream processing, data is typically stored in lower volumes, and therefore you enjoy lower storage and hardware costs. Plus, data streams allow you to better monitor and report on your IT systems, helping you troubleshoot servers, systems, and devices.

Reduce Fraud and Other Losses: Being able to monitor every aspect of your business in real time keeps you aware of issues that can quickly result in significant losses, such as fraud, security breaches, inventory outages, and production issues. Real-time data streaming lets you respond quickly to, and even prevent, these issues before they escalate.

Data streaming provides real-time insight by leveraging the latest internal and
external information to inform decision-making in day-to-day operations and
overall strategy.

Let's examine a few more benefits of data streaming.

Increase ROI

Real-time intelligence gives companies a competitive edge by enabling quick


data collection, analysis, and action. It enhances responsiveness to market
trends, customer needs, and business opportunities, making it a valuable
distinguishing feature in the fast-paced digitalized business environment.

Increase Customer Satisfaction

Responding quickly to customer complaints and providing resolutions improves


a company's reputation, leading to positive word-of-mouth advertising and
online reviews that attract new prospects and convert them into customers.

Reduce Losses

Data streaming not only supports customer retention but also prevents losses by
providing real-time intelligence on potential issues such as system outages,
financial downturns, and data breaches. This allows companies to proactively
mitigate the impact of these events.

Let’s look at some salient features of Hevo:

 Fully Managed: It requires no management and maintenance as Hevo is


a fully automated platform.
 Data Transformation: It provides a simple interface to perfect, modify,
and enrich the data you want to transfer.
 Real-Time: Hevo offers real-time data migration. So, your data is always
ready for analysis.
 Schema Management: Hevo can automatically detect the schema of the
incoming data and map it to the destination schema.
 Live Monitoring: Advanced monitoring gives you a one-stop view to
watch all the activities that occur within pipelines.
 Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.


What are the Challenges of Data Streaming?

There are various challenges that have to be considered while dealing with Data
Streams:

 High Bandwidth Requirements


 Memory and Processing Requirements
 Requires Intelligent and Versatile Programs
 Scalability
 Contextual Ordering
 Continuous Upgradation and Adaptability

1) High Bandwidth Requirements

Unless the Data Stream is delivered in real-time, most of its benefits may not be
realized. With a variety of devices located at variable distances and generating
different volumes of data, network bandwidth must be sufficient to deliver this
data to its consumers.

2) Memory and Processing Requirements

Since data from the Data Stream is arriving continuously, a computer system
must have enough memory to store it and ensure that any part of the data is not
lost before it’s processed. Also, computer programs that process this data need
CPUs with more processing power as newer data may need to be interpreted in
the context of older data and it must be processed quickly before the next set of
data arrives.

Generally, each data packet received includes information about its source and
time of generation and must be processed sequentially. The processing should
be powerful enough to show upsells and suggestions in real-time, based on
users’ choices, browsing history, and current activity.

3) Requires Intelligent and Versatile Programs

Handling data coming from various sources at varying speeds, having diverse
semantic meanings and interpretations, coupled with multifarious processing
needs is not an easy task.

4) Scalability

Another challenge Streaming Data presents is scalability. Applications should


scale to arbitrary and manifold increases in memory, bandwidth, and processing
needs.

Consider the case of a tourist spot and its related footfall and ticketing data. During peak hours, and at random times during a given week, the footfall would increase sharply for a few hours, leading to a big increase in the volume of data being generated. When a server goes down, the log data being generated increases manifold to include problems, cascading effects, events, symptoms, and so on.

5) Contextual Ordering

This is another issue that Streaming Data presents: the need to keep data packets in contextual order or logical sequences.

For example, during an online conference, it’s important that messages are
delivered in a sequence of occurrences, to keep the chat in context. If a
conversation is not in sequence, it will not make any sense.
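One common way to restore contextual order is a reorder buffer keyed by sequence number; the sketch below (with invented messages) holds out-of-order arrivals and releases them only when the next expected sequence number is available:

    # A toy reorder buffer: deliver messages in sequence-number order even when
    # they arrive out of order. The messages are invented for illustration.
    import heapq

    buffer = []    # min-heap of (sequence_number, text)
    next_seq = 1

    def on_arrival(seq, text):
        """Buffer an incoming message and deliver any now-contiguous messages."""
        global next_seq
        heapq.heappush(buffer, (seq, text))
        while buffer and buffer[0][0] == next_seq:
            _, msg = heapq.heappop(buffer)
            print(f"deliver #{next_seq}: {msg}")
            next_seq += 1

    # Messages arrive out of order over the network:
    for seq, text in [(2, "world"), (1, "hello"), (4, "later"), (3, "again")]:
        on_arrival(seq, text)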

6) Continuous Upgradation and Adaptability

As more and more processes are digitized and devices connect to the internet, the diversity and quantum of the Data Stream keep increasing. This means that the programs that handle it have to be updated frequently to handle different kinds of data.
Building applications that can handle and process Streaming Data in real time is challenging, taking into account many factors like the ones stated above. Hence, businesses can use tools like Hevo that help stream data to the desired destination in real time.

Data Stream Challenges to Consider

Data streaming opens a world of possibilities, but it also comes with challenges
to keep in mind as you incorporate real-time data into your applications.

1. Availability:

Data needs to be accessed and logged in a datastore for historical context. If you
can't view previous subscription periods, you may miss opportunities to offer
valuable products or services based on a customer's purchase history.

2. Timeliness:

Data streams must be constantly updated to avoid stale information and ensure
that the user's actions in one tab are reflected across all tabs.

3. Scalability:

To avoid data loss during spikes in volume or system outages, it's crucial to
build failsafes into your system and provision extra computing and storage
resources.

4. Ordering:

Recording a sequence of customer interactions in your CRM provides deeper


insights than just tracking individual web page visits. For example, you can see
when a person has downloaded related eBooks, viewed a product demo, and
visited the product page, giving you a clearer understanding of their interest in
the product.
What are the Use Cases of Data Streaming?

Here are a few use cases of data streaming:

 Information about your location.


 Detection of fraud.
 Live stock market trading.
 Analytics for business, sales, and marketing.
 Customer or user behaviour.
 Reporting on and keeping track of internal IT systems.
 Troubleshooting systems, servers, gadgets, and more via log monitoring.
 SIEM (Security Information and Event Management): Monitoring,
metrics, and threat detection using real-time event data and log analysis.
 Retail/warehouse inventory: A smooth user experience across all devices,
inventory management across all channels and locations.
 Matching for ridesharing: Matching riders with the best drivers based on
proximity, destination, pricing, and wait times by using location, user, and
pricing data for predictive analytics.
 AI and machine learning: This opens up new opportunities for predictive
analytics by fusing the past and present data into one brain.

Streaming Use Cases


There are many use cases for event streaming. Because it more closely
resembles how things work in the real world, almost any business process can
be represented better with event streaming than it could be with batch
processing. This includes predictive analytics, machine learning, generative AI,
fraud detection, and more.

You will find event streaming being used in a broad selection of businesses,
such as media streaming, omnichannel retail experiences, ride-sharing, etc.

For example, when a passenger requests a ride with Lyft, not only does the application know
which driver to match them to, but it also knows how long it will take based on
real-time location and historical traffic data. It can also determine how much it
should cost based on both real-time and past data.

Typical Use Cases:


 Location data
 Fraud detection
 Real-time stock trades
 Marketing, sales, and business analytics
 Customer/user activity
 Monitoring and reporting on internal IT systems
 Log Monitoring: Troubleshooting systems, servers, devices, and more
 SIEM (Security Information and Event Management): analyzing logs and real-
time event data for monitoring, metrics, and threat detection
 Retail/warehouse inventory: inventory management across all channels and
locations, and providing a seamless user experience across all devices
 Ride share matching: Combining location, user, and pricing data for predictive analytics - matching riders with the best drivers in terms of proximity, destination, pricing, and wait times
 Machine learning and A.I.: By combining past and present data for one central
nervous system, this brings new possibilities for predictive analytics
 Predictive analytics

What are the use cases for streaming data?

A stream processing system is beneficial in most scenarios where new and


dynamic data is generated continually. It applies to most of the industry
segments and big data use cases.
Companies generally begin with simple applications, such as collecting system
logs and rudimentary processing like rolling min-max computations. Then,
these applications evolve to more sophisticated near real-time processing.
Here are some more examples of streaming data.

Data analysis

Applications process data streams to produce reports and perform actions in


response, such as emitting alarms when key measures exceed certain thresholds.
More sophisticated stream processing applications extract deeper insights by
applying machine learning algorithms to business and customer activity data.
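A minimal sketch of that alarm pattern, checking each reading against a threshold as it arrives; the metric, threshold, and readings are assumptions for illustration:

    # Emit an alarm when a key measure exceeds a threshold, record by record.
    ERROR_RATE_THRESHOLD = 0.05  # illustrative threshold

    def watch(readings):
        for minute, error_rate in readings:
            if error_rate > ERROR_RATE_THRESHOLD:
                print(f"ALARM at minute {minute}: error rate {error_rate:.2%} "
                      f"exceeds {ERROR_RATE_THRESHOLD:.0%}")

    watch([(1, 0.01), (2, 0.02), (3, 0.09), (4, 0.03)])  # minute 3 triggers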

IoT applications

Internet of Things (IoT) devices are another use case for streaming data.
Sensors in vehicles, industrial equipment, and farm machinery send data to a
streaming application. The application monitors performance, detects potential
defects in advance, and automatically places a spare part order, preventing
equipment downtime.

Financial analysis

Financial institutions use stream data to track real-time changes in the stock
market, compute value at risk, and automatically rebalance portfolios based on
stock price movements. Another financial use case is fraud detection of credit
card transactions using real-time inferencing against streaming transaction data.

Real-time recommendations

Real estate applications track geolocation data from consumers’ mobile devices
and make real-time recommendations of properties to visit. Similarly,
advertising, food, retail, and consumer applications can integrate real-time
recommendations to give more value to customers.

Service guarantees

You can implement data stream processing to track and maintain service levels
in applications and equipment. For example, a solar power company has to
maintain power throughput for its customers or pay penalties. It implements a
streaming data application that monitors all panels in the field and schedules
service in real time. Thus, it can minimize each panel's periods of low
throughput and the associated penalty payouts.

Media and gaming

Media publishers stream billions of clickstream records from their online


properties, aggregate and enrich the data with user demographic information,
and optimize the content placement. This helps publishers deliver a better, more
relevant experience to audiences. Similarly, online gaming companies use event
stream processing to analyze player-game interactions and offer dynamic
experiences to engage players.

Risk control

Live streaming and social platforms capture user behavior data in real time for
risk control over users' financial activity, such as recharge, refund, and rewards.
They view real-time dashboards to flexibly adjust risk strategies.

What is a real-time system?


Real-time systems and real-time computing have been around for decades, but
with the advent of the internet they have become very popular. Unfortunately,
with this popularity has come ambiguity and debate. What constitutes a real-
time system?

Real-time systems are classified as hard, soft, and near. The definitions for hard and soft real-time are based on Hermann Kopetz’s book Real-Time Systems (Springer, 2011). For near real-time, we use the definition found in the Portland Pattern Repository’s wiki (http://c2.com/cgi/wiki?NearRealTime): “Denoting or relating to a data-processing system that is slightly slower than real-time.” Table 1.1 breaks out the common classifications of real-time systems along with the prominent characteristics by which they differ.

Table 1.1. Classification of real-time systems:

Hard. Examples: pacemaker, anti-lock brakes. Latency measured in: microseconds to milliseconds. Tolerance for delay: none (total system failure, potential loss of life).
Soft. Examples: airline reservation system, online stock quotes, VoIP (Skype). Latency measured in: milliseconds to seconds. Tolerance for delay: low (no system failure, no life at risk).
Near. Examples: Skype video, home automation. Latency measured in: seconds to minutes. Tolerance for delay: high (no system failure, no life at risk).

You can identify hard real-time systems fairly easily. They are almost always
found in embedded systems and have very strict time requirements that, if
missed, may result in total system failure. The design and implementation of
hard real-time systems are well studied in the literature.

Determining whether a system is soft or near real-time is harder, because the overlap in their definitions often results in confusion. Here are three examples:

 Someone you are following on Twitter posts a tweet, and moments later
you see the tweet in your Twitter client.
 You are tracking flights around New York using the real-time Live Flight
Tracking service from FlightAware
(http://flightaware.com/live/airport/KJFK).
 You are using the NASDAQ Real Time Quotes application
(www.nasdaq.com/quotes/real-time.aspx) to track your favorite stocks.

Although these systems are all quite different, figure 1.1 shows what they have
in common.
Figure 1.1. A generic real-time system with consumers

In each of the examples, is it reasonable to conclude that the time delay may
only last for seconds, no life is at risk, and an occasional delay for minutes
would not cause total system failure? If someone posts a tweet, and you see it
almost immediately, is that soft or near real-time? What about watching live
flight status or real-time stock quotes? Some of these can go either way: what if
there were a delay in the data due to slow Wi-Fi at the coffee shop or on the
plane? As you consider these examples, the line differentiating soft and near
real-time becomes blurry, at times disappears, is very subjective, and may often
depend on the consumer of the data.

Now let’s change our examples by taking the consumer out of the picture and
focusing on the services at hand:

 A tweet is posted on Twitter.


 The Live Flight Tracking service from FlightAware is tracking flights.
 The NASDAQ Real Time Quotes application is tracking stock quotes.

We don’t know how these systems work internally, but the essence of what we
are asking is common to all of them. It can be stated as follows:

Is the process of receiving data, all the way to the point where it is ready for consumption, a soft or near real-time process? This looks like figure 1.2.

Figure 1.2. A generic real-time system with no consumers

Does focusing on the data processing and taking the consumers of the data out
of the picture change your answer? For example, how would you classify the
following?

 A tweet posted to Twitter


 A tweet posted by someone whom you follow and your seeing it in your
Twitter client

If you classified them differently, why? Was it due to the lag or perceived lag in
seeing the tweet in your Twitter client? After a while, the line between whether
a system is soft or near real-time becomes quite blurry. Often people settle on
calling them real-time.

1.6. Differences between real-time and streaming systems:


A system may be labelled soft or near real-time based on the perceived delay
experienced by consumers. We have seen, with simple examples, how the
distinction between the types of real-time system can be hard to discern. This
can become a larger problem in systems that involve more people in the
conversation. Our goal here is to settle on a common language we can use to
describe these systems. When you look at the big picture, we are trying to use
one term to define two parts of a larger system. As illustrated in figure 1.3, the
end result is that the terminology breaks down, making it very difficult to communicate with others about these systems because we don’t have a clear definition.

Figure 1.3. Real-time computation and consumption split apart

On the left-hand side of figure 1.3 we have the non-hard real-time service, or
the computation part of the system, and on the right-hand side we have the
clients, called the consumption side of the system.

DEFINITION: STREAMING DATA SYSTEM

In many scenarios, the computation part of the system is operating in a non-hard


real-time fashion, but the clients may not be consuming the data in real time due
to network delays, application design, or a client application that isn’t even
running. Put another way, what we have is a non-hard real-time service with
clients that consume data when they need it. This is called a streaming data
system—a non-hard real-time system that makes its data available at the
moment a client application needs it. It’s neither soft nor near—it is streaming.
Figure 1.4 shows the result of applying this definition to our example
architecture from figure 1.3.

Figure 1.4. A first view of a streaming data system

The concept of streaming data eliminates the confusion of soft versus near and real-time versus not real-time, allowing us to concentrate on designing systems that deliver the information a client requests at the moment it is needed. Look at our earlier examples again from the standpoint of streaming, and you can split each one up into a streaming data service and a streaming client:

 Someone you are following on Twitter posts a tweet, and moments later
you see the tweet in your Twitter client.
 You are tracking flights around New York using the real-time Live Flight
Tracking service from FlightAware.
 You are using the NASDAQ Real Time Quotes application to track your
favorite stocks.

 Twitter— A streaming system that processes tweets and allows clients to


request the latest tweets at the moment they are needed; some may be
seconds old, and others may be hours old.
 FlightAware— A streaming system that processes the most recent flight
status data and allows a client to request the latest data for particular
airports or flights.
 NASDAQ Real Time Quotes— A streaming system that processes the
price quotes of all stocks and allows clients to request the latest quote for
particular stocks.

We need to think about and focus on what data a service makes available and how it makes that data available to clients at the moment they need it. The system is an in-the-moment system: any system that delivers the data at the point in time when it is needed. We don’t know how these systems work behind the scenes, but we are going to learn to assemble systems that use open source technologies to consume, process, and present data streams.

Now let’s look at the differences between stream processing and traditional batch processing.

Batch Processing vs Real-Time Streams:

Batch processing requires data to be downloaded before it is analyzed and


stored, while stream processing continuously ingests and analyzes data. Stream
processing is preferred for its speed, especially when real-time intelligence is
needed. Batch processing is used in scenarios where immediate analysis is not
necessary or when working with legacy technologies like mainframes.

With the complexity of today's modern requirements, legacy batch data


processing has become insufficient for most use cases, as it can only process
data as groups of transactions collected over time. Modern organizations need to
act on up-to-the-millisecond data, before the data becomes stale. Being able to
access data in real-time comes with numerous advantages and use cases.

Let’s compare the two concepts and their use cases.

Batch vs Stream Data Processing


Batch processing is a data processing technique where a set of data is


accumulated over time and processed in chunks, typically in periodic intervals.
Batch processing is suitable for the offline processing of large volumes of data
and can be resource-intensive. The data is processed in bulk, typically on a
schedule, and the results are stored for later use.

Stream processing, on the other hand, is a technique for processing data in real
time as it arrives. Stream processing is designed to handle continuous, high-
volume data flows and is optimized for low resource usage. The data is
processed as it arrives, allowing for real-time analysis and decision-making.

Stream processing often uses in-memory storage to minimize latency and


provide fast access to data.

In summary, batch processing is best suited for the offline processing of large
volumes of data, while stream processing is designed for the real-time
processing of high-volume data flows.

Let’s look at the differences between batch and stream processing in a more
concise manner.

Batch Processing | Stream Processing
Processes data in chunks accumulated over time | Processes data in real time as it arrives
High latency | Low latency
Can handle large volumes of data | Designed to handle high-volume data flows
Resource-intensive | Optimized for low resource usage
Suitable for offline processing | Suitable for real-time data analysis
May require significant storage resources | Often uses in-memory storage
Typically processes data in periodic intervals | Continuously processes data as it arrives
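The contrast is easy to see in code: the same aggregation computed once over an accumulated batch versus incrementally per record. The event values are invented for illustration:

    # The same total computed two ways (illustrative data).
    events = [5, 3, 8, 1]

    # Batch processing: accumulate everything first, then process the group.
    batch_total = sum(events)  # available only after the batch closes
    print("batch total:", batch_total)

    # Stream processing: update the result incrementally as each record arrives.
    running_total = 0
    for value in events:
        running_total += value
        print("running total so far:", running_total)  # low-latency partial results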

Practically, mainframe-generated data is typically processed in batch form.


Integrating this data into modern analytics systems can be time-consuming,
making it difficult to transform it into streaming data. However, stream
processing can be valuable for tasks such as fraud detection, as it can quickly
identify anomalies in transaction data in real time, allowing fraudulent
transactions to be stopped before they are completed.
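As a hedged illustration of in-stream fraud detection, the sketch below flags a transaction whose amount deviates sharply from the rolling mean of recent amounts; the threshold and data are invented, and real fraud systems use far richer models:

    # A toy in-stream anomaly check over transaction amounts (illustrative data).
    from collections import deque
    from statistics import mean, stdev

    recent = deque(maxlen=50)  # rolling history of recent transaction amounts

    def check(amount):
        """Return True if the amount looks anomalous versus recent history."""
        suspicious = False
        if len(recent) >= 10:
            mu, sigma = mean(recent), stdev(recent)
            suspicious = sigma > 0 and abs(amount - mu) > 3 * sigma
        recent.append(amount)
        return suspicious

    for amount in [21.0, 19.5, 22.3, 20.1, 18.9, 21.7,
                   20.4, 19.8, 22.0, 20.6, 950.0]:
        if check(amount):
            print(f"possible fraud: {amount}")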

What are the Benefits of Data Streaming?

Here are some benefits of data streaming:

 Stream Processing: Stream processing is one of the key benefits of data


streaming, as it allows for the real-time processing and analysis of data as
it is generated. Stream processing systems can handle high volumes of
data, and are able to process data quickly and with low latency, making
them well-suited for big data applications.
 High Returns: By processing data in real-time, organizations are able to
make timely and informed decisions, which can lead to increased
efficiency, improved customer experiences, and even cost savings. For
example, in the financial industry, data streaming can be used to detect
fraudulent transactions in real-time, which can prevent losses and protect
customer information. In retail, data streaming can be used to track
inventory in real-time, which can help businesses to optimize their supply
chain and reduce costs.
 Lesser Infrastructure Cost: In traditional data processing, large
amounts of data are typically collected and stored in data warehouses,
which can be costly in terms of storage and hardware expenses. However,
with stream processing, data is processed in real-time as it is generated,
which eliminates the need to store large volumes of data. This can greatly
reduce the cost of storage and hardware, as organizations don’t need to
maintain large data warehouses.

Simplify Data Streaming with Hevo’s No-code Data Pipeline


Hevo is a No-code Data Pipeline that offers a fully managed, fully automated solution to set up data integration from 150+ data sources, including 40+ free sources, and will let you directly load data to your
data warehouse. It will automate your data flow in minutes without writing any
line of code. Its fault-tolerant architecture makes sure that your data is secure
and consistent. Hevo provides you with a truly efficient and fully-automated
solution to manage data in real-time and always have analysis-ready data.

1.7. The architectural blueprint:


With an understanding of real-time and streaming systems we can now turn our
attention to the architectural blueprint. Throughout our journey we are going to
follow an architectural blueprint that will enable us to talk about all streaming
systems in a generic way—our pattern language. Figure 1.5 depicts the
architecture.

Figure 1.5. The streaming data architectural blueprint

Although our architecture calls out the different tiers, remember these tiers are
not hard and rigid, as you may have seen in other architectures. We will call
them tiers, but we will use them more like LEGO pieces, allowing us to design
the correct solution for the problem at hand. Our tiers don’t prescribe a
deployment scenario. they are in many cases distributed across different
physical locations.

Let’s take our examples and see how Twitter’s service maps to our architecture:

 Collection tier— When a user posts a tweet, it is collected by the Twitter


services.
 Message queuing tier— Undoubtedly, Twitter runs data centers in
locations across the globe, and conceivably the collection of a tweet
doesn’t happen in the same location as the analysis of the tweet.
 Analysis tier— Although I’m sure a lot of processing is done to those 140
characters, suffice it to say, at a minimum for our examples, Twitter
needs to identify the followers of a tweet.
 Long-term storage tier— Even though we’re not going to discuss this
optional tier in depth in this book, you can probably guess that tweets
going back in time imply that they’re stored in a persistent data store.
 In-memory data store tier— The tweets that are mere seconds old are
most likely held in an in-memory data store.
 Data access— All Twitter clients need to be connected to Twitter to
access the service.

Try the exercise of decomposing the other two examples and see how they fit our streaming architecture:

 FlightAware— A streaming system that processes the most recent flight


status data and allows a client to request the latest data for particular
airports or flights.
 NASDAQ Real Time Quotes— A streaming system that processes the
price quotes of all stocks and allows clients to request the latest quote for
particular stocks.

1.8. Security for streaming systems:


Security is important in many cases, but it can be overlaid on this architecture
naturally. Figure 1.6 shows how security can be applied to this architecture.

Figure 1.6. The architectural blueprint with security identified

How do we scale?

From a high level, there are two common ways of scaling a service: vertically
and horizontally.
Vertical scaling lets you increase the capacity of your existing hardware
(physical or virtual) or software by adding resources. A restaurant is a good
example of the limitations of vertical scaling. When you enter a restaurant, you
may see a sign that tells you the maximum occupancy. As more patrons come
in, more tables may be set up and more chairs added to accommodate the crowd
—this is scaling vertically. But when the maximum capacity is reached, you
can’t seat any more customers. In the end, the capacity is limited by the size of
the restaurant. In the computing world, adding more memory, CPUs, or hard
drives to your server are examples of vertical scaling. But as with the restaurant,
you’re limited by the maximum capacity of the system, physical or virtual.

Horizontal scaling approaches the problem from a different angle. Instead of


continuing to add resources to a server, you add servers. A highway is a good
example of horizontal scaling. Imagine a two-lane highway that was originally
constructed to handle 2,000 vehicles an hour. Over time more homes and
commercial buildings are built along the highway, resulting in a load of 8,000
vehicles per hour. As you might imagine (and perhaps have experienced), the
results are terrible traffic jams during rush hour and overall unpleasant
commutes. To alleviate these issues, more lanes are added to the highway—now
it is horizontally scaled and can handle the traffic. But it would be even more
efficient if it could expand (add lanes) and contract (remove lanes) based on
traffic demands. At an airport security checkpoint, when there are few travelers
TSA closes down screening lines, and when the volume increases they open
lines up. If you’re hosting your service with one of the major cloud providers
(Google, AWS, Microsoft Azure), you may be able to take advantage of this
elasticity—a feature they often call auto-scaling. The basic idea is that as
demand for your service increases, servers are automatically added, and as
demand decreases, servers are removed.

In modern-day system design, our goal is to have horizontal scaling—but that


doesn’t mean that we won’t use vertical scaling too. Vertical scaling is often employed to determine an ideal resource configuration for a service, and then the service is scaled out. When the topic of scaling comes up, the focus will be on horizontal, not vertical scaling.

Figure 1.7. Architectural blueprint with emphasis on the first tier


We’re going to take on the tiers one at a time, starting from the left with the
collection tier. Don’t let the lack of emphasis on the message queuing tier
in figure 1.7 bother you—in certain cases where it serves a collection role, I’ll
talk about it and clear up any confusion. Now, on to our first tier, the collection
tier—our entry point for bringing data into our streaming, in-the-moment
system.
