Chapter 1-1
TOPICS:
What is Streaming?
The term "streaming" is used to describe continuous, never-ending data streams
with no beginning or end, that provide a constant feed of data that can be
utilized/acted upon without needing to be downloaded first.
Similarly, data streams are generated by all types of sources, in various formats
and volumes. From applications, networking devices, and server log files, to
website activity, banking transactions, and location data, they can all be
aggregated to seamlessly gather real-time information and analytics from a
single source of truth.
Data Streaming is a technology that allows continuous transmission of data in
real time from a source to a destination. Rather than waiting for the complete
data set to be collected, you can receive and process data as soon as it is
generated. A continuous flow of data, i.e. a data stream, is made up of a series
of data elements ordered in time. The data in this stream denotes an event or
change in the business that is useful to know about and analyze in real time.
Data streaming is the continuous transfer of data from one or more sources at a
steady, high speed for processing into specific outputs.
Data Streaming
As an example, the video that you watch on YouTube reaches your mobile device
as a data stream of the video being played. As more and more devices connect
to the Internet, streaming data lets users access content immediately rather
than waiting for the whole entity to be downloaded.
With the advent of the Internet of Things (IoT), personal health monitoring and
home security systems have also seen great demand in the market. For instance,
multiple health sensors are available that continuously provide metrics such as
heart rate, blood pressure, or oxygen levels, allowing you to have a timely
analysis of your health. Similarly, home security sensors can detect and report
any unusual activity at your residence, or even save that data for identifying
harder-to-detect patterns later.
In previous years, legacy infrastructure was much more structured because it
only had a handful of sources that generated data. The entire system could be
architected in a way to specify and unify the data and data structures. With the
advent of stream processing systems, the way we process data has changed
significantly to keep up with modern requirements.
1. Storage: This layer should enable low cost, quick, replayable reads and
writes of large data streams by supporting strong consistency and record
ordering.
2. Processing: This layer consumes and runs computations on data from the
storage layer. It also directs the storage layer to remove data no longer
needed.
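To make the division of labor between these two layers concrete, here is a
minimal sketch in Python. It assumes a local Apache Kafka broker on
localhost:9092 and the kafka-python client, and the topic name sensor_readings
is invented for the example; the producer plays the role of the storage layer's
ordered, replayable log, and the consumer plays the role of the processing
layer.

import json
from kafka import KafkaProducer, KafkaConsumer

# Storage layer: append records to a durable, ordered, replayable log (a topic).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("sensor_readings", {"sensor_id": "s1", "temperature_c": 21.7})
producer.flush()

# Processing layer: consume the stream and run a computation over it.
consumer = KafkaConsumer(
    "sensor_readings",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",    # replayable: re-read the stream from the start
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    consumer_timeout_ms=5000,        # stop iterating when idle, for the demo only
)
max_temperature = float("-inf")
for record in consumer:
    max_temperature = max(max_temperature, record.value["temperature_c"])
    print("max temperature so far:", max_temperature)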
There’s a broader cloud architecture needed to use streaming data to its
fullest potential. Stream processing systems like Apache Kafka can consume,
store, enrich, and analyze data in motion. And a number of cloud service
companies offer the capability for you to build an “off-the-shelf” data stream.
However, these options may not meet your requirements or you may face
challenges working with your legacy databases or systems. The good news is
that there is a robust ecosystem of tools you can leverage, some of them open
source, to build your own “bespoke” data stream.
How to build your own data stream: here we describe how streaming data works
and the data streaming technologies involved in each of the four key steps to
building your own data stream.
1. Aggregate all your data sources: using a change data capture (CDC) streaming
tool, pull data from relational databases or transactional systems, which may be
located on-premises or in the cloud. You will then connect these sources to a
stream processor.
2. Build a stream processor: using a tool such as Apache Kafka or Amazon
Kinesis. The data will typically be processed sequentially and incrementally on
a record-by-record basis but it can also be processed over sliding time windows.
3. Query the data stream itself as it’s streaming: using KSQL (now ksqlDB),
a streaming SQL engine for Apache Kafka. KSQL provides an interactive
SQL interface for you to process data in real time in Kafka without writing
code. It supports stream processing operations such as joins, aggregations,
sessionization, and windowing.
Alternatively, store your streamed data: in this more traditional approach, you
store the message in a database or data warehouse and query it after you’ve
received and stored it. Most companies choose to keep all their data given that
the cost of storage is low. Leading options for storing streaming data include
Amazon S3, Amazon Redshift, and Google Cloud Storage.
4. Output for analysis, alerts, real-time applications, data science, and
machine learning or AutoML: once the streaming data has passed through the
query or store phase, it can be output for multiple use cases:
The best BI and analytics tools support data stream integration for a variety
of streaming analytics use cases such as powering interactive data
visualizations and dashboards which alert you and help you respond to
changes in KPIs and metrics. These real-time alerts are especially helpful in
detecting fraud.
Streaming data can also trigger events in an application and/or system such
as an automated trading system to process a stock trade based on predefined
rules.
Data scientists can apply algorithms in-stream instead of waiting for data to
reside in a database. This allows you to query and create visualizations of
real-time data.
Machine learning and AutoML (automated machine learning) models can
benefit from incremental-learning Python libraries such as Creme, which stream
over a dataset in sequential order and interleave prediction and learning steps.
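As a hedged illustration of that last point, the sketch below uses the Creme
package (which has since been merged into the River library, where fit_one is
renamed learn_one); the tiny in-line stream of labeled transactions is invented
purely for the example, and a real pipeline would pull these records from a
stream processor instead.

from creme import linear_model, metrics

# A made-up stream of (features, label) pairs, e.g. fraud flags on transactions.
event_stream = [
    ({"amount": 12.0, "foreign": 0}, 0),
    ({"amount": 950.0, "foreign": 1}, 1),
    ({"amount": 23.5, "foreign": 0}, 0),
    ({"amount": 780.0, "foreign": 1}, 1),
]

model = linear_model.LogisticRegression()
accuracy = metrics.Accuracy()

for features, label in event_stream:
    prediction = model.predict_one(features)  # predict before seeing the label
    accuracy.update(label, prediction)        # score the prediction
    model.fit_one(features, label)            # then learn from this one example
print(accuracy)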
In batch processing, new data elements are collected in a group and the entire
group is processed at some future time. In contrast, a streaming data architecture
or stream processor handles data in motion and an extract, load and transform
(ELT) batch is treated as an event in a continuous stream of events. Streams of
enterprise data are fed into data streaming software, which then routes the
streams into storage and processing, and produces outputs, such as reports and
analytics.
While traditional solutions are built to ingest, process, and structure data before
it can be acted upon, streaming data architecture adds the ability to consume,
persist to storage, enrich, and analyze data in motion.
1.3. Requirements:
As such, applications working with data streams will always require two main
functions: storage and processing. Storage must be able to record large streams
of data in a way that is sequential and consistent. Processing must be able to
interact with storage, consume, analyze and run computation on the data.
This also brings up additional challenges and considerations when working with
legacy databases or systems. Many platforms and tools are now available to
help companies build streaming data applications.
Common examples and sources of streaming data include:
Streaming media
Stock trading
Real-time analytics
Fraud detection
IT monitoring
Instant messaging
Geolocation
Inventory control
Social media feeds
Multiplayer video games
Ride-sharing
Weather data.
Data from local or remote sensors.
Transaction logs from financial systems.
Data from health monitoring devices.
Website activity logs.
Data comes in a steady, real-time stream, often with no beginning or end. Data
may be acted upon immediately, or later, depending on user requirements.
Streams are time stamped because they're often time-sensitive and lose value
over time. The streamed data is also often unique and not likely repeatable; it
originates from various sources and might have different formats and structures.
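For instance, a single element of a website-activity stream might look like the
illustrative record below; the field names are invented for the example, and it
is the timestamp that lets consumers order, window, and eventually age out the
data.

# One illustrative, timestamped clickstream event (field names are made up).
event = {
    "event_time": "2023-04-05T14:21:07.532Z",  # when the event occurred
    "source": "web",                            # which system produced it
    "user_id": "u-48213",
    "action": "add_to_cart",
    "item_id": "sku-9917",
    "price_usd": 34.99,
}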
1.2. Examples:
Some real-life examples of streaming data include use cases in every industry,
including real-time stock trades, up-to-the-minute retail inventory management,
social media feeds, multiplayer games, and ride-sharing apps.
For example, when a passenger calls Lyft, real-time streams of data join
together to create a seamless user experience. Through this data, the application
pieces together real-time location tracking, traffic stats, pricing, and real-time
traffic data to simultaneously match the rider with the best possible driver,
calculate pricing, and estimate time to destination based on both real-time and
historical data.
Data streams capture critical real-time data, such as location, stock prices, IT
system monitoring, fraud detection, retail inventory, sales, and customer
activity.
The following companies use some of these data types to power their business
activity.
1. Lyft:
Lyft requires real-time data to match riders with drivers accurately, displaying
current vehicle availability and prices based on distance, demand, and traffic
conditions. This data needs to be instantly available to set accurate user
expectations.
After the rider selects a service level, Lyft uses additional GPS and traffic data
to match the best driver to the rider based on vehicle availability, distance,
driver status, and expected time of arrival.
Lyft uses location data from the driver's phone to track their progress, match
them with other ride requests, and provide real-time updates on traffic
conditions. They have optimized their processors to handle and aggregate these
data streams for an enhanced customer experience.
2. YouTube:
YouTube processes and stores a massive amount of data every hour due to the
more than 500 hours of video uploaded every minute, according to Statista.
YouTube must ensure high availability to support creators' content and provide
real-time data to viewers, including view counts, comments, subscribers, and
other metrics. YouTube supports live videos with real-time interaction between
content creators and viewers, requiring critical instant data transfer for
uninterrupted conversations.
Streaming data is the first step for any data-driven organization, fueling big data
ingestion, integration, and real-time analytics. It does, however, come with
challenges:
Data overload. With so much data being processed in real time, it can be
difficult to identify the most relevant information. This can lead to
businesses becoming overwhelmed by the data volume and unable to make
meaningful decisions.
Cost. Data streaming can be expensive, particularly if businesses must invest
in new hardware and software to support it.
Data loss or corruption. With traditional data processing methods,
businesses may be able to recover lost data from backups or other sources.
However, with data streaming, there's a risk that data may be lost or
corrupted in real time, making it impossible to recover.
Overhead. Data streaming requires storage and processing elements, such as
a data warehouse or data lake, to prepare data for later use. The added
overhead associated with data streaming must be analyzed in terms of its
return on investment.
1.4. Stream Processing Foundations:
What Is Stream Processing?
Stream processing is a technology that reads and processes a data stream
continually from input sources, writes the results to an output stream, and can
use multiple threads to enable parallelism. Stream processing can therefore
support many applications that require real-time data analysis and decision-
making, such as generating reports or triggering responses with minimal latency.
Some tasks that this complex event processing method is commonly used for
are loan risk analysis, anti-fraud detection, sensor data monitoring, and targeted
marketing.
A stream processing framework will ingest streaming data from input sources,
analyze it, and write the results to output streams. The processor will typically
have the following four main components:
Input sources – where data is read from (examples include Kafka, Flume,
Social Media, and IoT Sensors).
Output streams – where the processed data is written to (e.g., HDFS,
Cassandra, and HBase).
Processing logic – defines how the data is processed (this can be done with
Java, Scala, or Python code).
State management – allows the processor to keep track of its progress and
maintain state information, which can be further used for exactly-once
processing (i.e., when the same output is generated regardless of how many
times the input stream is read).
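To make those four components concrete, here is a small, framework-agnostic
sketch in plain Python. The input source and output stream are ordinary
in-memory collections standing in for real systems such as Kafka topics, and
the state is a dictionary keyed by a tumbling one-minute window; all names and
the window size are illustrative assumptions, not any particular engine's API.

from collections import defaultdict

def process(input_source, output_stream, window_seconds=60):
    """Processing logic: count events per user within tumbling time windows."""
    state = defaultdict(int)            # state management: counts per (window, user)
    for event in input_source:          # input source: an iterable of events
        window = event["timestamp"] // window_seconds
        key = (window, event["user_id"])
        state[key] += 1
        output_stream.append(           # output stream: where results are written
            {"window": window, "user_id": event["user_id"], "count": state[key]}
        )
    return state

# A tiny in-memory stream standing in for a real input source.
events = [
    {"timestamp": 10, "user_id": "a"},
    {"timestamp": 20, "user_id": "a"},
    {"timestamp": 70, "user_id": "a"},  # falls into the next one-minute window
]
results = []
process(events, results)
print(results[-1])                      # {'window': 1, 'user_id': 'a', 'count': 1}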
The stream processing engine organizes data from the input source into short
batches and presents them as continuous data streams output to other
applications, simplifying the logic for developers who (re)combine data from
different sources and time scales – which are all relative when it comes to real-
time analysis.
The processing logic component is where most of the work is done, simplifying
the necessary tasks in data management for consistently and securely ingesting,
processing, and publishing data. This stage is where you define the
transformations that are applied to the data as it is consumed from a publish-
subscribe service before it is published back there or to other data storage.
For example, if the stream processor crashes, it can be restarted from the last
checkpoint and will then pick up where it left off. Likewise, if the input stream
is replayed, the output stream will be generated correctly, even though the data
has already been processed once.
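One way to picture that checkpoint-and-replay behavior is with Kafka consumer
offsets, sketched below with the kafka-python client and a local broker; the
topic name, consumer group, and the score/write_output helpers are all invented
for the example. Results are written out before the offset (the checkpoint) is
committed, so a restart replays any uncommitted records rather than losing them.

import json
from kafka import KafkaConsumer

def score(payment):
    # Placeholder processing logic: flag unusually large payments.
    return {"id": payment.get("id"), "flagged": payment.get("amount", 0) > 1000}

def write_output(result):
    # Placeholder sink; a real system would write to a topic, table, or file.
    print(result)

consumer = KafkaConsumer(
    "payments",                          # hypothetical input topic
    group_id="fraud-scoring",            # the committed offset acts as the checkpoint
    bootstrap_servers="localhost:9092",
    enable_auto_commit=False,            # commit only after the work is safely done
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for record in consumer:
    write_output(score(record.value))    # persist the result first...
    consumer.commit()                    # ...then advance the checkpoint
# If the process crashes before commit(), a restarted consumer replays the
# uncommitted records; with idempotent outputs this gives effectively
# exactly-once results.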
Stream Processing:
Streaming the data is only half the battle. You also need to process that data to
derive insights.
Stream processing software is configured to ingest the continual data flow down
the pipeline and analyze that data for patterns and trends. Stream processing
may also include data visualization for dashboards and other interfaces so that
data personnel may also monitor these streams.
Data streams and stream processing are combined to produce real-time or near
real-time insights. To accomplish this, stream processors need to offer low
latency so that analysis happens as quickly as data is received. A drop in
performance by the stream processor can lead to a backlog or data points being
missed, threatening data integrity.
Stream processing software needs to scale and be highly available. It should
handle spikes in traffic and have redundancies to prevent software crashes.
Crashes reduce your data quality since the stream is not analyzed for however
long the outage persists.
Traditional data pipelines extract, transform, and load data before it can be acted
upon. But given the wide variety of sources and the scale and velocity by which
the data is generated today, traditional data pipelines are not able to keep up for
near real-time or real-time processing.
If your organization deals with big data and produces a steady flow of real-time
data, a robust streaming data process will allow you to respond to situations
faster. Ultimately, this can help you:
Increase your customer satisfaction
Reduce Fraud and Other Losses: Being able to monitor every aspect of your
business in real-time keeps you aware of issues which can quickly result in
significant losses, such as fraud, security breaches, inventory outages, and
production issues. Real-time data streaming lets you respond quickly to, and
even prevent, these issues before they escalate.
Data streaming provides real-time insight by leveraging the latest internal and
external information to inform decision-making in day-to-day operations and
overall strategy.
Increase ROI
Reduce Losses
Data streaming not only supports customer retention but also prevents losses by
providing real-time intelligence on potential issues such as system outages,
financial downturns, and data breaches. This allows companies to proactively
mitigate the impact of these events.
There are various challenges that have to be considered while dealing with Data
Streams:
Unless the Data Stream is delivered in real-time, most of its benefits may not be
realized. With a variety of devices located at variable distances and generating
different volumes of data, network bandwidth must be sufficient to deliver this
data to its consumers.
Since data from the Data Stream is arriving continuously, a computer system
must have enough memory to store it and ensure that any part of the data is not
lost before it’s processed. Also, computer programs that process this data need
CPUs with more processing power as newer data may need to be interpreted in
the context of older data and it must be processed quickly before the next set of
data arrives.
Generally, each data packet received includes information about its source and
time of generation and must be processed sequentially. In an e-commerce setting,
for example, the processing should be powerful enough to show upsells and
suggestions in real time, based on users’ choices, browsing history, and current
activity.
Handling data coming from various sources at varying speeds, having diverse
semantic meanings and interpretations, coupled with multifarious processing
needs is not an easy task.
4) Scalability
Consider the case of a tourist spot and its related footfall and ticketing data.
During peak hours, and at random times during a given week, footfall would
increase sharply for a few hours, leading to a big increase in the volume of
data being generated. Likewise, when a server goes down, the log data being
generated increases manifold to include problems, cascading effects, events,
symptoms, and so on.
5) Contextual Ordering
This is another issue that Streaming Data presents which is the need to keep
data packets in contextual order or logical sequences.
For example, during an online conference, it’s important that messages are
delivered in a sequence of occurrences, to keep the chat in context. If a
conversation is not in sequence, it will not make any sense.
As more and more processes are digitized and devices connect to the internet,
the diversity and volume of the Data Stream keep increasing. This means that
the programs that handle it have to be updated frequently to handle different
kinds of data.
Building applications that can handle and process Streaming Data in real time
is challenging, given the many factors stated above. Hence, businesses can use
tools like Hevo that help stream data to the desired destination in real time.
Data streaming opens a world of possibilities, but it also comes with challenges
to keep in mind as you incorporate real-time data into your applications.
1. Availability:
Data needs to be accessed and logged in a datastore for historical context. If you
can't view previous subscription periods, you may miss opportunities to offer
valuable products or services based on a customer's purchase history.
2. Timeliness:
Data streams must be constantly updated to avoid stale information and ensure
that the user's actions in one tab are reflected across all tabs.
3. Scalability:
To avoid data loss during spikes in volume or system outages, it's crucial to
build failsafes into your system and provision extra computing and storage
resources.
4. Ordering:
Events must reach consumers in the sequence in which they occurred; as with
the conference chat example above, out-of-order data quickly loses its meaning.
You will find event streaming being used in a broad selection of businesses,
such as media streaming, omnichannel retail experiences, ride-sharing, and more.
For example, when a passenger calls Lyft, not only does the application know
which driver to match them to, but it also knows how long it will take based on
real-time location and historical traffic data. It can also determine how much it
should cost based on both real-time and past data.
Data analysis
IoT applications
Internet of Things (IoT) devices are another use case for streaming data.
Sensors in vehicles, industrial equipment, and farm machinery send data to a
streaming application. The application monitors performance, detects potential
defects in advance, and automatically places a spare part order, preventing
equipment downtime.
Financial analysis
Financial institutions use stream data to track real-time changes in the stock
market, compute value at risk, and automatically rebalance portfolios based on
stock price movements. Another financial use case is fraud detection of credit
card transactions using real-time inferencing against streaming transaction data.
Real-time recommendations
Real estate applications track geolocation data from consumers’ mobile devices
and make real-time recommendations of properties to visit. Similarly,
advertising, food, retail, and consumer applications can integrate real-time
recommendations to give more value to customers.
Service guarantees
You can implement data stream processing to track and maintain service levels
in applications and equipment. For example, a solar power company has to
maintain power throughput for its customers or pay penalties. It implements a
streaming data application that monitors all panels in the field and schedules
service in real time. Thus, it can minimize each panel's periods of low
throughput and the associated penalty payouts.
Risk control
Live streaming and social platforms capture user behavior data in real time for
risk control over users' financial activity, such as recharge, refund, and rewards.
They view real-time dashboards to flexibly adjust risk strategies.
Real-time systems are classified as hard, soft, and near real-time. The
definitions for hard and soft real-time are based on Hermann Kopetz’s book
Real-Time Systems (Springer, 2011). For near real-time we use the definition
found in the Portland Pattern Repository’s wiki
(http://c2.com/cgi/wiki?NearRealTime): “Denoting or relating to a
data-processing system that is slightly slower than real-time.” Table 1.1 breaks
out the common classifications of real-time systems along with the prominent
characteristics by which they differ.
You can identify hard real-time systems fairly easily. They are almost always
found in embedded systems and have very strict time requirements that, if
missed, may result in total system failure. The design and implementation of
hard real-time systems are well studied in the literature.
Someone you are following on Twitter posts a tweet, and moments later
you see the tweet in your Twitter client.
You are tracking flights around New York using the real-time Live Flight
Tracking service from FlightAware
(http://flightaware.com/live/airport/KJFK).
You are using the NASDAQ Real Time Quotes application
(www.nasdaq.com/quotes/real-time.aspx) to track your favorite stocks.
Although these systems are all quite different, figure 1.1 shows what they have
in common.
Figure 1.1. A generic real-time system with consumers
In each of the examples, is it reasonable to conclude that the time delay may
only last for seconds, no life is at risk, and an occasional delay for minutes
would not cause total system failure? If someone posts a tweet, and you see it
almost immediately, is that soft or near real-time? What about watching live
flight status or real-time stock quotes? Some of these can go either way: what if
there were a delay in the data due to slow Wi-Fi at the coffee shop or on the
plane? As you consider these examples, the line differentiating soft and near
real-time becomes blurry, at times disappears, is very subjective, and may often
depend on the consumer of the data.
Now let’s change our examples by taking the consumer out of the picture and
focusing on the services at hand.
We don’t know how these systems work internally, but the essence of what we
are asking is common to all of them. It can be stated as follows:
Is the process of receiving data all the way to the point where it is ready for
consumption a soft or near real-time process? Graphically, this looks like
figure 1.2.
Does focusing on the data processing and taking the consumers of the data out
of the picture change your answer? For example, how would you classify those
same three services?
If you classified them differently, why? Was it due to the lag or perceived lag in
seeing the tweet in your Twitter client? After a while, the line between whether
a system is soft or near real-time becomes quite blurry. Often people settle on
calling them real-time.
On the left-hand side of figure 1.3 we have the non-hard real-time service, or
the computation part of the system, and on the right-hand side we have the
clients, called the consumption side of the system.
The concept of streaming data eliminates the confusion of soft versus near and
real-time versus not real-time, allowing us to concentrate on designing systems
that deliver the information a client requests at the moment it is needed. From
the standpoint of streaming, each of our examples can be split into a streaming
data service and a streaming client:
Someone you are following on Twitter posts a tweet, and moments later
you see the tweet in your Twitter client.
You are tracking flights around New York using the real-time Live Flight
Tracking service from FlightAware.
You are using the NASDAQ Real Time Quotes application to track your
favorite stocks.
Focus on what data a service makes available to clients, and how it makes that
data available at the moment they need it. Such a system is an in-the-moment
system: any system that delivers the data at the point in time when it is needed.
We don’t know how these systems work behind the scenes, but we are going to
learn to assemble systems that use open source technologies to consume, process,
and present data streams.
Batch vs Stream Data Processing
Batch processing collects data elements into a group and processes the entire
group at a later time. Stream processing, on the other hand, is a technique for
processing data in real time as it arrives. Stream processing is designed to
handle continuous, high-volume data flows and is optimized for low resource
usage. The data is processed as it arrives, allowing for real-time analysis and
decision-making.
In summary, batch processing is best suited for the offline processing of large
volumes of data, while stream processing is designed for the real-time
processing of high-volume data flows.
Let’s look at the differences between batch and stream processing in a more
concise manner.
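As a toy illustration of that difference (the values and names below are
invented), batch processing waits for a complete, bounded collection and then
processes it in one pass, while stream processing updates its result
incrementally as each element arrives:

# Batch: the whole group has already been collected, then it is processed at once.
values = [3, 8, 2, 5]
print("batch total:", sum(values))

# Stream: a stand-in for an unbounded source (sensor, log tailer, message queue).
def value_stream():
    for v in [3, 8, 2, 5]:
        yield v

running_total = 0
for v in value_stream():
    running_total += v                  # the result is usable after every element
    print("running total so far:", running_total)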
Although our architecture calls out the different tiers, remember these tiers are
not hard and rigid, as you may have seen in other architectures. We will call
them tiers, but we will use them more like LEGO pieces, allowing us to design
the correct solution for the problem at hand. Our tiers don’t prescribe a
deployment scenario; they are in many cases distributed across different
physical locations.
Let’s take our examples and see how Twitter’s service maps to our architecture;
then try the exercise of decomposing the other two examples to see how they fit
our streaming architecture.
How do we scale?
From a high level, there are two common ways of scaling a service: vertically
and horizontally.
Vertical scaling lets you increase the capacity of your existing hardware
(physical or virtual) or software by adding resources. A restaurant is a good
example of the limitations of vertical scaling. When you enter a restaurant, you
may see a sign that tells you the maximum occupancy. As more patrons come
in, more tables may be set up and more chairs added to accommodate the crowd
—this is scaling vertically. But when the maximum capacity is reached, you
can’t seat any more customers. In the end, the capacity is limited by the size of
the restaurant. In the computing world, adding more memory, CPUs, or hard
drives to your server are examples of vertical scaling. But as with the restaurant,
you’re limited by the maximum capacity of the system, physical or virtual.