Streaming Data Solutions on AWS with Amazon Kinesis
AWS Whitepaper
First Published September 13, 2017
© 2021 Amazon Web Services, Inc. or its affiliates. All rights reserved.
Contents
Introduction
    Real-time and near-real-time application scenarios
    Difference between batch and stream processing
Stream processing challenges
Streaming data solutions: examples
    Scenario 1: Internet offering based on location
        Processing streams of data with AWS Lambda
        Summary
    Scenario 2: Near-real-time data for security teams
        Amazon Kinesis Data Firehose
        Summary
    Scenario 3: Preparing clickstream data for data insights processes
        AWS Glue and AWS Glue streaming
        Amazon DynamoDB
        Amazon SageMaker and Amazon SageMaker service endpoints
        Inferring data insights in real time
        Summary
    Scenario 4: Device sensors real-time anomaly detection and notifications
        Amazon Kinesis Data Analytics
        Summary
    Scenario 5: Real-time telemetry data monitoring with Apache Kafka
        Amazon Managed Streaming for Apache Kafka (Amazon MSK)
        Amazon EMR with Spark Streaming
        Summary
Conclusion
Contributors
Document versions
Abstract
Data engineers, data analysts, and big data developers are looking to process and
analyze their data in real time, so their companies can learn about what their
customers, applications, and products are doing right now, and react promptly. This
whitepaper describes how services such as Amazon Kinesis Data Streams, Amazon
Kinesis Data Firehose, Amazon EMR, Amazon Kinesis Data Analytics, Amazon
Managed Streaming for Apache Kafka (Amazon MSK), and other services can be used
to implement real-time applications, and provides common design patterns using these
services.
Introduction
Businesses today receive data at massive scale and speed due to the explosive growth
of data sources that continuously generate streams of data. Whether it is log data from
application servers, clickstream data from websites and mobile apps, or telemetry data
from Internet of Things (IoT) devices, it all contains information that can help you learn
about what your customers, applications, and products are doing right now.
The ability to process and analyze this data in real time is essential to do things
such as continuously monitor your applications to ensure high service uptime and
personalize promotional offers and product recommendations. Real-time and
near-real-time processing can also make other common use cases, such as website
analytics and machine learning, more accurate and actionable by making data available
to these applications in seconds or minutes instead of hours or days.
Common near-real-time use cases include analytics on data stores for data science and
machine learning (ML). You can use streaming data solutions to continuously load real-
time data into your data lakes. You can then update ML models more frequently as new
data becomes available, ensuring accuracy and reliability of the outputs. For example,
Zillow uses Kinesis Data Streams to collect public record data and multiple listing
service (MLS) listings, and then provide home buyers and sellers with the most up-to-
date home value estimates in near-real-time. ZipRecruiter uses Amazon MSK for their
event logging pipelines, which are critical infrastructure components that collect, store,
and continually process over six billion events per day from the ZipRecruiter
employment marketplace.
Stream processing challenges
Processing streaming data as it arrives presents several challenges:
• You have to build a system that can cost-effectively collect, prepare, and transmit
data coming simultaneously from thousands of data sources.
• You need to fine-tune the storage and compute resources so that data is batched
and transmitted efficiently for maximum throughput and low latency.
• You have to deploy and manage a fleet of servers to scale the system so it can
handle the varying speeds and volumes of data you send to it.
Version upgrades are also complex and costly. After you have built this platform, you
have to monitor the system and recover from any server or network failures by catching
up on data processing from the appropriate point in the stream, without creating
duplicate data. You also need a dedicated team for infrastructure management. All of
this takes valuable time and money and, at the end of the day, most companies just
never get there and must settle for the status quo and operate their business with
information that is hours or days old.
This whitepaper discusses in detail how AWS real-time data streaming services are
used to solve these problems.
When implementing a solution with Kinesis Data Streams, you create custom data-
processing applications known as Kinesis Data Streams applications. A typical Kinesis
Data Streams application reads data from a Kinesis stream as data records.
Data put into Kinesis Data Streams is replicated for high availability, scales
elastically, and is available to read within milliseconds. You can continuously add
various types of data, such as clickstreams, application logs, and social media, to a
Kinesis stream from hundreds of thousands of sources. Within seconds, the data will be
available for your Kinesis Data Streams applications to read and process from the
stream.
Amazon Kinesis Data Streams is a fully managed streaming data service. It manages
the infrastructure, storage, networking, and configuration needed to stream your data at
the scale of your data throughput.
There are multiple ways to send data to a Kinesis data stream:
• You can write code using one of the AWS SDKs, which are available for many
popular languages.
• You can use the Amazon Kinesis Agent, a tool for sending data to Kinesis Data
Streams.
The Amazon Kinesis Producer Library (KPL) simplifies producer application
development, enabling developers to achieve high write throughput to one or more
Kinesis data streams.
The KPL is an easy-to-use, highly configurable library that you install on your hosts. It
acts as an intermediary between your producer application code and the Kinesis Data
Streams API actions. For more information about the KPL and its ability to produce
events synchronously and asynchronously, with code examples, see Writing to your
Kinesis Data Stream Using the KPL.
There are two different operations in the Kinesis Data Streams API that add data to a
stream: PutRecords and PutRecord. The PutRecords operation sends multiple records to
your stream per HTTP request, while PutRecord submits one record per HTTP request.
To achieve higher throughput, most applications should use PutRecords.
For more information about these APIs, see Adding Data to a Stream. The details for
each API operation can be found in the Amazon Kinesis Streams API Reference.
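As an illustration, the following is a minimal producer sketch using the AWS SDK for Python (Boto3). The stream name and the record payloads are hypothetical, and production code would add retry logic for any failed records.

```python
import json

import boto3

kinesis = boto3.client("kinesis")

# Batch several records into a single PutRecords call for higher throughput.
records = [
    {
        # Data must be bytes; here each record is a small JSON payload.
        "Data": json.dumps({"user_id": i, "event": "location_update"}).encode("utf-8"),
        # The partition key determines which shard receives the record.
        "PartitionKey": f"user-{i}",
    }
    for i in range(100)
]

response = kinesis.put_records(StreamName="example-stream", Records=records)

# PutRecords is not atomic: inspect FailedRecordCount and retry failures.
if response["FailedRecordCount"] > 0:
    failed = [
        rec for rec, result in zip(records, response["Records"])
        if "ErrorCode" in result
    ]
    print(f"{len(failed)} records failed and should be retried")
```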
Consumer applications for Kinesis Data Streams can be developed using the Kinesis
Client Library (KCL), which helps you consume and process data from Kinesis Data
Streams. The KCL takes care of many of the complex tasks associated with distributed
computing, such as load balancing across multiple instances, responding to instance
failures, checkpointing processed records, and reacting to resharding. The KCL enables
you to focus on writing record-processing logic. For more information on how to build
your own KCL application, see Using the Kinesis Client Library.
You can subscribe Lambda functions to automatically read batches of records off your
Kinesis stream and process them. AWS Lambda periodically polls the stream (once per
second) for new records, and when it detects new records, it invokes the Lambda
function, passing the new records as parameters. The Lambda function runs only when
new records are detected. You can map a Lambda function to a shared-throughput
consumer (standard iterator), or to a dedicated-throughput consumer with enhanced
fan-out.
You can build a consumer that uses a feature called enhanced fan-out when you
require dedicated throughput that does not contend with other consumers that are
receiving data from the stream. This feature enables consumers to receive records
from a stream with throughput of up to 2 MB of data per second, per shard.
In most cases, you should use Kinesis Data Analytics, the KCL, AWS Glue, or AWS
Lambda to process data from a stream. However, if you prefer, you can create a
consumer application from scratch using the Kinesis Data Streams API. The Kinesis
Data Streams API provides the GetShardIterator and GetRecords operations to
retrieve data from a stream.
In this pull model, your code extracts data directly from the shards of the stream. For
more information about writing your own consumer application using the API, see
Developing Custom Consumers with Shared Throughput Using the AWS SDK for Java.
Details about the API can be found in the Amazon Kinesis Streams API Reference.
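To make the pull model concrete, the following is a minimal single-shard consumer sketch in Python with Boto3; the stream name is a placeholder, and a real application would iterate over all shards and handle resharding (which is what the KCL automates).

```python
import time

import boto3

kinesis = boto3.client("kinesis")

# Find a shard to read from (single-shard sketch; real apps enumerate all shards).
shard_id = kinesis.list_shards(StreamName="example-stream")["Shards"][0]["ShardId"]

# Get an iterator positioned at the oldest available record in the shard.
iterator = kinesis.get_shard_iterator(
    StreamName="example-stream",
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while iterator:
    result = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in result["Records"]:
        print(record["Data"])  # raw bytes, as written by the producer
    # Each response includes the iterator for the next batch.
    iterator = result.get("NextShardIterator")
    time.sleep(1)  # stay under the per-shard GetRecords limits
```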
AWS Lambda integrates natively with Amazon Kinesis Data Streams. The polling,
checkpointing, and error handling complexities are abstracted away when you use this
native integration, which allows the Lambda function code to focus on business logic.
By default, AWS Lambda invokes your function as soon as records are available in the
stream. To buffer records for batch scenarios, you can configure a batch window of up
to five minutes at the event source. If your function returns an error, Lambda retries the
batch until processing succeeds or the data expires.
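The following is a minimal sketch of such a Lambda handler. The Kinesis event source delivers record data base64-encoded; the enrichment step shown (attaching bandwidth options) is hypothetical and stands in for whatever business logic the function performs.

```python
import base64
import json

def handler(event, context):
    # Lambda passes a batch of stream records in the event payload.
    for record in event["Records"]:
        # Record data arrives base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Hypothetical enrichment step, e.g. look up bandwidth options.
        payload["bandwidth_options"] = ["100M", "300M", "1G"]
        print(json.dumps(payload))
    # Raising an exception here would cause Lambda to retry the whole batch.
```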
Summary
Company InternetProvider leveraged Amazon Kinesis Data Streams to stream user
details and location. The stream of records was consumed by AWS Lambda to enrich the
data with bandwidth options stored in the function’s library. After the enrichment, AWS
Lambda published the bandwidth options back to the application. Amazon Kinesis Data
Streams and AWS Lambda handled provisioning and management of servers, enabling
Company InternetProvider to focus more on business application development.
In an upcoming event, due to the high volume of attendees, ABC2Badge has been
requested by the event security team to gather data for the most concentrated areas of
the campus every 15 minutes. This will give the security team enough time to react and
disperse security personnel proportionally to concentrated areas. Given this new
requirement from the security team and the inexperience of building a streaming
Their current data warehouse solution is Amazon Redshift. While reviewing the features
of the Amazon Kinesis services, they recognized that Amazon Kinesis Data Firehose
can receive a stream of data records, batch the records based on buffer size and/or
time interval, and insert them into Amazon Redshift. They created a Kinesis Data
Firehose delivery stream and configured it so it would copy data to their Amazon
Redshift tables every five minutes. As part of this new solution, they used the Amazon
Kinesis Agent on their servers. Every five minutes, Kinesis Data Firehose loads data
into Amazon Redshift, where the business intelligence (BI) team can perform its
analysis and send the data to the security team every 15 minutes.
Amazon Kinesis Data Firehose can compress and encrypt the data before loading,
minimizing the amount of storage used at the
destination and increasing security. It can also transform the source data using AWS
Lambda and deliver the transformed data to destinations. You configure your data
producers to send data to Kinesis Data Firehose, which automatically delivers the data
to the destination that you specify.
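As a sketch of the producer side, the following sends a single record to a delivery stream using Boto3. The delivery stream name and payload are placeholders; the trailing newline is a common convention so that records land as separate lines in destinations such as Amazon S3 and Amazon Redshift.

```python
import json

import boto3

firehose = boto3.client("firehose")

# Send one record; Firehose buffers and batches records before
# delivering them to the configured destination.
payload = json.dumps({"badge_id": 42, "zone": "hall-b"}) + "\n"
firehose.put_record(
    DeliveryStreamName="example-delivery-stream",
    Record={"Data": payload.encode("utf-8")},
)
```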
The agent can be installed on Linux or Windows-based servers, such as web servers,
log servers, and database servers. After the agent is installed, you simply specify the
log files it will monitor and the delivery stream to send the data to. The agent then
durably and reliably sends new data to the delivery stream.
Kinesis Data Firehose also integrates with Kinesis Data Streams, CloudWatch Logs,
CloudWatch Events, Amazon Simple Notification Service (Amazon SNS), Amazon API
Gateway, and AWS IoT. You can scalably and reliably send your streams of data, logs,
events, and IoT data directly to a Kinesis Data Firehose destination.
Kinesis Data Firehose has built-in data format conversion capability. With this, you can
easily convert your streams of JSON data into Apache Parquet or Apache ORC file
formats.
Data delivery
As a near-real-time delivery stream, Kinesis Data Firehose buffers incoming data. After
your delivery stream’s buffering thresholds have been reached, your data is delivered to
the destination you’ve configured. There are some differences in how Kinesis Data
Firehose delivers data to each destination, which this paper reviews in the following
sections.
Amazon S3
Amazon S3 is object storage with a simple web service interface to store and retrieve
any amount of data from anywhere on the web. It’s designed to deliver 99.999999999%
durability, and scale past trillions of objects worldwide.
Data delivery to your S3 bucket might fail for various reasons. For example, the bucket
might not exist anymore, or the AWS Identity and Access Management (IAM) role that
Kinesis Data Firehose assumes might not have access to the bucket. Under these
conditions, Kinesis Data Firehose keeps retrying for up to 24 hours until the delivery
succeeds. The maximum data storage time of Kinesis Data Firehose is 24 hours. If data
delivery fails for more than 24 hours, your data is lost.
Amazon Redshift
Amazon Redshift is a fast, fully managed data warehouse that makes it simple and
cost-effective to analyze all your data using standard SQL and your existing BI tools. It
allows you to run complex analytic queries against petabytes of structured data using
sophisticated query optimization, columnar storage on high-performance local disks,
and massively parallel query execution.
For the Amazon ES destination, you can specify a retry duration (0–7200 seconds)
when creating a delivery stream. Kinesis Data Firehose retries for the specified time
duration, and then skips that particular index request. The skipped documents are
delivered to your S3 bucket in the elasticsearch_failed/ folder, which you can use
for manual backfill.
Amazon Kinesis Data Firehose can rotate your Amazon ES index based on a time
duration, depending on the rotation option you choose (NoRotation, OneHour,
OneDay, OneWeek, or OneMonth).
For data delivery frequency, each service provider has a recommended buffer size.
Work with your service provider for more information on their recommended buffer size.
For data delivery failure handling, Kinesis Data Firehose first establishes a connection
with the HTTP endpoint and waits for a response from the destination. It continues
attempting to establish a connection until the retry duration expires. After that, Kinesis
Data Firehose considers it a data delivery failure and backs up the data to your S3
bucket.
Summary
Kinesis Data Firehose can persistently deliver your streaming data to a supported
destination. It’s a fully managed solution, requiring little or no development. For
Company ABC2Badge, using Kinesis Data Firehose was a natural choice. They were
already using Amazon Redshift as their data warehouse solution. Because their data
sources continuously wrote to transaction logs, they were able to leverage the Amazon
Kinesis Agent to stream that data without writing any additional code. Now that
Company ABC2Badge has created a stream of sensor records and is receiving these
records via Kinesis Data Firehose, it can use this as the basis for the security team
use case.
Fast Sneakers does not want to introduce additional overhead into the project with new
infrastructure to maintain. They want to be able to split the development among the
appropriate parties, where the data engineers can focus on data transformation and
their data scientists can work on their ML functionality independently.
To react quickly and automatically adjust prices according to demand, Fast Sneakers
streams significant events (like click-interest and purchasing data), transforming and
augmenting the event data and feeding it to an ML model. Their ML model is able to
determine if a price adjustment is required. This allows Fast Sneakers to automatically
modify their pricing to maximize profit on their products.
This architecture diagram shows the real-time streaming solution Fast Sneakers created
utilizing Kinesis Data Streams, AWS Glue, and DynamoDB Streams. By taking
advantage of these services, they have a solution that is elastic and reliable without
needing to spend time on setting up and maintaining the supporting infrastructure. They
can spend their time on what brings value to their company by focusing on a streaming
extract, transform, load (ETL) job and their machine learning model.
To better understand the architecture and technologies that are used in their workload,
the following are some details of the services used.
Utilizing AWS Glue, you can create a consumer application with an AWS Glue
streaming ETL job. This enables you to use Apache Spark and other Spark-based
modules to consume and process your event data. The next section of this document
goes into more depth about this scenario.
To work with Amazon Kinesis Data Streams in AWS Glue streaming ETL jobs, it is a
best practice to define your stream in a table in an AWS Glue Data Catalog database.
You define a stream-sourced table with the Kinesis stream and one of the many
supported formats (CSV, JSON, ORC, Parquet, Avro, or a custom format with Grok).
You can manually enter a schema, or you can leave this step to your AWS Glue job to
determine during the job's runtime.
DynamicFrames are distributed tables that support nested data such as structures and
arrays. Each record is self-describing, designed for schema flexibility with semi-
structured data. A record in a DynamicFrame contains both data and the schema
describing the data. Both Apache Spark DataFrames and DynamicFrames are
supported in your ETL scripts, and you can convert them back and forth.
DynamicFrames provide a set of advanced transformations for data cleaning and ETL.
By using Spark Streaming in your AWS Glue job, you can create streaming ETL jobs
that run continuously, and consume data from streaming sources like Amazon Kinesis
Data Streams, Apache Kafka, and Amazon MSK. The jobs can clean, merge, and
transform the data, then load the results into stores including Amazon S3, Amazon
DynamoDB, or JDBC data stores.
AWS Glue processes and writes out data in 100-second windows, by default. This
allows data to be processed efficiently, and permits aggregations to be performed on
data arriving later than expected. You can configure the window size to balance
response speed against the accuracy of your aggregations. AWS Glue streaming jobs
use checkpoints to track the data that has been read from the Kinesis data stream. For
a walkthrough on creating a streaming ETL job in AWS Glue, see Adding Streaming
ETL Jobs in AWS Glue.
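The following is a condensed sketch of such a job, assuming a Data Catalog table (here called clickstream_table in database example_db) already points at the Kinesis stream; the database, table, and S3 path are illustrative.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the stream through the Data Catalog table that defines it.
data_frame = glue_context.create_data_frame.from_catalog(
    database="example_db",
    table_name="clickstream_table",
    transformation_ctx="datasource0",
    additional_options={"startingPosition": "TRIM_HORIZON", "inferSchema": "true"},
)

def process_batch(frame, batch_id):
    # Transform each micro-batch and write it out, e.g. to Amazon S3.
    if frame.count() > 0:
        frame.write.mode("append").parquet("s3://example-bucket/clickstream/")

# Process in 100-second windows; the checkpoint tracks the stream position.
glue_context.forEachBatch(
    frame=data_frame,
    batch_function=process_batch,
    options={
        "windowSize": "100 seconds",
        "checkpointLocation": args["TempDir"] + "/checkpoint/",
    },
)
job.commit()
```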
Amazon DynamoDB
Amazon DynamoDB is a key-value and document database that delivers single-digit
millisecond performance at any scale. It's a fully managed, multi-Region, multi-active,
durable database with built-in security, backup and restore, and in-memory caching for
internet-scale applications. DynamoDB can handle more than ten trillion requests per
day, and can support peaks of more than 20 million requests per second.
DynamoDB Streams is integrated with AWS Lambda so that you can create
triggers—pieces of code that automatically
respond to events in DynamoDB streams. With triggers, you can build applications that
react to data modifications in DynamoDB tables.
When a stream is enabled on a table, you can associate the stream Amazon Resource
Name (ARN) with a Lambda function that you write. Immediately after an item in the
table is modified, a new record appears in the table's stream. AWS Lambda polls the
stream and invokes your Lambda function synchronously when it detects new stream
records.
By utilizing the AWS SDK, you can invoke a SageMaker endpoint, passing content-type
information along with the content, and then receive real-time predictions based on the
data passed. This enables you to keep the design and development of your ML models
separate from the code that acts on the inferred results.
It allows your data scientists to focus on ML, and the developers who use the ML
model to focus on how they use it in their code. For more information on how to
invoke an endpoint in SageMaker, see InvokeEndpoint in the Amazon SageMaker API
Reference.
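For example, a minimal invocation sketch with Boto3 might look like the following; the endpoint name and the CSV feature payload are hypothetical and depend entirely on how the model was trained and deployed.

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

# Hypothetical endpoint that scores a product's features and returns
# a recommended price adjustment.
response = runtime.invoke_endpoint(
    EndpointName="example-pricing-endpoint",
    ContentType="text/csv",
    Body="42,129.99,0.87,311\n",  # product features as a CSV row
)

prediction = response["Body"].read().decode("utf-8")
print(prediction)
```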
By utilizing Apache Spark, Spark Streaming, and DynamicFrames in their AWS Glue
streaming ETL job, Fast Sneakers is able to extract data from either data stream and
transform it, merging data from the product and order tables. The transformed,
hydrated records are then written to a DynamoDB table, from which inference results
are obtained.
The DynamoDB Stream for the table triggers a Lambda function for each new record
written. The Lambda function submits the previously transformed records to a
SageMaker Endpoint with the AWS SDK to infer what, if any, price adjustments are
necessary for a product. If the ML model identifies an adjustment to the price is
required, the Lambda function writes the price change to the product in the catalog
DynamoDB table.
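The following is a sketch of this Lambda function under those assumptions. The table, endpoint, and attribute names are hypothetical; DynamoDB Streams delivers items in DynamoDB's attribute-value JSON format, which the handler unpacks before calling the endpoint.

```python
from decimal import Decimal

import boto3

runtime = boto3.client("sagemaker-runtime")
table = boto3.resource("dynamodb").Table("product-catalog")  # hypothetical table

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        # Stream records arrive in attribute-value format, e.g.
        # {"product_id": {"S": "sneaker-123"}, "clicks": {"N": "42"}}.
        image = record["dynamodb"]["NewImage"]
        product_id = image["product_id"]["S"]
        features = f'{image["clicks"]["N"]},{image["orders"]["N"]}\n'

        # Ask the (hypothetical) pricing endpoint for an adjusted price.
        response = runtime.invoke_endpoint(
            EndpointName="example-pricing-endpoint",
            ContentType="text/csv",
            Body=features,
        )
        new_price = float(response["Body"].read())

        # Write the adjusted price back to the catalog table if needed.
        if new_price > 0:
            table.update_item(
                Key={"product_id": product_id},
                UpdateExpression="SET price = :p",
                ExpressionAttributeValues={":p": Decimal(str(new_price))},
            )
```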
Summary
Amazon Kinesis Data Streams makes it easy to collect, process, and analyze real-time,
streaming data so you can get timely insights and react quickly to new information.
Combined with the AWS Glue serverless data integration service, you can create real-
time event streaming applications that prepare and combine data for ML.
Because both Kinesis Data Streams and AWS Glue services are fully managed, AWS
takes away the undifferentiated heavy lifting of managing infrastructure for your big data
platform, letting you focus on generating data insights based on your data.
Fast Sneakers can utilize real-time event processing and ML to enable their website to
make fully automated, real-time price adjustments that maximize profit on their product
stock. This brings the most value to their business while avoiding the need to create
and maintain a big data platform.
To detect such conditions and generate alerts in real time, ABC4Logistics implemented
the following architecture on AWS.
Data from device sensors is ingested by AWS IoT Gateway, where the AWS IoT rules
engine will make the streaming data available in Amazon Kinesis Data Streams. Using
Amazon Kinesis Data Analytics, ABC4Logistics can perform real-time analytics on
streaming data in Kinesis Data Streams.
Using Kinesis Data Analytics, ABC4Logistics can detect whether temperature readings
from the sensors deviate from normal readings over a period of ten seconds, and write
the anomalous records to another Kinesis data stream. Amazon Kinesis Data Streams
then invokes AWS Lambda functions, which can send alerts to the driver and the fleet
monitoring team through Amazon SNS.
Data in Kinesis Data Streams is also pushed down to Amazon Kinesis Data Firehose.
Amazon Kinesis Data Firehose persists this data in Amazon S3, allowing ABC4Logistics
to perform batch or near-real-time analytics on sensor data. ABC4Logistics uses
Amazon Athena to query data in S3, and Amazon QuickSight for visualizations. For
long-term data retention, S3 Lifecycle policies are used to archive data to Amazon S3
Glacier.
With Amazon Kinesis Data Analytics, you can interactively query streaming data using
multiple options, including standard SQL, Apache Flink applications in Java, Python,
and Scala, and Apache Beam applications built in Java. These options give you the
flexibility of choosing a specific approach depending on the complexity of the
streaming application and its source/target support. The following section discusses
the Kinesis Data Analytics for Apache Flink option.
With Amazon Kinesis Data Analytics for Apache Flink, you can author and run code
against streaming sources to perform time series analytics, feed real-time dashboards,
and create real-time metrics without managing the complex distributed Apache Flink
environment. You can use the high-level Flink programming features in the same way
that you use them when hosting the Flink infrastructure yourself.
Kinesis Data Analytics for Apache Flink enables you to create applications in Java,
Scala, Python, or SQL to process and analyze streaming data. A typical Flink
application reads data from an input stream or data location (the source), transforms,
filters, or joins the data using operators or functions, and stores the data on an output
stream or data location (the sink).
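To illustrate this source-transform-sink shape, the following is a condensed PyFlink Table API sketch that reads sensor readings from one Kinesis data stream and writes 10-second anomaly aggregates to another. The stream names, field names, and the 90-degree threshold are illustrative, and the Flink Kinesis connector must be packaged with the application.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

table_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: sensor readings arriving on a Kinesis data stream.
table_env.execute_sql("""
    CREATE TABLE sensor_readings (
        sensor_id STRING,
        temperature DOUBLE,
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kinesis',
        'stream' = 'sensor-stream',
        'aws.region' = 'us-east-1',
        'scan.stream.initpos' = 'LATEST',
        'format' = 'json'
    )
""")

# Sink: anomalous aggregates written to a second Kinesis data stream.
table_env.execute_sql("""
    CREATE TABLE anomalies (
        sensor_id STRING,
        avg_temperature DOUBLE,
        window_end TIMESTAMP(3)
    ) WITH (
        'connector' = 'kinesis',
        'stream' = 'anomaly-stream',
        'aws.region' = 'us-east-1',
        'format' = 'json'
    )
""")

# Flag sensors whose 10-second average exceeds an illustrative threshold.
table_env.execute_sql("""
    INSERT INTO anomalies
    SELECT
        sensor_id,
        AVG(temperature) AS avg_temperature,
        TUMBLE_END(event_time, INTERVAL '10' SECOND) AS window_end
    FROM sensor_readings
    GROUP BY sensor_id, TUMBLE(event_time, INTERVAL '10' SECOND)
    HAVING AVG(temperature) > 90
""")
```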
The following architecture diagram shows some of the supported sources and sinks for
the Kinesis Data Analytics Flink application. In addition to the pre-bundled connectors
for sources and sinks, you can also bring in custom connectors for a variety of other
sources and sinks for Flink applications on Kinesis Data Analytics.
Apache Flink application on Kinesis Data Analytics for real-time stream processing
Developers can use their preferred IDE to develop Flink applications and deploy them
on Kinesis Data Analytics from the AWS Management Console or DevOps tools.
Using a Studio notebook, you can develop your Flink application code in a notebook
environment, view the results of your code in real time, and visualize it within your
notebook. You can create a Studio notebook powered by Apache Zeppelin and Apache
Flink with a single click from the Kinesis Data Streams and Amazon MSK consoles, or
launch one from the Kinesis Data Analytics console.
After you have iteratively developed the code in Kinesis Data Analytics Studio, you can
deploy a notebook as a Kinesis Data Analytics application that runs continuously in
streaming mode, reading data from your sources, writing to your destinations,
maintaining long-running application state, and scaling automatically based on the
throughput of your source streams. Previously, customers used Kinesis Data Analytics
for SQL Applications for such interactive analytics of real-time streaming data on AWS.
Kinesis Data Analytics for SQL applications is still available, but for new projects, AWS
recommends that you use the new Kinesis Data Analytics Studio. Kinesis Data
Analytics Studio combines ease of use with advanced analytical capabilities, which
makes it possible to build sophisticated stream processing applications in minutes.
To make a Kinesis Data Analytics Flink application fault tolerant, you can use
checkpointing and snapshots, as described in Implementing Fault Tolerance in
Kinesis Data Analytics for Apache Flink.
Kinesis Data Analytics Flink applications are useful for writing complex streaming
analytics applications, such as applications with exactly-once semantics of data
processing and checkpointing capabilities, and for processing data from sources such
as Kinesis Data Streams, Kinesis Data Firehose, Amazon MSK, RabbitMQ, and Apache
Cassandra, including through custom connectors.
After processing streaming data in the Flink application, you can persist data to various
sinks or destinations such as Amazon Kinesis Data Streams, Amazon Kinesis Data
Firehose, Amazon DynamoDB, Amazon Elasticsearch Service, Amazon Timestream,
Amazon S3, and so on. The Kinesis Data Analytics Flink application also provides sub-
second performance guarantees.
You can use the Apache Beam framework with your Kinesis Data Analytics application
to process streaming data. Kinesis Data Analytics applications that use Apache Beam
use the Apache Flink runner to run Beam pipelines.
Summary
By making use of the AWS streaming services Amazon Kinesis Data Streams, Amazon
Kinesis Data Analytics, and Amazon Kinesis Data Firehose, ABC4Logistics can detect
anomalous patterns in temperature readings and notify the driver and the fleet
management team in real time, preventing major incidents such as complete vehicle
breakdown or fire.
ABC1Cabs uses Kibana dashboards for business metrics, debugging, alerting, and
other visualizations. They are interested in Amazon MSK, Amazon EMR with
Spark Streaming, and Amazon ES with Kibana dashboards. Their requirement is to
reduce admin overhead of maintaining Apache Kafka and Hadoop clusters, while using
familiar open-source software and APIs to orchestrate their data pipeline. The following
architecture diagram shows their solution on AWS.
Real-time processing with Amazon MSK, stream processing using Apache Spark
Streaming on EMR, and Amazon Elasticsearch Service with Kibana for dashboards
The cab IoT devices collect telemetry data and send it to a source hub. The source hub
is configured to send data in real time to Amazon MSK using the Apache Kafka
producer library APIs. The Amazon EMR cluster has a Kafka client and Spark
Streaming installed to be able to consume and process the streams of data.
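As a minimal producer-side sketch using the open-source kafka-python client, the broker address and topic name below are placeholders for the MSK cluster's bootstrap brokers; MSK-specific security settings (such as TLS) are omitted for brevity.

```python
import json

from kafka import KafkaProducer  # open-source kafka-python client

# Bootstrap brokers come from the MSK cluster (placeholder address shown).
producer = KafkaProducer(
    bootstrap_servers=["b-1.example.kafka.us-east-1.amazonaws.com:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish a telemetry reading to the topic consumed by Spark Streaming.
producer.send("cab-telemetry", {"cab_id": "cab-17", "speed_kmh": 54, "fuel_pct": 63})
producer.flush()
```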
Spark Streaming has sink connectors which can write data directly to defined indexes of
Elasticsearch. Elasticsearch clusters with Kibana can be used for metrics and
dashboards. Amazon MSK, Amazon EMR with Spark Streaming, and Amazon ES with
Kibana dashboards are all managed services, where AWS manages the
undifferentiated heavy lifting of infrastructure management for the different clusters,
enabling you to build your application using familiar open-source software with a few
clicks. The next section takes a closer look at these services.
You can use Kafka as a streaming data store to decouple producer and consumer
applications and to enable reliable data transfer between the two. While
Kafka is a popular enterprise data streaming and messaging platform, it can be difficult
to set up, scale, and manage in production.
Amazon MSK takes care of these management tasks and makes it easy to set up,
configure, and run Kafka, along with Apache Zookeeper, in an environment following
best practices for high availability and security. You can still use Kafka's control-plane
operations and data-plane operations to manage producing and consuming data.
Because Amazon MSK runs and manages open-source Apache Kafka, it makes it easy
for customers to migrate and run existing Apache Kafka applications on AWS without
needing to make changes to their application code.
Scaling
Amazon MSK offers scaling operations so that users can scale a cluster actively while
it is running. When creating an Amazon MSK cluster, you can specify the instance type
of the brokers at cluster launch. You can start with a few brokers within an Amazon
MSK cluster. Then, using the AWS Management Console or AWS CLI, you can scale
up to hundreds of brokers per cluster.
Alternatively, you can scale your clusters by changing the size or family of your Apache
Kafka brokers. Changing the size or family of your brokers gives you the flexibility to
adjust your MSK cluster’s compute capacity for changes in your workloads. Use the
Amazon MSK Sizing and Pricing spreadsheet (file download) to determine the correct
number of brokers for your Amazon MSK cluster. This spreadsheet provides an
estimate for sizing an Amazon MSK cluster and the associated costs of Amazon MSK
compared to a similar, self-managed, EC2-based Apache Kafka cluster.
After creating the MSK cluster, you can increase the amount of EBS storage per
broker; decreasing the storage is not supported. Storage volumes remain available
during this scaling-up operation. Amazon MSK offers two types of storage scaling
operations: automatic scaling and manual scaling.
With automatic scaling, a storage utilization threshold that you configure triggers the
scaling operation. To increase storage using manual scaling, wait for the cluster to be
in the ACTIVE state. Storage scaling has a cooldown period of at least six hours
between events. Even though the operation makes additional storage available right
away, the service performs optimizations on your cluster that can take up to 24 hours
or more.
Configuration
Amazon MSK provides a default configuration for brokers, topics, and Apache
Zookeeper nodes. You can also create custom configurations and use them to create
new MSK clusters or update existing clusters. When you create an MSK cluster without
specifying a custom MSK configuration, Amazon MSK creates and uses a default
configuration. For a list of default values, see Apache Kafka Configuration.
For monitoring purposes, Amazon MSK gathers Apache Kafka metrics and sends them
to Amazon CloudWatch, where you can view them. The metrics that you configure for
your MSK cluster are automatically collected and pushed to CloudWatch. Monitoring
consumer lag enables you to identify slow or stuck consumers that aren't keeping up
with the latest data available in a topic. When necessary, you can then take remedial
actions, such as scaling or rebooting those consumers.
• Apache Flink — Apache Flink can also be used for scenarios where data requires mapping or
transformation actions before submission to the destination cluster. Apache Flink
provides connectors for Apache Kafka with sources and sinks that can read data
from one Apache Kafka cluster and write to another. Apache Flink can be run on
AWS by launching an Amazon EMR cluster or by running Apache Flink as an
application using Amazon Kinesis Data Analytics.
• AWS Lambda — With support for Apache Kafka as an event source for AWS
Lambda, customers can now consume messages from a topic via a Lambda
function. The AWS Lambda service internally polls for new records or messages
from the event source, and then synchronously invokes the target Lambda
function to consume these messages. Lambda reads the messages in batches
and provides the message batches to your function in the event payload for
processing. Consumed messages can then be transformed and/or written directly
to your destination Amazon MSK cluster, as in the sketch following this list.
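The following is a minimal sketch of such a handler. With an MSK event source, Lambda groups records under topic-partition keys and base64-encodes each message value; the transformation shown is a hypothetical placeholder.

```python
import base64
import json

def handler(event, context):
    # Records are grouped under "topic-partition" keys in the event.
    for topic_partition, records in event["records"].items():
        for record in records:
            # Message values arrive base64-encoded.
            message = json.loads(base64.b64decode(record["value"]))
            # Hypothetical transformation before writing downstream.
            message["processed"] = True
            print(topic_partition, json.dumps(message))
```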
Amazon EMR provides the capabilities of Spark and can be used to start Spark
streaming to consume data from Kafka. Spark Streaming is an extension of the core
Spark API that enables scalable, high-throughput, fault-tolerant stream processing of
live data streams.
You can create an Amazon EMR cluster using the AWS Command Line Interface (AWS
CLI) or on the AWS Management Console and select Spark and Zeppelin in advanced
configurations while creating the cluster. As shown in the following architecture diagram,
data can be ingested from many sources such as Apache Kafka and Kinesis Data
Streams, and can be processed using complex algorithms expressed with high-level
functions such as map, reduce, join, and window. For more information, see
Transformations on DStreams.
Processed data can be pushed out to filesystems, databases, and live dashboards.
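As an end-to-end illustration, the following is a condensed Structured Streaming sketch (the current Spark streaming API on EMR) that consumes the Kafka topic and writes windowed aggregates out. The broker address, topic, schema, and console sink are placeholders; a real deployment would write to a durable sink such as Elasticsearch or Amazon S3.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, from_json, window
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

spark = SparkSession.builder.appName("CabTelemetry").getOrCreate()

# Schema of the JSON telemetry payloads (illustrative fields).
schema = StructType([
    StructField("cab_id", StringType()),
    StructField("speed_kmh", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read the telemetry topic from the MSK cluster (placeholder broker).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers",
               "b-1.example.kafka.us-east-1.amazonaws.com:9092")
       .option("subscribe", "cab-telemetry")
       .load())

# Parse the JSON payload and compute per-cab averages in 1-minute windows.
metrics = (raw.select(from_json(col("value").cast("string"), schema).alias("t"))
           .select("t.*")
           .groupBy(window(col("event_time"), "1 minute"), col("cab_id"))
           .agg(avg("speed_kmh").alias("avg_speed")))

# Write results out; swap the console sink for a durable one in production.
query = (metrics.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()
```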
By default, Apache Spark Structured Streaming uses a micro-batch execution model.
However, since Spark 2.3, Apache Spark has offered a new low-latency processing
mode called Continuous Processing, which can achieve end-to-end latencies as low as
one millisecond with at-least-once guarantees.
Without changing the Dataset/DataFrames operations in your queries, you can choose
the mode based on your application requirements. Some of the benefits of Spark
Streaming are:
• It can recover both lost work and operator state (such as sliding windows) out of
the box, without any extra code on your part.
• By running on Spark, Spark Streaming lets you reuse the same code for batch
processing, join streams against historical data, or run ad-hoc queries on the
stream state and build powerful interactive applications, not just analytics.
• After the data stream is processed with Spark Streaming, the Elasticsearch sink
connector can be used to write data to the Amazon ES cluster, and in turn,
Amazon ES with Kibana dashboards can be used as the consumption layer.
Kibana is an open-source data visualization and exploration tool used for log and time-
series analytics, application monitoring, and operational intelligence use cases. It offers
powerful and easy-to-use features such as histograms, line graphs, pie charts, heat
maps, and built-in geospatial support.
Kibana provides tight integration with Elasticsearch, a popular analytics and search
engine, which makes Kibana the default choice for visualizing data stored in
Elasticsearch. Amazon ES provides an installation of Kibana with every Amazon ES
domain. You can find a link to Kibana on your domain dashboard on the Amazon ES
console.
Summary
With Apache Kafka offered as a managed service on AWS, you can focus on
consumption rather than on managing the coordination between the brokers, which
usually requires a detailed understanding of Apache Kafka. Features such as high
availability, broker scalability, and granular access control are managed by the Amazon
MSK platform.
Spark Streaming on Amazon EMR supports real-time analytics of streaming data, with
results published to Kibana on Amazon Elasticsearch Service as the visualization layer.
Conclusion
This document reviewed several scenarios for streaming workflows. In these scenarios,
streaming data processing provided the example companies with the ability to add new
features and functionality.
By analyzing data as it is created, you gain insights into what your business is doing
right now. AWS streaming services enable you to focus on your application and on
making time-sensitive business decisions, rather than on deploying and managing the
infrastructure.
Contributors
The following individuals and organizations contributed to this document:
Document versions
Date               Description
September 1, 2021  Updated for technical accuracy