
Submission for: Stream Processing and Analytics
(Merged - CCZG556/SSZG556) (S1-24)

Under the guidance of:
Prof. SURYA PRAKASH GOTETI
([email protected])

Submission on: Questions on AWS with Amazon Kinesis

Group Details (Group 20):
1. 2023MT03090 Ajinkya Sawant
3. 2023MT03093 Sanjeev Saxena
4. 2023MT03122 Venkatasai Visweswar K
5. 2023MT03113 Sunil Behera

Contribution by each member:
1. Question 1: Use case scenario - Sanjeev Saxena
2. Question 2: Critical challenge - Ajinkya Sawant
3. Question 3: General architecture used - Sunil Behera
4. Question 4: Tools / platforms that can be used - Venkatasai Visweswar K

Introduction:
You are appointed as a Streaming Analytics expert for a firm looking to use
the solutions and platforms available in the streaming analytics space. As
the firm's maturity level in the big data space is at a very nascent stage, you
need to help them understand how streaming analytics is helpful in their
several use cases, and further to identify the various tools and platforms
that can be leveraged for this activity.

Amazon Web Services (AWS) is a leading player in the cloud computing
space. They have developed a dedicated cloud service named "Amazon
Kinesis" exclusively for handling various streaming analytics use cases in a
much simpler manner. To introduce these services to the world, they have
also prepared detailed documentation, part of which is a white paper. You
can refer to this white paper while interacting with the client.

Q1. You need to introduce the client to several examples where streaming
analytics has already been used. For that purpose, you need to formulate one
example of each type of real-time application scenario mentioned in the white
paper.

• The example should be different from the ones discussed in the document
• The narration should include a brief description of the use case scenario
• A short explanation of how it can leverage streaming analytics solutions /
platforms
• A justification of how it falls under the category

Answers / Discussion

The following scenarios demonstrate how streaming analytics solutions, such as Amazon
Kinesis, can be used to process real-time data streams and make critical decisions, providing
immediate insights and actions across various industries and use cases.
1. Real-Time Monitoring for Predictive Maintenance

Use Case Scenario:

A manufacturing plant runs a variety of machinery that constantly produces data from embedded
sensors, including temperature, vibration, and operational speed. Predictive maintenance seeks to
avoid unplanned downtime by anticipating when machines will fail and scheduling repairs before
they break.

How Streaming Analytics Can Help:

Streaming analytics can continuously ingest sensor data and apply machine learning models in
real-time to detect anomalies or deviations from normal machine behavior. By analyzing the
patterns and identifying early signs of wear and tear, the system can predict when a machine
might fail and trigger maintenance requests in real-time.
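
To make this concrete, here is a minimal Python (boto3) sketch of the ingestion side, assuming a hypothetical Kinesis stream named plant-sensors; the hard-coded alarm rule is a toy stand-in for the machine learning model mentioned above, and the thresholds are illustrative assumptions.

    import json
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")  # region is an assumption

    def publish_reading(machine_id: str, temperature: float, vibration: float) -> None:
        """Push one sensor reading into the (hypothetical) 'plant-sensors' stream."""
        record = {"machine_id": machine_id, "temperature": temperature, "vibration": vibration}
        kinesis.put_record(
            StreamName="plant-sensors",               # assumed stream name
            Data=json.dumps(record).encode(),
            PartitionKey=machine_id,                  # keeps each machine's readings together
        )

    def looks_anomalous(temperature: float, vibration: float) -> bool:
        """Toy stand-in for a trained model: flag readings outside safe bounds."""
        return temperature > 90.0 or vibration > 7.5  # illustrative thresholds

A consumer applying looks_anomalous to each record could then open a maintenance ticket the moment a machine drifts out of its normal operating envelope.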

Justification:

This falls under the Predictive Maintenance category, where real-time data processing enables
organizations to proactively address equipment issues before they turn into costly downtime or
major breakdowns.

2. Dynamic Pricing in E-Commerce

Use Case Scenario:

An e-commerce platform dynamically adjusts prices of products based on supply, demand,
competitor pricing, and user behavior. For example, a popular item might see its price fluctuate
in real-time based on how many users are actively viewing it or how many are purchasing it.

How Streaming Analytics Can Help:

By analyzing real-time data streams from user interactions (clicks, views, purchases) as well as
external factors like competitor pricing updates, streaming analytics can help the platform adjust
product prices in real-time to maximize profit and stay competitive in a fast-moving market.
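
As a rough illustration of the per-event rule such a pipeline might evaluate, the Python sketch below reprices an item from live demand signals; the field names, bounds, and adjustment factors are all illustrative assumptions, not a prescribed pricing policy.

    def reprice(base_price: float, active_viewers: int, purchases_last_hour: int,
                competitor_price: float) -> float:
        demand = active_viewers * 0.01 + purchases_last_hour * 0.05
        price = base_price * (1.0 + min(demand, 0.30))   # cap the demand surge at +30%
        price = min(price, competitor_price * 1.05)      # stay within 5% of competitors
        return round(max(price, base_price * 0.80), 2)   # never drop below -20% of base

    print(reprice(base_price=100.0, active_viewers=250,
                  purchases_last_hour=12, competitor_price=104.0))  # -> 109.2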

Justification:

This use case fits under the Real-Time Pricing Optimization category, as it leverages streaming
data to adjust product prices dynamically based on multiple live factors, ensuring
up-to-the-minute pricing that reflects market conditions.
3. Real-Time Fraud Detection in Financial Transactions

Use Case Scenario:

A financial services company aims to reduce fraudulent transactions in online banking. With
transactions happening in real-time, fraud detection systems need to instantly identify unusual
patterns that could indicate fraudulent behavior, such as unusual transaction locations, irregular
purchase amounts, or rapid transfers between accounts.

How Streaming Analytics Can Help:

Streaming analytics can analyze transaction data as it happens, comparing it to historical patterns
and flagging any outliers or suspicious behavior in real-time. This allows the system to
automatically block suspicious transactions or trigger alerts for manual review.
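
A minimal, self-contained Python sketch of the flagging logic follows, assuming a simple per-account rolling baseline and a 3-sigma rule; a production system would use richer features and trained models rather than this toy statistic.

    from collections import defaultdict, deque
    from statistics import mean, stdev

    history = defaultdict(lambda: deque(maxlen=50))  # last 50 amounts per account

    def is_suspicious(account_id: str, amount: float) -> bool:
        past = history[account_id]
        suspicious = False
        if len(past) >= 10:                          # need a baseline before flagging
            mu, sigma = mean(past), stdev(past)
            suspicious = sigma > 0 and abs(amount - mu) > 3 * sigma
        past.append(amount)
        return suspicious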

Justification:

This example belongs to the Real-Time Fraud Detection category, where analyzing data in
real-time is critical for preventing fraud, minimizing losses, and ensuring the safety of
customers’ accounts.

4. Live Content Recommendations in Streaming Services

Use Case Scenario:

A video streaming service wants to personalize content recommendations for users based on
their current viewing behavior. As users watch shows or movies, the system suggests relevant
content in real-time, based on their preferences and what similar users are watching.

How Streaming Analytics Can Help:

Streaming analytics allows the platform to analyze user activity, such as how much of a video
they’ve watched, whether they pause or rewind, and what genres they prefer, to generate
real-time recommendations that improve user engagement and increase viewing time.

Justification:

This falls under the Real-Time Personalization category, where streaming analytics are used to
instantly analyze user behavior and tailor content recommendations to keep users engaged with
the platform.
5. Traffic Management for Smart Cities

Use Case Scenario:

A city’s transportation department uses smart traffic lights and IoT-enabled vehicles to manage
traffic flow and reduce congestion in real-time. The system collects data from traffic cameras,
sensors embedded in roads, and GPS devices in vehicles to adjust signal timings dynamically.

How Streaming Analytics Can Help:

By analyzing real-time traffic data, such as vehicle speeds, traffic density, and accident reports,
streaming analytics can adjust traffic light timings, reroute vehicles, and inform drivers of
optimal routes to reduce traffic congestion and improve travel times.

Justification:

This fits into the Real-Time Decision-Making category because the system processes vast
amounts of real-time data to make immediate adjustments to traffic signals and routing decisions,
significantly impacting how vehicles move through the city.

Q2. You are in a meeting with the firm’s management, who are a little bit
concerned about the challenges associated with streaming analytics. The white
paper describes a few challenges faced while adopting streaming analytics. To
assist the client:

• Briefly narrate the four critical challenges in your own words
• Identify the different tools that can be used to resolve / mitigate those
challenges
• Address how each of the challenges is resolved with the tools / platforms
identified

Answer/ Discussion:

The white paper outlines several key challenges that organizations face when adopting streaming
analytics. These include:
1. Building Custom Data Pipelines: Building custom streaming data pipelines is
tedious to set up and requires dedicated resources. Organizations need to develop
systems that can collect, process, and transmit data from diverse sources in
real-time. Tuning storage and compute resources for low-latency, high-throughput
workloads is also expertise- and capital-intensive.

2. Scaling and Managing Infrastructure: For systems to respond quickly to varying data
rates, they must be able to automatically scale up or down in real-time. This means
orchestrating potentially thousands of servers and making sure that data spikes are
handled seamlessly. It is difficult to scale efficiently without compromising
performance.

3. System Monitoring and Failure Recovery: Ensuring system uptime and reliability is
important, since a system failure can lead to lost or inaccurate data. Organizations
need to monitor their systems in near real-time and be able to recover from any
network or server failure without duplicating or losing data. This makes the design
and operation of streaming platforms more complex.

In the meeting with the firm’s management, we can explain the four critical challenges
associated with streaming analytics in a simplified manner based on the insights from the white
paper:

1. Handling Data at Ultra-High Speed and Volume: Real-time analytics means analyzing
and acting on incoming data as it arrives in vast quantities, often on a continuous basis.
This is different from batch processing, where analysis and action can be deferred.
Processing capabilities must accommodate millions of events per second, and
traditional data processing architectures quickly become a bottleneck at this speed
and volume.

2. Data Processing in Real Time: Streaming data changes continuously, so an
organization working in real-time cannot defer analysis to a later batch run.
Specialized tools and techniques are required to produce relevant results in as
short a period as possible. Building such a real-time pipeline in-house is neither
cheap nor easy, mainly because of the need to recruit specialists to manage
storage, compute power, and networking.

3. Infrastructure Cost-Effectiveness: Data ingestion rates can vary so much that
organizations have to build data processing and storage systems able to handle both
average and peak conditions. If demand is very volatile, adequate capacity must be
in place so that no delay occurs when high volumes of data arrive, while at the
same time costs must be kept in check during periods of low demand. The critical
challenge is ensuring elasticity without being unduly wasteful or inefficient.

4. Mitigating Data Integrity Issues and Stream Outages: With streaming data,
degradation of systems or the network may result in the loss of some data, or cause
the same data to be delivered more than once, which distorts how the data is
analyzed and the decisions that are made. Persistence and monitoring mechanisms
must be in place to catch failures in real-time, resume the data stream, and
maintain data correctness and integrity. This adds to system design and
operational complexity.

Identify the different tools that can be used to resolve / mitigate those
challenges

Here is how each challenge can be mitigated using specific tools:

1. Handling Data at High Speed and Volume


a. Amazon Kinesis Streams: Designed to ingest and process massive amounts of
real-time data, Kinesis Streams can handle millions of events per second. It allows
multiple producers (e.g., IoT devices, applications, log data) to feed data into
streams simultaneously and supports custom processing pipelines for real-time
analytics.
b. Amazon Kinesis Firehose: For easier integration, Firehose automatically captures,
transforms, and loads streaming data into AWS data stores like Amazon S3,
Redshift, or Elasticsearch. It can scale automatically to accommodate varying data
volumes without needing manual intervention.
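
As a brief illustration of how little producer-side code Firehose requires, the Python (boto3) sketch below hands events to an assumed delivery stream named clickstream-to-s3; Firehose then batches and loads them into S3 with no scaling logic in the producer at all.

    import json
    import boto3

    firehose = boto3.client("firehose", region_name="us-east-1")  # region is an assumption

    def deliver(event: dict) -> None:
        firehose.put_record(
            DeliveryStreamName="clickstream-to-s3",   # assumed delivery stream name
            Record={"Data": (json.dumps(event) + "\n").encode()},  # newline-delimited JSON
        )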

2. Real-Time Data Processing Complexity


a. Amazon Kinesis Analytics: Simplifies real-time processing by allowing you to
write SQL queries on streaming data. This reduces the complexity of custom
application development, as businesses can use familiar SQL to filter, aggregate,
and transform streaming data in real time.
b. AWS Lambda: This serverless compute service can be integrated with Kinesis
Streams or Firehose to trigger custom actions in response to specific data events.
For instance, Lambda can be used to process and react to streaming data without
managing underlying infrastructure, making real-time processing more
manageable (see the handler sketch below).
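
A minimal sketch of such a Lambda handler in Python: when wired to a Kinesis stream, Lambda invokes it with a batch of base64-encoded records. The amount threshold and the print-based alert are illustrative placeholders for real business rules and an SNS or e-mail integration.

    import base64
    import json

    def handler(event, context):
        for record in event["Records"]:
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            if payload.get("amount", 0) > 10_000:             # illustrative rule
                print(f"ALERT: large transaction {payload}")  # stand-in for SNS/e-mail
        return {"processed": len(event["Records"])}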
3. Scaling Infrastructure Efficiently
a. Kinesis Firehose: This service scales automatically to handle data spikes, ensuring
that the infrastructure can manage fluctuating volumes of data efficiently.
Businesses do not need to manually provision or manage resources, which helps
reduce operational overhead.
b. Amazon Auto Scaling: For more complex applications built on top of Kinesis or
other AWS services, Auto Scaling ensures that the necessary compute resources
(EC2 instances, DynamoDB tables, etc.) scale up or down based on real-time
demand, optimizing costs and performance.
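
For example, a DynamoDB table's write capacity can be registered with the Application Auto Scaling API so it tracks real-time demand. In this Python (boto3) sketch, the table name and capacity bounds are assumptions.

    import boto3

    autoscaling = boto3.client("application-autoscaling")

    # Register the (assumed) table's write capacity as a scalable target.
    autoscaling.register_scalable_target(
        ServiceNamespace="dynamodb",
        ResourceId="table/toll-transactions",                 # assumed table name
        ScalableDimension="dynamodb:table:WriteCapacityUnits",
        MinCapacity=5,
        MaxCapacity=500,
    )

    # Attach a target-tracking policy: keep write utilization near 70%.
    autoscaling.put_scaling_policy(
        PolicyName="toll-writes-target-tracking",
        ServiceNamespace="dynamodb",
        ResourceId="table/toll-transactions",
        ScalableDimension="dynamodb:table:WriteCapacityUnits",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 70.0,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "DynamoDBWriteCapacityUtilization"
            },
        },
    )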

4. Managing System Failures and Data Accuracy


a. Amazon Kinesis Streams: Ensures data durability and elasticity by replicating
data across multiple availability zones. This helps recover from failures quickly
while maintaining data consistency.
b. Amazon CloudWatch: Provides real-time monitoring and alerting for Kinesis
services and other AWS components, enabling businesses to quickly identify and
respond to any issues such as data backlogs or failures. It also tracks metrics to
ensure data processing and recovery remain on track.
c. Amazon S3: Can be used as an intermediate storage for raw streaming data with
high durability. Kinesis Firehose automatically stores data to S3, ensuring that
data is not lost during processing and can be retrieved in case of failure.
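
One concrete monitoring pattern is a CloudWatch alarm on Kinesis iterator age, a standard signal that consumers are falling behind the stream. In this Python (boto3) sketch, the alarm name, stream name, and threshold are assumptions.

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    cloudwatch.put_metric_alarm(
        AlarmName="kinesis-consumer-lag",                     # assumed alarm name
        Namespace="AWS/Kinesis",
        MetricName="GetRecords.IteratorAgeMilliseconds",      # how far consumers lag
        Dimensions=[{"Name": "StreamName", "Value": "plant-sensors"}],  # assumed stream
        Statistic="Maximum",
        Period=60,
        EvaluationPeriods=5,
        Threshold=60_000,                                     # one minute of backlog
        ComparisonOperator="GreaterThanThreshold",
    )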

Address how each of the challenges is resolved with the tools / platforms
identified

Here is how each identified challenge in streaming analytics is resolved using the Amazon
Kinesis tools and other AWS services:
1. Handling Data at High Speed and Volume
a. Tools:
i. Amazon Kinesis Streams: Handles large-scale data ingestion by allowing
multiple producers to continuously send data into the stream. It can
process millions of records per second, making it ideal for scenarios with
massive data throughput like IoT, application logs, and website
clickstreams.
ii. Amazon Kinesis Firehose: Automatically scales based on data volume,
eliminating the need for manual scaling. It ingests real-time data and
delivers it to storage or analytics services like Amazon S3, Redshift, or
Elasticsearch. Firehose can also batch, compress, and encrypt data,
making it efficient for handling high-volume data streams.
b. Resolution: Kinesis Streams and Firehose absorb high-speed data streams, scale
to match the incoming data, and ensure no data loss or latency issues, even as data
volumes fluctuate.

2. Real-Time Data Processing Complexity


a. Tools:
i. Amazon Kinesis Analytics: Simplifies real-time data processing by
enabling SQL queries on streaming data. Businesses can use SQL to
aggregate, filter, and analyze real-time data without building complex
custom applications. This allows immediate insights without advanced
programming.
ii. AWS Lambda: Can be integrated with Kinesis Streams or Firehose to
process individual records as they are ingested. Lambda’s serverless
nature means that it automatically scales to handle the incoming data,
running code only when needed. This allows for real-time event handling,
such as sending alerts or triggering downstream processes.
b. Resolution: Kinesis Analytics simplifies real-time processing by allowing
businesses to use familiar SQL tools, while Lambda automates actions in response
to real-time data without managing infrastructure.

3. Scaling Infrastructure Efficiently


a. Tools:
i. Kinesis Firehose: Automatically scales with data flow, adjusting resources
based on the volume of incoming data without manual intervention. It
manages both high and low throughput scenarios, allowing seamless
scalability.
ii. Amazon Auto Scaling: For more complex applications (e.g., EC2
instances or DynamoDB databases), Auto Scaling adjusts the number of
resources based on real-time demand. This ensures that the application can
handle peak data volumes while optimizing cost by scaling down during
quiet periods.
b. Resolution: Firehose and Auto Scaling eliminate the need for manual resource
provisioning, ensuring that the infrastructure adjusts dynamically to meet
real-time data demands, maintaining performance and reducing costs during idle
times.

4. Managing System Failures and Data Accuracy


a. Tools:
i. Amazon Kinesis Streams: Ensures data durability by replicating data
across multiple availability zones. This redundancy helps recover data in
the event of a failure, ensuring the integrity of the data stream.
ii. Amazon CloudWatch: Provides monitoring for Kinesis services, alerting
you to data processing issues, backlogs, or failures in real time.
CloudWatch metrics help ensure data processing continuity and recovery
after failures.
iii. Amazon S3: When used with Firehose, S3 acts as a durable, long-term
storage solution for raw streaming data. This ensures that even if data
processing fails, the raw data is archived and can be replayed or
reprocessed without loss.
b. Resolution: Kinesis Streams’ replication ensures data is not lost, while
CloudWatch enables real-time monitoring and alerts. Firehose's integration with
S3 ensures raw data is backed up, and systems can recover or reprocess data in
case of failure without duplications or gaps in the data.

In summary, the tools identified—Kinesis Streams, Firehose, Analytics, Lambda, Auto Scaling,
CloudWatch, and S3—collectively address the core challenges of high-speed data ingestion,
real-time processing complexity, efficient scaling, and system reliability.

Q3. The white paper discusses three different use cases which the toll station
company has addressed using streaming data. But the solution is described in terms
of various cloud services offered by AWS. The client does not have knowledge
about cloud computing and AWS. In fact, all three use cases can be very well
addressed with a general architecture used in big data analytics and streaming
analytics. You need to work on helping the client understand those common
architectures.

• Identify the architecture that can be fitted well for capturing all three use cases
• Convert the final architecture diagram provided by AWS team into an architecture
diagram based upon your answer to earlier question
• Take care that all three cases should be vividly coming out of the architecture diagram, if
required add brief description about each flow

Identify the architecture that can be fitted well for capturing all three use cases
To address the client's needs without diving too deep into specific AWS services, it’s beneficial to
discuss a general architectural framework for streaming data that can address the three use cases
outlined in the white paper.

The key to this architecture is designing a system that can handle continuous data ingestion, real-
time processing, storage for analytics, and immediate response mechanisms. Below, I’ll outline a
generalized architecture suitable for these needs without tying it specifically to AWS or any other
cloud provider's services.

General Architecture for Streaming Analytics

Data Ingestion Layer


Captures and collects data streams from various sources like sensors, applications, or user
interactions. In the toll station scenario, this includes data from toll booths.
Components: Data producers (toll booth sensors) send data to a messaging system capable of
handling high throughput and low-latency data ingestion.

Stream Processing Layer


Processes data in real-time to perform transformations, aggregations, filtering, and more. This
allows immediate actions based on the streaming data, such as calculating billing amounts or
generating alerts.
Components: Stream processing engines that can process continuous streams and execute complex
analytics and decision-making algorithms on-the-fly. This layer is crucial for real-time analytics
and immediate response mechanisms.
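The sketch below shows, in framework-free Python, the essential operation this layer performs: grouping events into tumbling one-minute windows and aggregating per window. The field names and window size are assumptions; a real engine (Flink, Kinesis Analytics) adds late-event handling, checkpointing, and window-close triggers.

    from collections import defaultdict

    WINDOW_SECONDS = 60

    def window_key(event_time: float) -> int:
        # All events within the same minute share one tumbling-window key.
        return int(event_time // WINDOW_SECONDS)

    counts: dict[int, int] = defaultdict(int)

    def process(event: dict) -> None:
        counts[window_key(event["timestamp"])] += 1
        # When a window closes, its count would be emitted downstream,
        # e.g. to the storage layer or the response layer.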

Data Storage Layer


Provides storage for processed and raw data for longer-term analysis, historical data analysis, or
incremental processing. It supports both real-time and batch processing needs.
Components: A combination of databases and data warehouses/lakes. Real-time processed data
can be pushed to NoSQL databases for quick retrieval, while batch-processed or historical data
can be stored in data lakes or warehousing solutions that support massive data aggregations and
complex queries.

Response and Notification Layer


Sends notifications and triggers actions based on specific data conditions or thresholds being met.
This can include sending alerts about system performance, billing issues, or operational warnings.
Components: An event-driven approach wherein specific data patterns or outputs from the stream
processing layer trigger actions—like sending SMS, emails, or push notifications. This could also
interface with an orchestration system for automating responses based on the insights derived from
the streaming data.
Analytics and Dashboard Layer
Provides business intelligence and insights through dashboards, reports, and ad-hoc queries to help
business users make informed decisions.
Components: BI tools that integrate well with the data storage layer to visualize and analyze data
in meaningful ways. This includes real-time dashboards that reflect the current state of the system
and operations.

Convert the final architecture diagram provided by AWS team into an architecture diagram
based upon your answer to earlier question


As described above, the idea is to generalize the AWS-specific solutions into a cloud-agnostic
architecture while still addressing the streaming data needs of the Toll Company's Use Cases

Fig 1: Implementing a broad architecture

Data Ingestion Layer: Toll stations will have sensors installed at each toll booth to capture vehicle
data, which is then sent to a central stream processing system.

Stream Processing Layer: This system processes the data in real-time to apply business rules,
such as calculating spending or identifying operational anomalies, and forwards relevant data to
both the storage layer and the response layer immediately.

Data Storage Layer: Processed data that needs to be analyzed later is stored in a data lake for
deep analysis or in a NoSQL database for quick retrieval and ongoing operations.

Response and Notification Layer: Based on the processed stream, notifications like SMS alerts
or application pop-ups can be triggered to inform customers and operators.
Analytics and Dashboard Layer: Business analysts and operational managers can access
real-time dashboards powered by a BI tool that pulls data from the storage layer.
This general architecture provides a platform-agnostic framework useful for discussing potential
solutions with clients unfamiliar with cloud specifics but interested in the capabilities of big data
and streaming analytics.

Take care that all three cases should be vividly coming out of the architecture diagram, if
required add brief description about each flow


To make the architecture more descriptive for the three specific use cases of the toll company—
real-time billing notifications, operational alerts, and data reporting for analysis—let’s refine the
diagram to explicitly outline how each flow works within this generalized architecture. I'll include
a brief description for each case to clarify how data travels and is processed through each layer.

Fig 2: Enhanced Generalized Architecture Diagram for Toll Station Streaming Analytics

This architecture showcases how each component contributes to managing and utilizing real-time
data for operational efficiency and business intelligence. It provides a technology-agnostic
framework that can be adapted to any specific toll management system requirements while
emphasizing data-driven decision support across the organization.

Real-time Billing and Notifications

Flow: As vehicles pass through toll booths, sensors capture transactions and send data to the
ingestion layer.

Processing: The stream processing layer calculates toll charges in real-time. If a customer’s
spending exceeds a predefined limit, this information is processed and passed to the response layer.
Action: The response and notification layer triggers an alert/notification to the customer indicating
that their billing threshold has been exceeded.

Result Viewing: Simultaneously, transaction details are stored in the NoSQL database for
immediate access and in the data lake for historical analysis. Business analysts can view billing
information through real-time dashboards.

Operational Alerts
Flow: Operational data such as vehicle counts per half-hour are collected continuously and fed
into the ingestion layer.

Processing: Real-time stream processing counts vehicles. If the count is below a predetermined
threshold, the event is flagged.

Action: The response and notification layer triggers an operational alert to the operations team for
immediate action.

Result Viewing: Data is stored similarly as in the billing use case, allowing for operational
performance analysis and long-term trend monitoring through dashboards.

Data Analysis for Reporting

Flow: All collected data from toll transactions and operational metrics feeds into the data storage
layer.

Processing: Data is aggregated and transformed for analytical purposes within the stream
processing engine and stored in a data warehouse or data lake.

Action: Analysts use BI tools to generate reports and gain insights, performing ad-hoc queries and
viewing historical data trends.

Result Viewing: The insights help inform business strategy and operational efficiency
improvements, supported by data visualized on dashboards.

Q4. The client is now impressed with the capabilities of AWS and how it is
streamlining application development and deployment. But they also want to
discover more of the open-source tools / platforms that can be leveraged. As a
result, you need to identify the open-source tools for each use case. For each of the
use cases:
• Identify the tools / platforms that can be used to solve it
• Draw a solution diagram using the tools identified in the earlier question; the flow should
come out clearly from the solution diagram

Answer:
Use Case 1: Real-time data ingestion and processing
Goal: Load data into the data warehouse with no more than a 30-minute delay for real-time
analysis.
Tools: Kafka, Flink, PostgreSQL (Hadoop/Hive could serve as an alternative warehouse layer)
Solution Diagram:
Toll sensors → Kafka → Flink → PostgreSQL → Business Intelligence
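
A minimal sketch of the first hop using the kafka-python client, with an assumed local broker and a hypothetical toll-events topic; the event fields are illustrative.

    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",                  # assumed broker address
        value_serializer=lambda v: json.dumps(v).encode(),   # JSON-encode each event
    )

    producer.send("toll-events", {"booth": "B12", "plate": "KA01AB1234", "amount": 45.0})
    producer.flush()   # block until the record is acknowledged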

Use Case 2: Real-time analytics and alerting


Goal: Notify customers within 10 minutes if their cumulative toll surpasses a set threshold.
Tools: Kafka, Flink, RabbitMQ
Solution Diagram:
Toll sensors → Kafka → Flink → RabbitMQ → Notification API (push to customer mobile app)
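
A minimal sketch of the alerting hop using the pika client, assuming a local RabbitMQ broker and a hypothetical toll-alerts queue; in the full flow this publish would be issued by the Flink job once a customer's cumulative toll crosses the threshold.

    import json
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="toll-alerts", durable=True)  # assumed queue name

    alert = {"customer_id": "C-1042", "cumulative_toll": 512.0, "threshold": 500.0}
    channel.basic_publish(exchange="", routing_key="toll-alerts",
                          body=json.dumps(alert).encode())
    connection.close()

The Notification API would consume from this queue and push the message to the customer's mobile app within the 10-minute window.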

Use Case 3: Other Threshold Alerts


Goal: Alert the operations team if traffic falls below a predefined threshold within each
30-minute period.
Tools: Kafka, Flink, Grafana (with Prometheus)
Solution Diagram:
Toll sensors → Kafka → Flink → Grafana (alert to operations team)

Overall Open-source Architecture Solution Diagram


Implementation Detail:
• Kafka: Acts as the real-time data ingestion layer, capturing customer transactions.
• Flink: Processes the incoming stream of data from Kafka, performing transformations and
aggregations to generate insights, and writes processed data to PostgreSQL for analytical
queries and reporting.
• PostgreSQL: Serves as a robust data warehouse for storing structured data, optimized
for complex analytical queries and reporting.
• Grafana (with Prometheus): Provides metrics collection, alerting, and dashboards for the
operations team.
• Notification Service: Push notifications can be mobile notifications, email
notifications, or SMS notifications.
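
To show the warehouse-loading step end to end, here is a compact Python sketch using kafka-python and psycopg2. In the architecture above this logic lives inside the Flink job; it is written as a plain consumer only to keep the sketch self-contained, and the topic name, connection string, and table schema are all assumptions.

    import json
    from kafka import KafkaConsumer
    import psycopg2

    consumer = KafkaConsumer(
        "toll-events-enriched",                               # assumed output topic
        bootstrap_servers="localhost:9092",                   # assumed broker address
        value_deserializer=lambda b: json.loads(b.decode()),
    )

    conn = psycopg2.connect("dbname=tolls user=analytics")    # assumed connection string
    cur = conn.cursor()

    for message in consumer:                                  # blocks, consuming forever
        event = message.value
        cur.execute(
            "INSERT INTO toll_transactions (booth, plate, amount) VALUES (%s, %s, %s)",
            (event["booth"], event["plate"], event["amount"]),
        )
        conn.commit()   # per-event commit keeps the sketch simple; batch in practice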
