22CS911-DEC Unit 5
This document is confidential and intended solely for the educational purpose of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document
contains proprietary information and is intended only to the respective group /
learning community as intended. If you are not the addressee you should not
disseminate, distribute or copy through e-mail. Please notify the sender
immediately by e-mail if you have received this document by mistake and delete
this document from your system. If you are not the intended recipient you are
notified that disclosing, copying, distributing or taking any action in reliance on
the contents of this information is strictly prohibited.
22CS911
DATA ENGINEERING IN CLOUD
Department:
Computer Science and Engineering
Batch/Year: 2022-2026/III
Created by:
Dr. RAJALAKSHMI D, ASP/CSE, RMDEC
Mrs. JASMINE GILDA A, ASP/CSE, RMKEC
Mr. JEMIN V M, ASP/CSE, RMKCET
Date: 18-01-2025
Table of Contents
Sl. No.  Topics
1.  Contents
2.  Course Objectives
6.  CO-PO/PSO Mapping
Unit I : INTRODUCTION 8
Introduction to data Engineering - The Data Engineering Life Cycle - Data Engineering
and Data Science - Data-Driven Organizations: Data-driven decisions - The data
pipeline - The role of the data engineer in data-driven organizations - Modern data
strategies - The Elements of Data: The five Vs of data - volume, velocity, variety,
veracity, and value. Demo: Accessing and Analyzing Data by Using Amazon S3.
Unit II : SECURE AND SCALABLE DATA PIPELINES 10
The evolution of data architectures - Modern data architecture on AWS - Modern data
architecture pipeline: Ingestion and storage - Processing and consumption -
Streaming analytics pipeline - Security of analytics workloads - Scaling - Creating a
scalable infrastructure and components. ETL and ELT comparison - Data wrangling.
Unit III : STORING AND ORGANIZING DATA 9
Comparing batch and stream ingestion - Batch ingestion processing - Purpose-built
ingestion tools - AWS Glue for batch ingestion processing - Kinesis for stream
processing - Scaling considerations for batch processing and stream processing -
Storage in the modern data architecture - Data lake storage - Data warehouse storage
- Purpose-built databases - Storage in support of the pipeline - Securing storage.
Unit IV : PROCESSING BIG DATA AND DATA FOR ML 10
Big data processing concepts - Apache Hadoop - Apache Spark - Amazon EMR -
Managing your Amazon EMR clusters - Apache Hudi - The ML lifecycle - Collecting
data - Applying labels to training data with known targets - Preprocessing data -
Feature engineering - Developing a model - Deploying a model - ML infrastructure on
AWS - SageMaker - Amazon CodeWhisperer - AI/ML services on AWS.
Unit V : DATA ANALYSIS AND VISUALIZATION 8
Analyzing and Visualizing Data: Considering factors that influence tool selection -
Comparing AWS tools and services - Selecting tools for a gaming analytics use case.
Automating the Pipeline: Automating infrastructure deployment - CI/CD - Automating
with Step Functions
Course Outcomes
CO# COs K Level
CO1 Understand data engineering, pipelines & access data in the cloud. K2
CO2 Build secure & scalable data pipelines using AWS services K4
CO3 Choose the right data storage & secure your data pipelines. K3
CO4 Process big data for machine learning with cloud tools K3
K6 Evaluation
K5 Synthesis
K4 Analysis
K3 Application
K2 Comprehension
K1 Knowledge
CO – PO/PSO Mapping Matrix
CO    PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2 PSO3
CO1   3   3   2   1   1   -   -   -   2   -    -    2    3    2    -
CO2   3   2   2   2   2   -   -   -   2   -    -    2    3    2    -
CO3   3   3   2   2   2   -   -   -   2   -    -    2    3    2    -
CO4   3   2   2   2   2   -   -   -   2   -    -    2    3    3    -
CO5   3   2   2   2   2   -   -   -   2   -    -    2    3    2    -
CO6   2   2   1   1   1   -   -   -   2   -    -    2    3    2    -
Lecture Plan – Unit V
Sl. No. | Topic | Number of Periods | Proposed Date | Actual Lecture Date | CO | Taxonomy Level | Mode of Delivery
1 | Analyzing and Visualizing Data | 1 | | | CO5 | K2 | Chalk and Board
Analyzing and Visualizing Data: Considering factors that influence tool selection -
Comparing AWS tools and services - Selecting tools for a gaming analytics use case.
Automating the Pipeline: Automating infrastructure deployment - CI/CD -
Automating with Step Functions
Unit 5 DATA ANALYSIS AND VISUALIZATION
5.1 Analyzing and Visualizing Data: Considering factors that influence tool selection
Data visualization is a crucial aspect of data analysis. It lets you represent data in a graphical format that is easy for the human mind to understand and interpret. With the ever-increasing rise of big data, the demand for data visualization tools has grown significantly.
Data visualization has endless potential. It is not limited to simply analyzing data; it is essential for marketers, organizations, and business leaders who need to draw crucial insights from their data to take their business to the next level. The following factors influence the selection of a data visualization tool.
1. Cost
Ultimately, everything comes down to cost, so it is necessary to consider the cost parameter while choosing a data visualization tool. Different tools have different pricing; thus, analyzing cost-effectiveness is essential.
2. Data Compatibility
Compatibility with your data sources is another essential factor when choosing a data visualization tool. Some tools may work best only with specific data formats, while others may have limitations on the size of data they can handle. Before choosing a data visualization tool, ensure it is compatible with both your data sources and your data size.
3. Platform Licensing
The licensing of data visualization tools can differ. Some vendors offer free products with limited features and require an upgrade for enhanced features. Paid tools are priced per subscription, sometimes based on the number of users and sometimes as a flat monthly fee.
4. Ease of Use
When choosing a data visualization tool, opt for one that is user-friendly. This will help you get better insights without any hassle. Also, make sure to evaluate the skill sets of your team. Look for tools that offer an easy-to-use interface and strong features, making it easy to represent the data.
5. Purpose and Audience
The next thing to consider is the purpose and audience of your visualization. Different goals and audiences need different types of visualizations. For example, if you want to educate your audience about a specific topic, you need a data visualization tool that is informative and delivers a clear message.
6. Features and Functionality
Evaluating the features and functionalities of data visualization tools is essential. These features include data import, manipulation, analysis, sharing, and more.
5.2 Comparing AWS tools and services
• Amazon VPC (Virtual Private Cloud): Provides isolated cloud resources for setting up custom network configurations.
• Alternatives: AWS Transit Gateway (for connecting multiple VPCs), AWS Direct Connect (for a dedicated network connection).
• Amazon Route 53: A scalable Domain Name System (DNS) web service.
• Amazon CloudFront: A content delivery network (CDN) service.
• Best for: Delivering static and dynamic web content globally with low latency.
• Alternatives: AWS Global Accelerator (for global traffic routing), Route 53 (for DNS-based routing).
9. Developer Tools
• AWS CodePipeline: A continuous integration and delivery service for fast and
reliable application updates.
5.3 Selecting tools for a gaming analytics use case
1. Data Ingestion
Amazon Kinesis Data Streams: A real-time streaming data service for capturing player events such as gameplay metrics, in-game actions, and transactions (see the sketch at the end of this subsection).
• Why use it? It enables you to ingest large volumes of data in real time, which is crucial for gaming environments that generate high-frequency events.
• Alternatives: Amazon Managed Streaming for Apache Kafka (MSK) for event streaming, especially if you are already using Kafka in your stack.
Amazon Kinesis Firehose: Delivers real-time streaming data to destinations such as Amazon
S3, Redshift, and Elasticsearch with automatic scaling.
• Why use it? It simplifies the process of capturing, transforming, and loading data into analytics
services without managing infrastructure.
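Relating to the Kinesis Data Streams entry above, the following is a minimal ingestion sketch using boto3; the stream name and event fields are illustrative assumptions, and the stream is assumed to already exist.

import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-west-2")

# Hypothetical gameplay event
event = {"player_id": "p-1001", "action": "level_complete", "level": 7}

kinesis.put_record(
    StreamName="player-events",               # assumed stream name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["player_id"],          # keeps one player's events ordered
)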
2. Data Storage
Amazon S3 (Simple Storage Service): Ideal for storing raw data, logs, and processed
analytics data. S3 provides high availability and durability, and it integrates well with other AWS
analytics tools.
• Why use it? Its scalability, low cost, and integration with analytics tools make it ideal for
storing the large datasets typically generated by gaming events.
Amazon Redshift: A fully managed cloud data warehouse for analyzing large volumes of structured game data.
• Why use it? It supports fast querying and scaling of data warehousing operations, which is ideal for running complex analytics across large data sets.
Amazon DynamoDB: A fully managed NoSQL database service, suitable for low-latency data
storage related to player profiles, leaderboards, and in-game transactions.
• Why use it? DynamoDB’s high availability and low-latency performance are ideal for handling
player-related data that require quick lookups.
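Relating to the DynamoDB entry above, a minimal sketch with boto3 follows; the table and attribute names are assumptions, and the table is assumed to exist with player_id as its partition key.

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Leaderboard")   # assumed table name

# Write a player's latest score, then read the item back for a quick lookup
table.put_item(Item={"player_id": "p-1001", "score": 4200, "level": 7})
response = table.get_item(Key={"player_id": "p-1001"})
print(response.get("Item"))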
3. Data Processing
AWS Glue: A fully managed ETL service to prepare, clean, and transform raw
gameplay data for analytics and reporting.
• Why use it? Glue simplifies data preparation workflows by automating the ETL
process, which is critical when dealing with large datasets.
• Alternatives: Amazon EMR (for big data processing using frameworks like Hadoop or
Spark).
Amazon Athena: An interactive query service that makes it easy to analyze data in
Amazon S3 using standard SQL.
• Why use it? Athena is serverless, so it’s great for running ad-hoc queries on your
gaming data stored in S3 without the need for a complex data warehouse.
Amazon EMR (Elastic MapReduce): A big data platform for running large-scale
distributed data processing tasks, such as processing game logs, in-game analytics,
and behavior modeling.
• Why use it? Ideal for complex, large-scale data processing tasks that require
Hadoop, Spark, or Presto.
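Relating to the Athena entry above, a minimal ad-hoc query sketch with boto3 follows; the Glue database, table, and results bucket are assumptions.

import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT action, COUNT(*) AS events FROM gameplay_events GROUP BY action",
    QueryExecutionContext={"Database": "game_analytics"},                           # assumed Glue database
    ResultConfiguration={"OutputLocation": "s3://game-analytics-query-results/"},   # assumed results bucket
)
print(response["QueryExecutionId"])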
4. Data Analysis and Visualization
Amazon QuickSight: A business intelligence tool that helps you visualize and analyze data to gain insights into game performance, player retention, and in-game behavior.
• Why use it? It allows you to create interactive dashboards and reports that can be
shared with game development teams and stakeholders.
• Alternatives: Tableau on AWS (for more advanced analytics), Power BI (if you are in
a Microsoft ecosystem).
Amazon Redshift Spectrum: Allows you to query data directly in Amazon S3 using
Redshift. It enables you to analyze data stored in your data lake along with data in
Redshift without moving it.
• Why use it? It provides powerful analytics without needing to load all your data into
Redshift, reducing cost and complexity.
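Relating to the Redshift Spectrum entry above, a minimal sketch using the Redshift Data API via boto3 follows; the cluster name, database, user, IAM role placeholder, and table names are assumptions.

import boto3

redshift_data = boto3.client("redshift-data")

# Register the Glue Data Catalog as an external schema (run once)
redshift_data.execute_statement(
    ClusterIdentifier="game-analytics-cluster",   # assumed cluster
    Database="dev",
    DbUser="analytics_user",
    Sql="""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
        FROM DATA CATALOG
        DATABASE 'game_analytics'
        IAM_ROLE '<spectrum-iam-role-arn>'
    """,
)

# Query S3-resident data without loading it into Redshift
redshift_data.execute_statement(
    ClusterIdentifier="game-analytics-cluster",
    Database="dev",
    DbUser="analytics_user",
    Sql="SELECT action, COUNT(*) FROM spectrum.gameplay_events GROUP BY action",
)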
5. Machine Learning and AI
Amazon SageMaker: For building, training, and deploying machine learning models
to predict player behavior, churn rate, or recommend personalized in-game content.
• Why use it? SageMaker offers built-in algorithms and auto-scaling infrastructure,
enabling you to run large-scale models quickly. Use it to predict player lifetime value
or optimize in-game monetization strategies.
• Alternatives: AWS Lambda (for deploying pre-trained models for quick predictions on
real-time data streams).
Amazon Personalize: A machine learning service for delivering personalized recommendations to players.
• Why use it? It is specifically designed for personalized experiences, so you can use it to provide targeted recommendations to players based on their actions and preferences.
Amazon Rekognition: A computer vision service that could be used for player
emotion detection through in-game avatars or analyzing game screenshots/videos.
• Why use it? Great for adding AI-driven features to games that need image or video
analysis for gameplay improvement.
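Relating to the SageMaker entry above, a minimal training sketch with the SageMaker Python SDK follows; the S3 paths, IAM role placeholder, and container version are assumptions.

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "<sagemaker-execution-role-arn>"    # placeholder IAM role

# Built-in XGBoost container for a churn-prediction model
xgboost_image = image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1")

estimator = Estimator(
    image_uri=xgboost_image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://game-analytics-models/churn/",      # assumed output location
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="binary:logistic", num_round=100)
estimator.fit({"train": "s3://game-analytics-features/churn/train/"})   # assumed training data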
6. Monitoring and Logging
Amazon CloudWatch: To monitor your game servers, track performance metrics, and set up automated alarms for service health or player activity spikes.
• Why use it? CloudWatch is great for getting real-time operational insights and
ensuring your game servers are running optimally.
AWS CloudTrail: To monitor and log all activities and API calls within your AWS
environment, ensuring you have full visibility into what’s happening within your gaming
infrastructure.
• Why use it? It helps ensure security and compliance by logging all interactions with
AWS services.
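Relating to the CloudWatch entry above, a minimal sketch follows that publishes a custom player-count metric and creates an alarm on it; the namespace, metric, threshold, and SNS topic placeholder are assumptions.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish a custom metric for the number of concurrent players
cloudwatch.put_metric_data(
    Namespace="Game/Servers",
    MetricData=[{"MetricName": "ConcurrentPlayers", "Value": 1280, "Unit": "Count"}],
)

# Alarm when the five-minute average exceeds an assumed threshold
cloudwatch.put_metric_alarm(
    AlarmName="concurrent-players-spike",
    Namespace="Game/Servers",
    MetricName="ConcurrentPlayers",
    Statistic="Average",
    Period=300,
    EvaluationPeriods=1,
    Threshold=5000,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["<sns-topic-arn>"],   # placeholder notification target
)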
7. Scaling and Load Balancing
Amazon GameLift: A managed service for deploying, operating, and scaling dedicated game servers.
• Why use it? GameLift automates server scaling based on player demand, ensuring optimal performance and reducing costs.
• Alternatives: EC2 Auto Scaling (for custom server management and scaling).
AWS Elastic Load Balancing (ELB): Distributes incoming traffic across multiple
targets, ensuring your gaming servers are highly available.
• Why use it? Load balancing is critical in preventing downtime or performance issues
during heavy player traffic spikes.
8. Security
AWS Shield: Protects your gaming servers from Distributed Denial of Service (DDoS)
attacks.
• Why use it? Games are often targeted by DDoS attacks, and Shield helps mitigate
those risks.
• Alternatives: AWS WAF (Web Application Firewall) for more granular protection of
web-based game portals.
AWS Cognito: Provides user authentication, authorization, and user management for
player accounts.
• Why use it? Cognito is useful for managing player sign-ups, log-ins, and access
controls within your game.
5.4 Automating the Pipeline: Automating infrastructure deployment
Streamlining deployment pipelines and automation is crucial for efficient software
development processes. It involves automating the steps required to build, test, and
deploy software applications, reducing manual effort, and minimizing the risk of errors.
In this section, we will explore the benefits of streamlining deployment pipelines and
automation, along with best practices to optimize the software delivery process.
Benefits of Streamlining Deployment Pipelines and Automation
Speed and Agility: With streamlined deployment pipelines, software changes can be
deployed rapidly and frequently. This allows teams to adopt agile development
practices, respond quickly to customer needs, and deliver new features and updates in
a timely manner.
Best Practices for Streamlining Deployment Pipelines and Automation
1. Infrastructure as Code
Adopt infrastructure as code (IaC) practices to define and manage your deployment
infrastructure. Tools like Terraform or AWS CloudFormation allow you to codify your
infrastructure, enabling consistent and reproducible deployments across different
environments.
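As a minimal sketch of this practice, in the AWS CDK (Python) style used later in this unit (the stack and bucket names are illustrative assumptions), a piece of infrastructure can be codified like this:

import aws_cdk as cdk
from aws_cdk import aws_s3 as s3

# Hypothetical stack: codifies a single S3 bucket for analytics data so the same
# definition can be deployed consistently across environments.
class AnalyticsStorageStack(cdk.Stack):
    def __init__(self, scope: cdk.App, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        s3.Bucket(
            self,
            "RawEventsBucket",
            versioned=True,                              # keep prior object versions
            encryption=s3.BucketEncryption.S3_MANAGED,   # server-side encryption
        )

app = cdk.App()
AnalyticsStorageStack(app, "AnalyticsStorageStack")
app.synth()

Running cdk deploy on this app creates or updates the bucket, and cdk destroy removes it, which keeps environments consistent and reproducible.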
3. Automated Testing
Integrate automated testing into your deployment pipeline to validate the functionality
and quality of your software. Automated tests, such as unit tests, integration tests, and
end-to-end tests, help identify issues early in the development process, preventing
regressions and improving overall software quality.
4. Monitoring and Alerting
Implement robust monitoring and alerting systems to track the performance and health of your deployed applications. Real-time monitoring helps identify and address issues promptly, minimizing downtime and ensuring a positive user experience.
5.5 CI/CD
The first step in building a CI/CD pipeline for infrastructure automation is to define
infrastructure as code (IaC). This involves using configuration files or templates to
define the desired state of infrastructure resources, such as virtual machines, networks,
and storage. Popular IaC tools include Terraform, AWS CloudFormation, and Azure
Resource Manager (ARM).
For example, using Terraform, we can define a simple infrastructure configuration file
(main.tf) as follows:
provider "aws" {
  region = "us-west-2"
}

resource "aws_instance" "example" {
  ami           = "ami-abc123"
  instance_type = "t2.micro"
}
This code defines an AWS provider and a single EC2 instance resource with a specific
AMI and instance type.
CI/CD Pipeline Tools
Once we have defined our IaC, we need to choose a CI/CD pipeline tool to automate
the deployment and management of our infrastructure. Popular options include
Jenkins, GitLab CI/CD, and CircleCI. For this example, we will use GitLab CI/CD.
To configure a GitLab CI/CD pipeline, we create a .gitlab-ci.yml file in the root of our
repository. This file defines the stages, jobs, and scripts that make up our pipeline.
In its simplest form, the pipeline has a single stage (deploy) containing a single job. The job runs two commands: terraform init to initialize the Terraform working directory, and terraform apply -auto-approve to apply the infrastructure configuration. The complete file, including a test stage, is shown below.
To integrate our CI/CD pipeline with version control, we configure the pipeline to run on changes to our IaC configuration files. In GitLab CI/CD, one way to do this is to restrict each job to the main branch with an only section in the .gitlab-ci.yml file:
only:
  - main
This configuration tells GitLab CI/CD to run the job on changes to the main branch of our repository.
To ensure that our infrastructure is deployed correctly, we need to include testing and
validation steps in our pipeline. This can involve running scripts to verify the state of
our infrastructure resources, such as checking the status of EC2 instances or validating
network configurations.
For example, we can add a test stage to our pipeline to run a script that verifies the
status of our EC2 instance:
stages:
  - deploy
  - test

deploy:
  stage: deploy
  script:
    - terraform init
    - terraform apply -auto-approve
  only:
    - main

test:
  stage: test
  script:
    - aws ec2 describe-instances --instance-ids $(terraform output -json | jq -r '.instance_id.value')
  only:
    - main
This test stage runs an AWS CLI command to describe the EC2 instance, using the instance ID exposed as a Terraform output.
5.6 Automating with Step Functions
You can use Apache Airflow or AWS Step Functions to build workflows; in this section we use AWS Step Functions.
AWS Step Functions is a low-code, visual workflow service used to orchestrate AWS services, automate business processes, and build serverless applications.
Key features
• Automated scaling
State machines
The workflows you build with Step Functions are called state machines, and each step of your workflow is called a state. A state machine can consist of multiple states such as Pass, Wait, Task, and so on.
You can define and deploy a state machine in several ways:
1. AWS Management Console (Workflow Studio)
2. AWS SAM
3. AWS CDK
Use cases
Function orchestration
You create a workflow that runs a group of Lambda functions (steps) in a specific order. One Lambda function’s output passes to the next Lambda function’s input. The last step in your workflow gives a result. With Step Functions, you can see how each step in your workflow interacts with the others, so you can make sure that each step performs its intended function.
Branching
Using a Choice state, you can have Step Functions make decisions based on the Choice
state’s input.
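A minimal CDK (Python) sketch of a Choice state, in the same style as the examples later in this section, is shown below; the state names and the $.score input field are assumptions, and the two target states are placeholder Pass states.

from aws_cdk import aws_stepfunctions as sfn
from constructs import Construct

# Hypothetical helper: route high scores to one branch and everything else to another.
def build_branching(scope: Construct) -> sfn.Choice:
    notify_high_score = sfn.Pass(scope, "NotifyHighScore")   # placeholder branch
    record_score = sfn.Pass(scope, "RecordScore")            # placeholder branch

    return (
        sfn.Choice(scope, "CheckScore")
        .when(sfn.Condition.number_greater_than("$.score", 1000), notify_high_score)
        .otherwise(record_score)
    )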
Human in the loop
When you apply for a credit card, your application might be reviewed by a human. Because a Step Functions workflow can run for up to one year, a state machine can wait for human approval and proceed to the next state only after approval or rejection.
Parallel Processing
Use a Parallel state when the number of branches is fixed and known up front; you cannot modify the number of branches at runtime.
E.g. a customer converts a video file into five different display resolutions, so viewers
can watch the video on multiple devices. Using a Parallel state, Step Functions inputs
the video file, so the Lambda function can process it into the five display resolutions at
the same time.
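A minimal CDK (Python) sketch of a Parallel state with a fixed set of branches is shown below; the branch contents are placeholder Pass states, and in a real workflow each branch would invoke a Lambda function or container task.

from aws_cdk import aws_stepfunctions as sfn
from constructs import Construct

# Hypothetical helper: one branch per display resolution; the set of branches is
# fixed at definition time and cannot change at runtime.
def build_transcode(scope: Construct) -> sfn.Parallel:
    parallel = sfn.Parallel(scope, "TranscodeAllResolutions")
    for resolution in ("480p", "720p", "1080p"):
        parallel.branch(sfn.Pass(scope, f"Transcode{resolution}"))   # placeholder work
    return parallel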
Dynamic parallelism
This is similar to parallel processing, except that the branches are created dynamically based on the input. For example, a customer orders three items, and you need to prepare each item for delivery: you check each item’s availability, gather each item, and package each item for delivery. Using a Map state, Step Functions has Lambda process
each of your customer’s items in parallel. Once all of your customer’s items are
packaged for delivery, Step Functions goes to the next step in your workflow, which is
to send your customer a confirmation email with tracking information.
Prerequisite
You should have adequate knowledge of AWS IAM, AWS CDK, and AWS Lambda.
We chose AWS CDK to create the state machine because we already use CDK extensively with other AWS services.
Below is an example of a state machine that invokes a Lambda function and passes “Hello World!” as input to a Succeed state. Notice that the Lambda code is written in JavaScript whereas the state machine is defined in Python: you can use the language of your choice inside Lambda to write your business logic.
hello_step_stack.py
import aws_cdk as cdk
from aws_cdk import (
    aws_lambda as lambda_,
    aws_stepfunctions as sfn,
    aws_stepfunctions_tasks as tasks,
)

class HelloStepStack(cdk.Stack):
    def __init__(self, scope: cdk.App, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Inline Node.js handler. The handler body shown here is an assumption
        # (the original handout omits it); it simply returns the greeting that is
        # passed on to the Succeed state.
        hello_function = lambda_.Function(
            self,
            "MyLambdaFunction",
            code=lambda_.Code.from_inline("""
exports.handler = async (event) => {
    return "Hello World!";
};
"""),
            runtime=lambda_.Runtime.NODEJS_12_X,
            handler="index.handler",
            timeout=cdk.Duration.seconds(25)
        )

        # Invoke the Lambda function, then end in a Succeed state
        definition = tasks.LambdaInvoke(
            self,
            "MyLambdaTask",
            lambda_function=hello_function
        ).next(sfn.Succeed(
            self, "GreetedWorld"
        ))

        state_machine = sfn.StateMachine(
            self,
            "MyStateMachineWithLambda",
            definition=definition
        )
app.py
#!/usr/bin/env python3
import aws_cdk as cdk
from hello_step_stack import HelloStepStack
app = cdk.App()
HelloStepStack(app, "HelloStepStack")
app.synth()
You can use the AWS SDK (boto3) to start an execution of the state machine.
run.py
import boto3
boto3.setup_default_session()
step_functions = boto3.client('stepfunctions')
response = step_functions.start_execution(
stateMachineArn=<stateMachineArn>,
name='first_execution',
)
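As a small follow-up sketch, continuing run.py above, the same client can poll the execution status; the executionArn comes from the start_execution response.

status = step_functions.describe_execution(
    executionArn=response['executionArn']
)['status']
print(status)   # e.g. RUNNING, SUCCEEDED, or FAILED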
As we have seen, we can integrate and run other AWS services in any state of the workflow. For example, if we want to run a huge migration that takes around 10 hours to complete, we can use AWS Fargate to run that task.
step_functions_poc_stack.py
import aws_cdk as cdk
from aws_cdk import aws_stepfunctions, aws_ecs, aws_iam, aws_stepfunctions_tasks

class StepFunctionsPocStack(cdk.Stack):
    def __init__(self, scope: cdk.App, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        cluster = aws_ecs.Cluster(self, "EcsCluster")

        start_state = aws_stepfunctions.Pass(self, "MyStartState")

        run_task = aws_stepfunctions_tasks.EcsRunTask(
            self,
            "RunFargate",
            integration_pattern=aws_stepfunctions.IntegrationPattern.RUN_JOB,
            cluster=cluster,
            task_definition=<fargate_task_definition>,
            assign_public_ip=True,
            launch_target=aws_stepfunctions_tasks.EcsFargateLaunchTarget(
                platform_version=aws_ecs.FargatePlatformVersion.LATEST
            )
        )

        definition = start_state.next(run_task)

        aws_stepfunctions.StateMachine(
            self,
            "SimpleStateMachine",
            definition=definition
        )
If you inspect the execution history, you will see that after the TaskSubmitted event the state machine waited for the task to exit.
Our customers can download historical data about their prospects. These reports contain huge amounts of data and can take a long time to generate if the work is not divided into sub-tasks. Currently, we use Celery to generate these reports, but it has a few limitations, which makes a Step Functions workflow an attractive alternative.
Conclusion
Whenever you need to run a few tasks sequentially or in parallel, or you have an ETL workflow and want to pay only for the processing you use, Step Functions is a handy choice because of its graphical representation and easy retry mechanism.
Assignments
1. Research how AWS CodePipeline, CodeBuild, and CodeDeploy can automate the
CI/CD process.
2. Design a pipeline that integrates these services for deploying a data analysis and
visualization application.
3. Write a report explaining your pipeline design and the benefits of CI/CD in this
context.
Expected Output:
A detailed report describing the CI/CD pipeline design, its components, and how it
benefits data analysis automation.
Part A – Q & A
Unit - V
PART - A Questions
1. Define data analysis. (CO5, K1)
Answer: Data analysis refers to the process of examining, transforming, and modeling data
to discover useful information and support decision-making.
2. What is data visualization? (CO5, K1)
Answer: Data visualization is the graphical representation of data using visual elements like charts, graphs, and maps to make complex data more accessible and understandable.
3. List two factors influencing the selection of data analysis tools. (CO5, K1)
Answer: Scalability and ease of integration with other systems are two factors influencing
the selection of data analysis tools.
4. Differentiate between structured and unstructured data. (CO5, K2)
Answer: Structured data is organized in a predefined format like tables, while unstructured data lacks a specific format and includes text, images, and videos.
5. What is Amazon QuickSight? (CO5, K1)
Answer: Amazon QuickSight is a cloud-based business intelligence service that helps create interactive dashboards and visualizations to analyze data.
6. What is AWS Athena? (CO5, K1)
Answer: AWS Athena is a serverless query service that allows users to analyze data stored in Amazon S3 using SQL without requiring a database.
8. List two real-time data ingestion services offered by AWS. (CO5, K1)
Answer: Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka
(MSK) are two real-time data ingestion services provided by AWS.
9. What is ETL? (CO5, K1)
Answer: ETL (Extract, Transform, Load) is a process in which data is extracted from sources, transformed into a format suitable for analysis, and loaded into a target data storage system like a database.
10. What is AWS Glue used for? (CO5, K1)
Answer: AWS Glue is a fully managed ETL service that prepares data for analytics by
automating the data integration tasks like extraction, transformation, and loading.
11. What is Amazon Redshift? (CO5, K1)
Answer: Amazon Redshift is a cloud data warehouse service designed for fast querying and analyzing large datasets.
12. Differentiate between Amazon Redshift and Amazon Athena. (CO5, K2)
Answer: Amazon Redshift is a data warehouse service used for large-scale structured data
analysis, while Amazon Athena is a serverless query service that queries data directly from
Amazon S3 using SQL.
13. What are the benefits of using AWS CodePipeline for gaming analytics?
(CO5, K2)
Answer: AWS CodePipeline automates the continuous integration and continuous delivery
(CI/CD) process, allowing faster deployment of game analytics updates, testing, and real-
time monitoring.
14. What is AWS Step Functions? (CO5, K1)
Answer: AWS Step Functions is a serverless orchestration service that helps coordinate multiple AWS services into serverless workflows for automating tasks.
15. Explain how AWS Kinesis works in data analytics. (CO5, K2)
Answer: AWS Kinesis ingests, processes, and analyzes real-time streaming data, enabling
applications to react quickly to new information, such as gaming events or sensor data.
16. What is the role of Amazon S3 in data analytics? (CO5, K1)
Answer: Amazon S3 is an object storage service that stores large volumes of data, including raw and processed data, logs, and backups, for use in analytics applications.
17. List two AWS services for automating infrastructure deployment. (CO5, K1)
Answer: AWS CloudFormation and AWS CDK (Cloud Development Kit) are two services for
automating infrastructure deployment.
18. How does Amazon QuickSight support business intelligence? (CO5, K2)
Answer: Amazon QuickSight supports business intelligence by connecting to sources such as Amazon S3, Athena, and Redshift and letting users build interactive dashboards, visualizations, and reports that can be shared across an organization.
19. What is the function of Amazon Kinesis Data Firehose? (CO5, K1)
Answer: Amazon Kinesis Data Firehose is used to capture, transform, and load streaming
data into destinations like Amazon S3, Redshift, and Elasticsearch for analytics.
20. Differentiate between AWS Lambda and Amazon EC2 for data processing. (CO5, K2)
Answer: AWS Lambda runs code serverlessly in response to events and is billed per invocation, which suits short, event-driven processing tasks, whereas Amazon EC2 provides virtual servers that you provision and manage, which suits long-running or highly customized processing workloads.
21. What is CI/CD? (CO5, K1)
Answer: CI/CD stands for Continuous Integration and Continuous Deployment, a process that automates the building, testing, and deployment of code.
22. How can AWS Step Functions automate workflows? (CO5, K2)
Answer: AWS Step Functions can automate workflows by coordinating various services and
tasks, providing retry mechanisms, and handling errors, making complex processes easier
to manage.
23. What is Amazon Redshift Spectrum? (CO5, K1)
Answer: Amazon Redshift Spectrum allows you to query data directly from Amazon S3 without moving it into Redshift, enabling large-scale analytics across both Redshift and S3-stored data.
25. List two benefits of using Amazon DynamoDB for gaming analytics. (CO5,
K1)
Answer: Amazon DynamoDB provides low-latency, scalable storage for player data and
leaderboards, and it offers automatic scaling based on demand.
Part B – Questions
Mini-Project 1: Building a Gaming Analytics Data Pipeline on AWS
Objective:
Design and implement a data pipeline that collects, processes, analyzes, and visualizes gaming
analytics data using AWS services.
Tasks:
Mini-Project 2: CI/CD Pipeline for a Gaming Analytics Application
Objective:
Design a CI/CD pipeline for deploying a gaming analytics application on AWS.
Tasks:
2. Deploy an application that processes and analyzes gaming data stored in Amazon S3.
Mini-Project 3: Evaluating and Selecting AWS Tools for Gaming Analytics
Objective:
Develop a framework to evaluate and select AWS tools for gaming analytics based on factors like
scalability, cost, latency, and ease of use.
Tasks:
1. Research and evaluate AWS tools such as Amazon Redshift, Amazon S3, Amazon EMR, AWS
Glue, and Amazon Athena.
2. Develop a scoring model to compare the tools for gaming analytics use cases.
Mini-Project 4: Automating a Data Workflow with AWS Step Functions (CO5, K4)
Objective:
Design and implement an automated data workflow using AWS Step Functions for a gaming
analytics pipeline.
Tasks:
1. Create a state machine using Step Functions to orchestrate tasks like data collection,
transformation, and visualization.
2. Integrate services like Amazon S3, AWS Glue, Amazon Athena, and Amazon QuickSight.
3. Ensure error handling and monitoring are built into the workflow.
Disclaimer:
This document is confidential and intended solely for the educational purpose of RMK Group of
Educational Institutions. If you have received this document through email in error, please notify the
system manager. This document contains proprietary information and is intended only to the
respective group / learning community as intended. If you are not the addressee you should not
disseminate, distribute or copy through e-mail. Please notify the sender immediately by e-mail if you
have received this document by mistake and delete this document from your system. If you are not
the intended recipient you are notified that disclosing, copying, distributing or taking any action in
reliance on the contents of this information is strictly prohibited.