
Please read this disclaimer before proceeding:

This document is confidential and intended solely for the educational purpose of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document
contains proprietary information and is intended only to the respective group /
learning community as intended. If you are not the addressee you should not
disseminate, distribute or copy through e-mail. Please notify the sender
immediately by e-mail if you have received this document by mistake and delete
this document from your system. If you are not the intended recipient you are
notified that disclosing, copying, distributing or taking any action in reliance on
the contents of this information is strictly prohibited.
22CS911 DATA ENGINEERING IN CLOUD
Department:
Computer Science and Engineering
Batch/Year: 2022-2026/III
Created by:
Dr. RAJALAKSHMI D ASP/CSE,RMDEC
Mrs. JASMINE GILDA A ASP/CSE,RMKEC
Mr. JEMIN V M ASP/CSE,RMKCET
Date: 18-01-2025
Table of Contents
1. Contents
2. Course Objectives
3. Pre Requisites (Course Name with Code)
4. Syllabus (With Subject Code, Name, LTPC details)
5. Course Outcomes (6)
6. CO-PO/PSO Mapping
7. Lecture Plan (S.No., Topic, No. of Periods, Proposed Date, Actual Lecture Date, pertaining CO, Taxonomy Level, Mode of Delivery)
8. Activity Based Learning
9. Lecture Notes (with links to videos, e-book references, PPTs, quizzes and other learning materials)
10. Assignments (for higher-level learning and evaluation; examples: case study, comprehensive design, etc.)
11. Part A Q & A (with K level and CO)
12. Part B Qs (with K level and CO)
13. Supportive Online Certification Courses (NPTEL, Swayam, Coursera, Udemy, etc.)
14. Real-Time Applications in Day-to-Day Life and in Industry
15. Contents Beyond the Syllabus (COE-related value-added courses)
16. Assessment Schedule (Proposed Date & Actual Date)
17. Prescribed Text Books & Reference Books
18. Mini Project


Course Objectives
• To grasp the fundamentals of data engineering, emphasizing cloud-based data access.
• To construct robust and secure data pipelines using AWS services.
• To select and implement appropriate data storage solutions while prioritizing pipeline security.
• To utilize cloud tools for handling extensive data for machine learning purposes.
• To efficiently analyze, visualize, and automate data pipelines to streamline operations.
Pre Requisites

SUBJECT CODE: 22CS201
SUBJECT NAME: DATA STRUCTURES

SUBJECT CODE: 22IT202
SUBJECT NAME: DATABASE MANAGEMENT SYSTEMS
Syllabus

22CS911 DATA ENGINEERING IN CLOUD (L T P C: 3 0 0 3)

Unit I : INTRODUCTION 8
Introduction to data Engineering - The Data Engineering Life Cycle - Data Engineering
and Data Science - Data-Driven Organizations: Data-driven decisions - The data
pipeline - The role of the data engineer in data-driven organizations - Modern data
strategies - The Elements of Data: The five Vs of data - volume, velocity, variety,
veracity, and value. Demo: Accessing and Analyzing Data by Using Amazon S3.
Unit II : SECURE AND SCALABLE DATA PIPELINES 10
The evolution of data architectures - Modern data architecture on AWS - Modern data
architecture pipeline: Ingestion and storage - Processing and consumption -
Streaming analytics pipeline - Security of analytics workloads - Scaling - Creating a
scalable infrastructure and components. ETL and ELT comparison - Data wrangling.
Unit III : STORING AND ORGANIZING DATA 9
Comparing batch and stream ingestion - Batch ingestion processing - Purpose-built
ingestion tools - AWS Glue for batch ingestion processing - Kinesis for stream
processing - Scaling considerations for batch processing and stream processing -
Storage in the modern data architecture - Data lake storage - Data warehouse storage
- Purpose-built databases - Storage in support of the pipeline - Securing storage.
Unit IV : PROCESSING BIG DATA AND DATA FOR ML 10
Big data processing concepts - Apache Hadoop - Apache Spark - Amazon EMR -
Managing your Amazon EMR clusters - Apache Hudi - The ML lifecycle - Collecting
data - Applying labels to training data with known targets - Preprocessing data -
Feature engineering - Developing a model - Deploying a model - ML infrastructure on
AWS - SageMaker - Amazon CodeWhisperer - AI/ML services on AWS.
Unit V : DATA ANALYSIS AND VISUALIZATION 8
Analyzing and Visualizing Data: Considering factors that influence tool selection -
Comparing AWS tools and services - Selecting tools for a gaming analytics use case.
Automating the Pipeline: Automating infrastructure deployment - CI/CD - Automating
with Step Functions
Course Outcomes
CO# COs K Level
CO1 Understand data engineering, pipelines & access data in the cloud. K2

CO2 Build secure & scalable data pipelines using AWS services K4

CO3 Choose the right data storage & secure your data pipelines. K3

CO4 Process big data for machine learning with cloud tools K3

CO5 Analyze & visualize data and automate data pipelines. K4

CO6 Apply best practices in data governance, compliance, and ethics throughout the data engineering process, ensuring responsible handling and usage of data. K4

Knowledge Level Description

K6 Evaluation

K5 Synthesis

K4 Analysis

K3 Application

K2 Comprehension

K1 Knowledge
CO – PO/PSO Mapping Matrix

CO#  PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12  PSO1 PSO2 PSO3
CO1   3   3   2   1   1   -   -   -   2   -    -    2     3    2    -
CO2   3   2   2   2   2   -   -   -   2   -    -    2     3    2    -
CO3   3   3   2   2   2   -   -   -   2   -    -    2     3    2    -
CO4   3   2   2   2   2   -   -   -   2   -    -    2     3    3    -
CO5   3   2   2   2   2   -   -   -   2   -    -    2     3    2    -
CO6   2   2   1   1   1   -   -   -   2   -    -    2     3    2    -
Lecture Plan – Unit 5
Sl.No. | Topic | Periods | Proposed Date | Actual Lecture Date | CO | Taxonomy Level | Mode of Delivery
1 | Analyzing and Visualizing Data | 1 | | | CO5 | K2 | Chalk and Board
2 | Considering factors that influence tool selection | 1 | | | CO5 | K2 | Chalk and Board
3 | Comparing AWS tools and services | 1 | | | CO5 | K2 | PPT and Video
4 | Selecting tools for a gaming analytics use case | 1 | | | CO5 | K3 | PPT and Video
5 | Automating the Pipeline | 1 | | | CO5 | K3 | PPT and Video
6 | Automating infrastructure deployment | 1 | | | CO5 | K3 | PPT and Video
7 | CI/CD | 1 | | | CO5 | K2 | Chalk and Board
8 | Automating with Step Functions | 1 | | | CO5 | K3 | PPT and Video
Activity Based Learning – Unit V
Data Pipeline Relay Challenge
Objective:
Students will build a cloud data pipeline step by step, using teamwork to complete
tasks at different stations.
Instructions for the Activity
1. Form Teams:
o Get into teams of 5-6 members.
2. Move Through Stations:
o There are 5 stations set up around the room, each representing a step in
the data pipeline:
 Storage: Organize data (like in Amazon S3).
 Data Transformation: Clean and prepare data (like using AWS Glue).
 Query and Analysis: Write queries to find insights (like with Amazon
Athena).
 Visualization: Create charts to show trends (like with Amazon
QuickSight).
 Automation: Design a flowchart to connect all the steps (like with Step
Functions).
3. Complete Tasks:
o At each station, complete the given task before moving to the next.
o Work together as a team to stay on track.
4. Scoring:
o Teams will be scored on:
 How well the tasks are completed.
 How creatively the solution is presented.
 How effectively the team works together.
5. Once all stations are done, your team will showcase your pipeline to the class.
Lecture Notes – Unit 5
Unit V : DATA ANALYSIS AND VISUALIZATION 8

Analyzing and Visualizing Data: Considering factors that influence tool selection -
Comparing AWS tools and services - Selecting tools for a gaming analytics use case.
Automating the Pipeline: Automating infrastructure deployment - CI/CD -
Automating with Step Functions

5.1 Analyzing and Visualizing Data: Considering factors that influence tool selection

Data visualization is a crucial aspect of data analysis. It lets you represent data in a graphical format that is easy for people to understand and interpret. With the continuing growth of big data, the demand for data visualization tools has risen sharply.

Why is Data Visualization Important?

Data visualization is essential because it helps communicate and represent complex data relationships in easy-to-understand formats, such as graphs, charts, plots, or even animations. Notably, data visualization effectively reveals trends and patterns that may not be evident in the raw data.

Data visualization has broad applicability. It is not limited to analysts; it is equally essential for marketers, organizations, and business leaders who need to draw crucial insights from their data to take their business to the next level.

Factors to Consider While Choosing Data Visualization Tools:

1. Cost of the Data Visualization Tool

Cost is often the deciding factor. Different tools have different pricing models, so analyzing their cost-effectiveness is essential when choosing a data visualization tool.

2. Data Source Compatibility

This is an essential factor to consider when choosing a data visualization tool: it must be compatible with your data sources. Some tools work best only with specific data formats, while others have limits on the volume of data they can handle.

Before choosing a data visualization tool, ensure it is compatible with both your data sources and your data size.
3. Platform Licensing

Licensing models differ between tools. Some vendors offer free products with limited features and ask users to upgrade for enhanced functionality. Paid tools are typically priced per subscription, sometimes based on the number of users and sometimes as a flat monthly fee.

4. Ease of Use

When choosing a data visualization tool, opt for one that is user-friendly; this will help you get better insights without hassle. Also evaluate the skill set of your team, and look for tools that offer an easy-to-use interface and strong features that make it easy to represent the data.

5. Purpose and Audience of Your Visualization

The next thing to consider is the purpose and audience of your visualization. Different goals and audiences need different types of visualizations. For example, if you want to educate your audience about a specific topic, you need a visualization that is informative and delivers a clear message.

6. Features and Functionality

Evaluating the features and functionalities of data visualization tools is essential. These
features include data import, manipulation, analysis, sharing, and more.

5.2 Comparing AWS tools and services


AWS (Amazon Web Services) offers a wide array of tools and services across multiple
domains, including computing, storage, database, machine learning, analytics, and
security. Here's a comparison of some key AWS tools and services based on their
categories:
1. Compute Services
• EC2 (Elastic Compute Cloud): Provides resizable compute capacity in the cloud,
offering virtual servers for running applications.
• Best for: Running custom applications, full control over compute resources.
• Alternatives: AWS Lambda (for serverless architecture), AWS Fargate (for
container management).
• AWS Lambda: A serverless compute service that runs code in response to events
without provisioning or managing servers.
• Best for: Event-driven workloads, reducing infrastructure management.
• Alternatives: EC2 (for more control and custom configurations), AWS ECS/Fargate
(for container-based workloads).
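As a quick sketch of the event-driven model described above, the following minimal Python Lambda handler processes an S3 "ObjectCreated" notification; the event fields follow the standard S3 event structure, while the function itself and any business logic are purely illustrative.

import json

def handler(event, context):
    # Lambda calls this function once per event; here we assume an S3
    # "ObjectCreated" notification and report which objects arrived.
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        processed.append(f"s3://{bucket}/{key}")
    return {
        "statusCode": 200,
        "body": json.dumps({"processed_objects": processed}),
    }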
2. Storage Services
• S3 (Simple Storage Service): Object storage service that offers scalability,
availability, and security for any amount of data.
• Best for: Storing large datasets, backup, disaster recovery.
• Alternatives: Amazon EFS (Elastic File System) for file storage, Amazon Glacier
for archival storage.
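To make the object-storage model concrete, here is a hedged boto3 sketch that writes a small log object to S3 and reads it back; the bucket and key names are placeholders, and the bucket is assumed to already exist.

import boto3

s3 = boto3.client("s3")
bucket = "example-analytics-bucket"          # placeholder; assumed to exist
key = "raw/game-logs/2025-01-18.json"        # placeholder object key

# Upload a small object, then read it back
s3.put_object(Bucket=bucket, Key=key, Body=b'{"player_id": 42, "score": 1300}')
response = s3.get_object(Bucket=bucket, Key=key)
print(response["Body"].read().decode("utf-8"))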
• Amazon EBS (Elastic Block Store): Block storage service for use with EC2,
providing persistent storage for VMs.
• Best for: Applications requiring high-performance storage, such as databases.
• Alternatives: Amazon S3 (for object storage), Amazon EFS (for file-based
storage).
3. Database Services
• Amazon RDS (Relational Database Service): Fully managed relational database
service supporting engines like MySQL, PostgreSQL, and Oracle.
• Best for: Managed relational databases with less overhead.
• Alternatives: Amazon Aurora (for high-performance databases), Amazon
DynamoDB (for NoSQL).
• Amazon DynamoDB: A fully managed NoSQL database service for fast and flexible
data storage.
• Best for: Key-value and document-based applications, real-time processing.
• Alternatives: Amazon RDS (for structured, relational data), Amazon Neptune
(for graph databases).
4. Machine Learning Services
• Amazon SageMaker: A fully managed service for building, training, and deploying
machine learning models at scale.
• Best for: End-to-end machine learning development, from experimentation to
deployment.
• Alternatives: AWS Lambda (for lightweight inference), EC2 with custom ML
environments.
• AWS Rekognition: A service for image and video analysis using machine learning.
• Best for: Applications needing image recognition, facial analysis, or object
detection.
• Alternatives: Amazon Textract (for text extraction), Amazon Polly (for text-to-
speech).
5. Analytics Services
• Amazon Redshift: A fully managed data warehouse service for fast querying of large
datasets.
• Best for: Large-scale data analysis, business intelligence, and complex queries.
• Alternatives: Amazon RDS (for transactional databases), Amazon EMR (for big
data processing).
• AWS Glue: A fully managed ETL (extract, transform, load) service for preparing data
for analytics.
• Best for: Data preparation for analytics, ETL pipelines.
• Alternatives: AWS Lambda (for simple ETL tasks), Amazon EMR (for big data
processing).
6. Security Services
• AWS Identity and Access Management (IAM): Service for securely managing
access to AWS services and resources.
• Best for: Fine-grained access control, authentication, and authorization.
• Alternatives: AWS Organizations (for multi-account access), AWS KMS (for key
management).
• AWS Shield: Managed Distributed Denial of Service (DDoS) protection service.
• Best for: Mitigating DDoS attacks on applications running on AWS.
• Alternatives: AWS WAF (Web Application Firewall for Layer 7 protection).
7. Networking Services

• Amazon VPC (Virtual Private Cloud): Provides isolated cloud resources for
setting up custom network configurations.

• Best for: Building secure and customizable cloud-based networks.

• Alternatives: AWS Transit Gateway (for connecting multiple VPCs), AWS Direct
Connect (for dedicated network connection).

• Amazon Route 53: A scalable domain name system (DNS) web service.

• Best for: Domain registration, DNS routing, and health checking.

• Alternatives: Amazon CloudFront (for content delivery), AWS Elastic Load


Balancing (for distributing traffic).

8. Content Delivery Services

• Amazon CloudFront: A global content delivery network (CDN) for distributing


content with low latency.

• Best for: Delivering static and dynamic web content globally with low latency.

• Alternatives: AWS Global Accelerator (for global traffic routing), Route 53 (for
DNS-based routing).

9. Developer Tools

• AWS CodePipeline: A continuous integration and delivery service for fast and
reliable application updates.

• Best for: Automating release pipelines.

• Alternatives: AWS CodeDeploy (for automating code deployments), AWS


CodeBuild (for building and testing code).

• AWS CloudFormation: Provides a common language for modeling and


provisioning AWS resources in an automated and secure manner.

• Best for: Infrastructure as code (IaC), managing and deploying resources.

• Alternatives: AWS CDK (Cloud Development Kit for higher-level programming


constructs), Terraform (third-party IaC tool).

5.3 Selecting tools for a gaming analytics use case


For a gaming analytics use case, you will need a set of AWS tools and services that can handle
real-time data ingestion, storage, processing, and analysis of large datasets. Additionally, you
might require machine learning tools for predictive analysis, player behavior insights, and
personalization. Here’s a breakdown of AWS services that would be suitable for gaming analytics:

1. Data Ingestion and Streaming

Amazon Kinesis Data Streams: A real-time streaming data service for capturing player events
such as gameplay metrics, in-game actions, and transactions.

• Why use it? It enables you to ingest large volumes of data in real-time, which is crucial for
gaming environments that generate high-frequency events.

• Alternatives: Amazon Managed Streaming for Apache Kafka (MSK) for event streaming,
especially if you are already using Kafka in your stack.
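For illustration, a game backend (or client-facing service) can push individual player events into a Kinesis data stream with boto3's put_record call; the stream name and event fields below are hypothetical.

import json
import boto3

kinesis = boto3.client("kinesis")

event = {
    "player_id": "player-123",
    "event_type": "level_completed",
    "level": 7,
    "score": 4200,
}

# The partition key determines the shard; using the player ID keeps one
# player's events ordered within a shard.
kinesis.put_record(
    StreamName="game-events-stream",              # hypothetical stream name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["player_id"],
)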

Amazon Kinesis Data Firehose: Delivers real-time streaming data to destinations such as Amazon S3, Redshift, and Elasticsearch with automatic scaling.

• Why use it? It simplifies the process of capturing, transforming, and loading data into analytics
services without managing infrastructure.

2. Data Storage

Amazon S3 (Simple Storage Service): Ideal for storing raw data, logs, and processed
analytics data. S3 provides high availability and durability, and it integrates well with other AWS
analytics tools.

• Why use it? Its scalability, low cost, and integration with analytics tools make it ideal for
storing the large datasets typically generated by gaming events.

Amazon Redshift: A data warehouse service designed for high-performance queries on


massive datasets, ideal for analyzing gameplay data, player statistics, and in-game economy
trends.

• Why use it? It supports fast querying and scaling of data warehousing operations, which is
ideal for running complex analytics across large data sets.

• Alternatives: Amazon RDS (for smaller-scale relational database analytics).

Amazon DynamoDB: A fully managed NoSQL database service, suitable for low-latency data
storage related to player profiles, leaderboards, and in-game transactions.

• Why use it? DynamoDB’s high availability and low-latency performance are ideal for handling
player-related data that require quick lookups.
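As a sketch of the low-latency lookups mentioned above, the hedged boto3 example below writes and reads a leaderboard entry; the table name and attributes are hypothetical, and the table (with player_id as its partition key) is assumed to already exist.

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Leaderboard")   # hypothetical table, partition key: player_id

# Upsert a player's best score
table.put_item(Item={"player_id": "player-123", "best_score": 9800, "tier": "gold"})

# Low-latency point read by partition key
item = table.get_item(Key={"player_id": "player-123"}).get("Item")
print(item)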
3. Data Processing

AWS Glue: A fully managed ETL service to prepare, clean, and transform raw
gameplay data for analytics and reporting.

• Why use it? Glue simplifies data preparation workflows by automating the ETL
process, which is critical when dealing with large datasets.

• Alternatives: Amazon EMR (for big data processing using frameworks like Hadoop or
Spark).
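The sketch below shows the general shape of a minimal Glue PySpark job that reads a catalogued table of raw game events, drops records without a player ID, and writes the result back to S3 as Parquet; the database, table, and path names are hypothetical placeholders, not part of the course material.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read raw events previously crawled into the Glue Data Catalog (hypothetical names)
raw_events = glue_context.create_dynamic_frame.from_catalog(
    database="game_analytics", table_name="raw_events"
)

# Simple cleaning step: keep only records that carry a player_id
cleaned = raw_events.filter(f=lambda rec: rec["player_id"] is not None)

# Write the cleaned data back to S3 as Parquet, ready for Athena queries
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://example-analytics-bucket/cleaned/"},
    format="parquet",
)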

Amazon Athena: An interactive query service that makes it easy to analyze data in
Amazon S3 using standard SQL.

• Why use it? Athena is serverless, so it’s great for running ad-hoc queries on your
gaming data stored in S3 without the need for a complex data warehouse.
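As an illustration, the hedged boto3 sketch below submits an ad-hoc Athena query over game events stored in S3; the database, table, and output location are placeholders.

import boto3

athena = boto3.client("athena")

# Ad-hoc SQL over data in S3 (hypothetical database/table names)
response = athena.start_query_execution(
    QueryString="""
        SELECT player_id, COUNT(*) AS sessions
        FROM game_events
        WHERE event_type = 'session_start'
        GROUP BY player_id
        ORDER BY sessions DESC
        LIMIT 10
    """,
    QueryExecutionContext={"Database": "game_analytics"},
    ResultConfiguration={"OutputLocation": "s3://example-analytics-bucket/athena-results/"},
)
print("Query execution id:", response["QueryExecutionId"])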

Amazon EMR (Elastic MapReduce): A big data platform for running large-scale
distributed data processing tasks, such as processing game logs, in-game analytics,
and behavior modeling.

• Why use it? Ideal for complex, large-scale data processing tasks that require
Hadoop, Spark, or Presto.

4. Analytics and Insights

Amazon QuickSight: A business intelligence tool that helps you visualize and analyze
data to gain insights on game performance, player retention, and in-game behavior.

• Why use it? It allows you to create interactive dashboards and reports that can be
shared with game development teams and stakeholders.

• Alternatives: Tableau on AWS (for more advanced analytics), Power BI (if you are in
a Microsoft ecosystem).

Amazon Redshift Spectrum: Allows you to query data directly in Amazon S3 using
Redshift. It enables you to analyze data stored in your data lake along with data in
Redshift without moving it.

• Why use it? It provides powerful analytics without needing to load all your data into
Redshift, reducing cost and complexity.
5. Machine Learning and AI

Amazon SageMaker: For building, training, and deploying machine learning models
to predict player behavior, churn rate, or recommend personalized in-game content.

• Why use it? SageMaker offers built-in algorithms and auto-scaling infrastructure,
enabling you to run large-scale models quickly. Use it to predict player lifetime value
or optimize in-game monetization strategies.

• Alternatives: AWS Lambda (for deploying pre-trained models for quick predictions on
real-time data streams).

Amazon Personalize: A fully managed service that provides real-time personalization


and recommendation for in-game offers, items, or upgrades.

• Why use it? It’s specifically designed for personalized experiences, so you can use it
to provide targeted recommendations to players based on their actions and
preferences.

Amazon Rekognition: A computer vision service that could be used for player
emotion detection through in-game avatars or analyzing game screenshots/videos.

• Why use it? Great for adding AI-driven features to games that need image or video
analysis for gameplay improvement.

6. Monitoring and Logging

Amazon CloudWatch: To monitor your game servers, track performance metrics, and
set up automated alarms for service health or player activity spikes.

• Why use it? CloudWatch is great for getting real-time operational insights and
ensuring your game servers are running optimally.

• Alternatives: AWS X-Ray (for debugging and tracing application performance).
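For instance, a game backend can publish a custom metric (here, a hypothetical ConcurrentPlayers count in an illustrative namespace) to CloudWatch, which can then drive dashboards and alarms:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish one data point for a custom metric; namespace, metric name, and
# dimension values are illustrative.
cloudwatch.put_metric_data(
    Namespace="MyGame/Servers",
    MetricData=[
        {
            "MetricName": "ConcurrentPlayers",
            "Dimensions": [{"Name": "Region", "Value": "eu-west-1"}],
            "Value": 1523,
            "Unit": "Count",
        }
    ],
)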

AWS CloudTrail: To monitor and log all activities and API calls within your AWS
environment, ensuring you have full visibility into what’s happening within your gaming
infrastructure.

• Why use it? It helps ensure security and compliance by logging all interactions with
AWS services.
7. Scaling and Load Balancing

Amazon GameLift: A dedicated service for deploying, operating, and scaling


multiplayer game servers in the cloud.

• Why use it? GameLift automates server scaling based on player demand, ensuring
optimal performance and reducing costs.

• Alternatives: EC2 Auto Scaling (for custom server management and scaling).

AWS Elastic Load Balancing (ELB): Distributes incoming traffic across multiple
targets, ensuring your gaming servers are highly available.

• Why use it? Load balancing is critical in preventing downtime or performance issues
during heavy player traffic spikes.

8. Security

AWS Shield: Protects your gaming servers from Distributed Denial of Service (DDoS)
attacks.

• Why use it? Games are often targeted by DDoS attacks, and Shield helps mitigate
those risks.

• Alternatives: AWS WAF (Web Application Firewall) for more granular protection of
web-based game portals.

AWS Cognito: Provides user authentication, authorization, and user management for
player accounts.

• Why use it? Cognito is useful for managing player sign-ups, log-ins, and access
controls within your game.
5.4 Automating the Pipeline: Automating infrastructure
deployment
Streamlining deployment pipelines and automation is crucial for efficient software development. It involves automating the steps required to build, test, and deploy software applications, reducing manual effort and minimizing the risk of errors. In this section, we will explore the benefits of streamlining deployment pipelines and automation, along with best practices to optimize the software delivery process.

The Importance of Streamlining Deployment Pipelines and Automation

Streamlining deployment pipelines and automation offers several advantages for


software development teams, including:

Efficiency: Automating repetitive tasks, such as building and deploying software,


significantly reduces the time and effort required for manual execution. It eliminates
the need for manual intervention, enabling developers to focus on more critical aspects
of the development process.

Consistency: Automated deployment pipelines ensure consistent and reliable software


delivery. By removing human error from the equation, teams can achieve a higher level
of consistency in their deployments, leading to more stable and predictable software
releases.

Speed and Agility: With streamlined deployment pipelines, software changes can be
deployed rapidly and frequently. This allows teams to adopt agile development
practices, respond quickly to customer needs, and deliver new features and updates in
a timely manner.
Best Practices for Streamlining Deployment Pipelines and Automation

To optimize deployment pipelines and automation, consider the following best


practices:

1. Infrastructure as Code

Adopt infrastructure as code (IaC) practices to define and manage your deployment
infrastructure. Tools like Terraform or AWS CloudFormation allow you to codify your
infrastructure, enabling consistent and reproducible deployments across different
environments.

2. Continuous Integration and Continuous Deployment

Implement continuous integration (CI) and continuous deployment (CD) processes to


automate the build, test, and deployment stages of your software. CI/CD pipelines
enable developers to automatically build, test, and deploy changes, ensuring that code
is thoroughly tested and ready for production.

3. Automated Testing

Integrate automated testing into your deployment pipeline to validate the functionality
and quality of your software. Automated tests, such as unit tests, integration tests, and
end-to-end tests, help identify issues early in the development process, preventing
regressions and improving overall software quality.

4. Monitoring and Alerting

Implement robust monitoring and alerting systems to track the performance and health
of your deployed applications. Real-time monitoring helps identify and address issues
promptly, minimizing downtime and ensuring a positive user experience.

5.5 Building a CI/CD Pipeline for Infrastructure Automation
As organizations adopt cloud-native architectures and DevOps practices, the need for
efficient infrastructure automation becomes increasingly important. A key aspect of this
is building a Continuous Integration/Continuous Deployment (CI/CD) pipeline that
automates the provisioning, deployment, and management of infrastructure resources.
In this section, we will explore the technical details of building a CI/CD pipeline for infrastructure automation, focusing on the tools and technologies used in the process.

Infrastructure as Code (IaC)

The first step in building a CI/CD pipeline for infrastructure automation is to define
infrastructure as code (IaC). This involves using configuration files or templates to
define the desired state of infrastructure resources, such as virtual machines, networks,
and storage. Popular IaC tools include Terraform, AWS CloudFormation, and Azure
Resource Manager (ARM).

For example, using Terraform, we can define a simple infrastructure configuration file
(main.tf) as follows:

provider "aws" {

region = "us-west-2"

resource "aws_instance" "example" {

ami = "ami-abc123"

instance_type = "t2.micro"

This code defines an AWS provider and a single EC2 instance resource with a specific
AMI and instance type.
CI/CD Pipeline Tools

Once we have defined our IaC, we need to choose a CI/CD pipeline tool to automate
the deployment and management of our infrastructure. Popular options include
Jenkins, GitLab CI/CD, and CircleCI. For this example, we will use GitLab CI/CD.

GitLab CI/CD Pipeline Configuration

To configure a GitLab CI/CD pipeline, we create a .gitlab-ci.yml file in the root of our
repository. This file defines the stages, jobs, and scripts that make up our pipeline.

Here is an example .gitlab-ci.yml file:


stages:
  - deploy

deploy:
  stage: deploy
  script:
    - terraform init
    - terraform apply -auto-approve
  only:
    - main

This pipeline has a single stage (deploy) and a single job (deploy). The job runs two
scripts: terraform init to initialize the Terraform working directory, and terraform apply -
auto-approve to apply the infrastructure configuration.

Integration with Version Control

To integrate our CI/CD pipeline with version control, we configure the pipeline to run on changes to our IaC configuration files. GitLab CI/CD runs pipelines automatically on pushes; the only: - main keyword in the job above restricts it to the main branch, and the same effect can be achieved at the pipeline level with a workflow section in .gitlab-ci.yml:

workflow:
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'

This configuration tells GitLab CI/CD to create the pipeline only for changes on the main branch of our repository.

Testing and Validation

To ensure that our infrastructure is deployed correctly, we need to include testing and
validation steps in our pipeline. This can involve running scripts to verify the state of
our infrastructure resources, such as checking the status of EC2 instances or validating
network configurations.

For example, we can add a test stage to our pipeline to run a script that verifies the
status of our EC2 instance:
stages:
  - deploy
  - test

deploy:
  stage: deploy
  script:
    - terraform init
    - terraform apply -auto-approve
  only:
    - main

test:
  stage: test
  script:
    - aws ec2 describe-instances --instance-ids $(terraform output -raw instance_id)
  only:
    - main

This test stage runs an AWS CLI command to describe the EC2 instance, using the instance ID exposed by Terraform (this assumes the configuration also declares an output named instance_id, for example output "instance_id" { value = aws_instance.example.id }).

Conclusion

Building a CI/CD pipeline for infrastructure automation is a critical step in adopting


DevOps practices and achieving efficient platform engineering. By using tools like
Terraform and GitLab CI/CD, we can automate the provisioning, deployment, and
management of infrastructure resources, ensuring consistency, reliability, and efficiency
across our platform.
5.6 Automating workflows using AWS Step Functions
A workflow is a sequence of tasks that processes a set of data; you can think of a workflow as the path that takes tasks from not started to done. Workflows manage failures, retries, parallelization, service integrations, and observability so developers can focus on higher-value business logic.

You can build workflows with tools such as Apache Airflow or AWS Step Functions; this section focuses on AWS Step Functions.

What are AWS Step Functions?

AWS Step Functions is a low-code, visual workflow service used to orchestrate AWS services, automate business processes, and build serverless applications.

Key features

• Built-in retry mechanism

• Multiple AWS services can be used in a state machine

• Supports long-running tasks (for up to one year)

• Automated scaling

• Pay per use

State machines

The workflows you build with Step Functions are called state machines, and each step
of your workflow is called a state.

How to define state machines?

State machines are defined using the JSON-based Amazon States Language (ASL).

You can define state machines in several ways:

1. Step Functions’ graphical console

2. AWS SAM

3. AWS CDK
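For reference, a minimal Amazon States Language definition has the JSON shape shown below; this hedged sketch registers it directly through the AWS SDK (boto3) just to illustrate the format, and the role ARN is a placeholder that must point to an existing IAM role that Step Functions can assume.

import json
import boto3

# A single Pass state that injects a greeting and ends the execution
definition = {
    "Comment": "Minimal example state machine",
    "StartAt": "SayHello",
    "States": {
        "SayHello": {
            "Type": "Pass",
            "Result": "Hello World!",
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="MinimalHelloStateMachine",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsExecutionRole",  # placeholder
)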

A state machine can consist of multiple states like pass, wait, task, etc.
Use cases

Sequential execution of tasks

You create a workflow that runs a group of Lambda functions (steps) in a specific order.
One Lambda function’s output passes to the next Lambda function’s input. The last
step in your workflow gives a result. With Step Functions, you can see how each step
in your workflow interacts with the other, so you can make sure that each step
performs its intended function.

Function orchestration
Branching

Using a Choice state, you can have Step Functions make decisions based on the Choice
state’s input.
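As a hedged CDK sketch of branching (mirroring the CDK examples later in this unit), the Choice state below routes a hypothetical credit application based on a creditScore field in its input; all names and the threshold are illustrative.

import aws_cdk as cdk
from aws_cdk import aws_stepfunctions as sfn


class BranchingStack(cdk.Stack):

    def __init__(self, scope: cdk.App, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        approve = sfn.Pass(self, "ApproveApplication")
        reject = sfn.Pass(self, "RejectApplication")

        # The Choice state inspects its input and routes to one branch
        decision = (
            sfn.Choice(self, "CreditScoreCheck")
            .when(sfn.Condition.number_greater_than("$.creditScore", 700), approve)
            .otherwise(reject)
        )

        sfn.StateMachine(self, "BranchingStateMachine", definition=decision)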
Human in the loop

When you apply for a credit card, your application might be reviewed by a human. Because Step Functions executions can run for up to one year, a state machine can wait for a human review and proceed to the next state only after approval or rejection.

Parallel Processing
Use a Parallel state when the number of branches is fixed and known in advance; you cannot modify the number of branches at runtime.

For example, a customer converts a video file into five different display resolutions, so viewers can watch the video on multiple devices. Using a Parallel state, Step Functions passes in the video file so that Lambda functions can process it into the five display resolutions at the same time.

Dynamic parallelism

This is similar to parallel processing, except that branches are created dynamically based on the input. For example, a customer orders three items, and you need to prepare each item for delivery: check each item's availability, gather each item, and package each item for delivery. Using a Map state, Step Functions has a Lambda function process each of the customer's items in parallel. Once all of the items are packaged for delivery, Step Functions moves to the next step in your workflow, which is to send the customer a confirmation email with tracking information.

Creating a state machine

Prerequisite

You should have adequate knowledge of AWS IAM, AWS CDK, and AWS Lambda.

We chose AWS CDK to create the state machine because we already use CDK extensively with other AWS services.

Below is an example of a state machine that invokes a Lambda function and passes "Hello World!" as input to a Succeed state. Notice that the Lambda code is written in JavaScript whereas the state machine is defined in Python; you can use the language of your choice inside Lambda to write your business logic.

hello_step_stack.py

import aws_cdk as cdk
from aws_cdk import aws_lambda as lambda_
from aws_cdk import aws_stepfunctions as sfn
from aws_cdk import aws_stepfunctions_tasks as tasks


class HelloStepStack(cdk.Stack):

    def __init__(self, scope: cdk.App, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Lambda function with inline Node.js code that returns "Hello World!"
        hello_function = lambda_.Function(
            self,
            "MyLambdaFunction",
            code=lambda_.Code.from_inline("""
                exports.handler = (event, context, callback) => {
                    callback(null, "Hello World!");
                }"""),
            runtime=lambda_.Runtime.NODEJS_12_X,
            handler="index.handler",
            timeout=cdk.Duration.seconds(25),
        )

        # Invoke the Lambda function, then transition to a Succeed state
        definition = tasks.LambdaInvoke(
            self,
            "MyLambdaTask",
            lambda_function=hello_function,
        ).next(sfn.Succeed(self, "GreetedWorld"))

        state_machine = sfn.StateMachine(
            self,
            "MyStateMachineWithLambda",
            definition=definition,
        )

app.py
#!/usr/bin/env python3
import aws_cdk as cdk
from hello_step_stack import HelloStepStack
app = cdk.App()
HelloStepStack(app, "HelloStepStack")
app.synth()
You can use the AWS SDK (boto3) to start an execution of the state machine.

run.py
import boto3
boto3.setup_default_session()
step_functions = boto3.client('stepfunctions')
response = step_functions.start_execution(
stateMachineArn=<stateMachineArn>,
name='first_execution',
)

How it works at Unibuddy

1. We did a proof of concept to check whether we could use Step Functions for long-running tasks.

As we already know we can integrate and run other AWS services in any state of the
workflow. For example, if we want to run a huge migration that takes around 10 hours
to complete, we can use AWS Fargate to run this task.

step_functions_poc_stack.py
import aws_cdk as cdk
from aws_cdk import (
    aws_ecs,
    aws_iam,
    aws_stepfunctions,
    aws_stepfunctions_tasks,
)


class StepFunctionsPocStack(cdk.Stack):

    def __init__(self, scope: cdk.App, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        cluster = aws_ecs.Cluster(self, "EcsCluster")

        start_state = aws_stepfunctions.Pass(self, "MyStartState")

        # Run a Fargate task and wait for it to finish (RUN_JOB integration)
        run_task = aws_stepfunctions_tasks.EcsRunTask(
            self,
            "RunFargate",
            integration_pattern=aws_stepfunctions.IntegrationPattern.RUN_JOB,
            cluster=cluster,
            task_definition=<fargate_task_definition>,
            assign_public_ip=True,
            launch_target=aws_stepfunctions_tasks.EcsFargateLaunchTarget(
                platform_version=aws_ecs.FargatePlatformVersion.LATEST
            ),
        )

        definition = start_state.next(run_task)

        aws_stepfunctions.StateMachine(
            self,
            "SimpleStateMachine",
            definition=definition,
        )

As you can see in the execution history, after the TaskSubmitted event, the state machine
waited for the task to exit.

2. Breaking a large task into smaller tasks using the Map state

Our customers can download historic data of their prospects. These reports contain huge amounts of data and may take a long time to generate if not divided into sub-tasks.

Currently, we use Celery to generate these reports, but there are a few problems with it:

1. We are unable to track the state of individual tasks.

2. Long-running tasks have the potential to get killed.

3. There is no retry mechanism for individual tasks.

By using the Step Functions Map state, we can dynamically break a large task into smaller tasks based on the date-range input. Each instance of GenerateSubReport creates a report for a smaller date range and passes its output to MergeSubReports. Once all the subtasks have finished, we merge their output in the MergeSubReports Lambda function, and in the next step we notify the user.
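A hedged CDK sketch of this fan-out/fan-in shape is shown below; the Lambda runtimes, handlers, asset path, and the date_ranges input field are placeholders rather than the exact implementation described above.

import aws_cdk as cdk
from aws_cdk import aws_lambda as lambda_
from aws_cdk import aws_stepfunctions as sfn
from aws_cdk import aws_stepfunctions_tasks as tasks


class ReportFanOutStack(cdk.Stack):

    def __init__(self, scope: cdk.App, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Hypothetical Lambda functions; the handler code would live in ./lambda
        generate_sub_report = lambda_.Function(
            self, "GenerateSubReport",
            runtime=lambda_.Runtime.PYTHON_3_9,
            handler="generate.handler",
            code=lambda_.Code.from_asset("lambda"),
        )
        merge_sub_reports = lambda_.Function(
            self, "MergeSubReports",
            runtime=lambda_.Runtime.PYTHON_3_9,
            handler="merge.handler",
            code=lambda_.Code.from_asset("lambda"),
        )

        # Map state: run GenerateSubReport once per date range in the input array
        fan_out = sfn.Map(self, "GenerateReports", items_path="$.date_ranges")
        fan_out.iterator(tasks.LambdaInvoke(
            self, "GenerateSubReportTask", lambda_function=generate_sub_report
        ))

        # Fan-in: merge the sub-reports once every branch has finished
        definition = fan_out.next(tasks.LambdaInvoke(
            self, "MergeSubReportsTask", lambda_function=merge_sub_reports
        ))

        sfn.StateMachine(self, "ReportStateMachine", definition=definition)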

Conclusion

Whenever you need to perform tasks sequentially or in parallel, or you have an ETL workflow and want to pay only for the processing, Step Functions are a handy choice because of their graphical representation and built-in retry mechanism.
Assignments

Assignment 1 (CO5, K1 - Remembering)


Question: Exploring AWS Data Analysis Tools
Task:
List five AWS services commonly used for analyzing and visualizing data.
Instructions:
1.Use the AWS documentation or website to research data analysis and visualization
services.
2.Write a brief description of each service, including its primary use case.
Expected Output:
A list of five AWS services with short descriptions (e.g., Amazon QuickSight, AWS Glue,
Amazon Athena, etc.).

Assignment 2 (CO5, K2 - Understanding)


Question: Comparing AWS Tools for Gaming Analytics
Task:
Explain how AWS services like Amazon Athena, AWS Glue, and Amazon QuickSight can
be used for a gaming analytics use case.
Instructions:
1.Research the role of each tool in analyzing data.
2.Write a short explanation of how these services could work together to analyze player
activity and generate insights.
Expected Output:
A description of each service’s role and how they integrate in a gaming analytics use
case.
Assignment 3 (CO5, K3 - Applying)
Question: Setting Up an Automated Pipeline with Step Functions
Task:
Create a basic workflow design for automating a data pipeline using AWS Step Functions.
Instructions:
1.Identify the steps in a simple data pipeline (e.g., data collection, transformation, storage,
and visualization).
2.Create a flowchart showing how AWS Step Functions can automate the process using
services like Amazon S3, AWS Glue, and Amazon QuickSight.
Expected Output:
A flowchart showing the steps and integration of AWS services in the automated pipeline.

Assignment 4 (CO5, K4 - Analyzing)


Question: Tool Selection for a Gaming Analytics Pipeline
Task:
Analyze the factors that influence the choice of AWS tools for building a gaming analytics
pipeline.
Instructions:
1.Consider factors such as scalability, cost, latency, and ease of use.
2.Research and compare at least three AWS services that could be used in the pipeline.
3.Write a detailed report analyzing the pros and cons of each service for gaming analytics.
Expected Output:
A report with a comparison of services, considering various factors and a recommendation for
the best toolset.
Assignment 5 (CO5, K5 - Evaluating)

Question: Designing a CI/CD Pipeline for Data Analysis


Task:
Evaluate and design a CI/CD pipeline for automating infrastructure deployment for a data
analysis application.
Instructions:

1. Research how AWS CodePipeline, CodeBuild, and CodeDeploy can automate the
CI/CD process.

2. Design a pipeline that integrates these services for deploying a data analysis and
visualization application.

3. Write a report explaining your pipeline design and the benefits of CI/CD in this
context.
Expected Output:
A detailed report describing the CI/CD pipeline design, its components, and how it
benefits data analysis automation.
Part A – Q & A (Unit V)
1. Define data analysis. (CO5, K1)

Answer: Data analysis refers to the process of examining, transforming, and modeling data
to discover useful information and support decision-making.

2. What is data visualization? (CO5, K1)

Answer: Data visualization is the graphical representation of data using visual elements like
charts, graphs, and maps to make complex data more accessible and understandable.

3. List two factors influencing the selection of data analysis tools. (CO5, K1)

Answer: Scalability and ease of integration with other systems are two factors influencing
the selection of data analysis tools.

4. Differentiate between structured and unstructured data. (CO5, K2)

Answer: Structured data is organized in a predefined format like tables, while unstructured
data lacks a specific format and includes text, images, and videos.

5. Explain real-time data processing. (CO5, K2)

Answer: Real-time data processing involves continuously analyzing data as it is generated,


allowing for immediate insights and decisions based on live data streams.

6. What is Amazon QuickSight? (CO5, K1)

Answer: Amazon QuickSight is a cloud-based business intelligence service that helps create
interactive dashboards and visualizations to analyze data.

7. Define AWS Athena. (CO5, K1)

Answer: AWS Athena is a serverless query service that allows users to analyze data stored
in Amazon S3 using SQL without requiring a database.

8. List two real-time data ingestion services offered by AWS. (CO5, K1)

Answer: Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka
(MSK) are two real-time data ingestion services provided by AWS.

9. Explain the ETL process. (CO5, K2)

Answer: ETL (Extract, Transform, Load) is a process in which data is extracted from
sources, transformed into a format suitable for analysis, and loaded into a target data
storage system like a database.
10. What is AWS Glue used for? (CO5, K1)

Answer: AWS Glue is a fully managed ETL service that prepares data for analytics by
automating the data integration tasks like extraction, transformation, and loading.

11. What is the purpose of Amazon Redshift? (CO5, K1)

Answer: Amazon Redshift is a cloud data warehouse service designed for fast querying and
analyzing large datasets.

12. Differentiate between Amazon Redshift and Amazon Athena. (CO5, K2)

Answer: Amazon Redshift is a data warehouse service used for large-scale structured data
analysis, while Amazon Athena is a serverless query service that queries data directly from
Amazon S3 using SQL.

13. What are the benefits of using AWS CodePipeline for gaming analytics?
(CO5, K2)

Answer: AWS CodePipeline automates the continuous integration and continuous delivery
(CI/CD) process, allowing faster deployment of game analytics updates, testing, and real-
time monitoring.

14. Define AWS Step Functions. (CO5, K1)

Answer: AWS Step Functions is a serverless orchestration service that helps coordinate
multiple AWS services into serverless workflows for automating tasks.

15. Explain how AWS Kinesis works in data analytics. (CO5, K2)

Answer: AWS Kinesis ingests, processes, and analyzes real-time streaming data, enabling
applications to react quickly to new information, such as gaming events or sensor data.

16. What is the role of Amazon S3 in data analytics? (CO5, K1)

Answer: Amazon S3 is an object storage service that stores large volumes of data, including
raw and processed data, logs, and backups, for use in analytics applications.

17. List two AWS services for automating infrastructure deployment. (CO5, K1)

Answer: AWS CloudFormation and AWS CDK (Cloud Development Kit) are two services for
automating infrastructure deployment.
18. How does Amazon QuickSight support business intelligence? (CO5, K2)

Answer: Amazon QuickSight allows businesses to create interactive dashboards, perform


ad-hoc analysis, and visualize data trends, enabling data-driven decision-making.

19. What is the function of Amazon Kinesis Data Firehose? (CO5, K1)

Answer: Amazon Kinesis Data Firehose is used to capture, transform, and load streaming
data into destinations like Amazon S3, Redshift, and Elasticsearch for analytics.

20. Differentiate between AWS Lambda and Amazon EC2 for data processing.
(CO5, K2)

Answer: AWS Lambda provides serverless computing, automatically scaling to handle


event-driven workloads, while Amazon EC2 offers virtual servers with full control over
compute resources, ideal for custom applications.

21. Define CI/CD. (CO5, K1)

Answer: CI/CD stands for Continuous Integration and Continuous Deployment, a process
that automates the building, testing, and deployment of code.

22. How can AWS Step Functions automate workflows? (CO5, K2)

Answer: AWS Step Functions can automate workflows by coordinating various services and
tasks, providing retry mechanisms, and handling errors, making complex processes easier
to manage.

23. Explain the use of Amazon Redshift Spectrum. (CO5, K2)

Answer: Amazon Redshift Spectrum allows you to query data directly from Amazon S3
without moving it into Redshift, enabling large-scale analytics across both Redshift and S3-
stored data.

24. What is the purpose of AWS CloudFormation? (CO5, K1)

Answer: AWS CloudFormation automates the provisioning and management of AWS


resources using templates, enabling infrastructure as code (IaC).

25. List two benefits of using Amazon DynamoDB for gaming analytics. (CO5,
K1)

Answer: Amazon DynamoDB provides low-latency, scalable storage for player data and
leaderboards, and it offers automatic scaling based on demand.
Part B – Questions
Q.No. | Question | CO | K Level
1 | Explain the process of selecting appropriate AWS tools for a gaming analytics use case. Discuss the factors influencing tool selection, and compare AWS Kinesis, Redshift, and QuickSight for their roles in the analytics pipeline. | CO5 | K2
2 | Describe the steps involved in automating the infrastructure deployment pipeline using AWS services like CloudFormation and CodePipeline. Provide a detailed explanation of how CI/CD can be integrated into this process and its benefits for gaming applications. | CO5 | K4
3 | Analyze the advantages and challenges of using AWS Step Functions to automate workflows in a data analytics pipeline. Provide examples of how Step Functions can be used to coordinate services like Lambda, S3, and SageMaker in a gaming analytics scenario. | CO5 | K4
4 | Evaluate the role of machine learning in AWS SageMaker for a gaming analytics use case. Discuss how SageMaker's model training, deployment, and monitoring capabilities can be leveraged to predict player behavior and improve gameplay experiences. | CO5 | K3
5 | Compare and contrast Amazon Athena and Redshift in terms of their data processing and querying capabilities. How would you decide which service is more suitable for analyzing large datasets in a gaming analytics use case? | CO5 | K3
Supportive Online Certification Courses (NPTEL, Swayam, Coursera, Udemy, etc.)
Sl.No. | Course | Platform
1 | Introduction to Data Engineering Using Azure | NPTEL
2 | Data Engineering using AWS Data Analytics | Udemy
3 | Cloud Data Engineering | Coursera
4 | Professional Certificate in Data Engineering | edX
Real-Time Applications in Day-to-Day Life and in Industry
1. Real-time Gaming Analytics Platform
Use Case:
Objective: Monitor player behavior, game performance, and in-game events in
real-time.
Tools & Technologies:
• AWS Kinesis: Collecting and streaming real-time data from in-game
events like player actions, in-app purchases, and player interactions.
• AWS Lambda: Processing and transforming the real-time game data as
it flows in, enabling immediate insights.
• AWS QuickSight or Amazon Redshift: Visualizing the real-time analytics
on player retention, levels, scores, or revenue streams.
• AWS Step Functions: Orchestrating workflows to trigger actions like
sending notifications when certain in-game thresholds are met (e.g., a
player reaches a milestone).
• AWS CloudWatch: Monitoring and logging game server performance or
detecting unusual player behavior (e.g., cheating or server downtime).
Real-time Element: Data from game sessions (events, player scores, etc.) is
processed in real-time, providing live dashboards for monitoring and quick
decision-making.

2. Real-time Social Media Sentiment Analysis


Use Case:
Objective: Analyze public sentiment from social media platforms about a brand or
product in real time, allowing for agile responses.
Tools & Technologies:
• AWS Kinesis Data Streams: To capture live data from social media feeds
(e.g., Twitter, Reddit) in real time.
• AWS Lambda: To process the incoming social media data, perform
sentiment analysis using NLP models (Natural Language Processing),
and generate sentiment scores (positive, negative, neutral).
• AWS Sagemaker: For machine learning model deployment and refining
sentiment analysis models.
• AWS QuickSight or Tableau (via AWS integration): Visualizing sentiment
trends in real-time, such as daily volume, user engagement, and
sentiment polarity.
• Step Functions: Automating workflows to trigger actions like generating
reports or alerting marketing teams when negative sentiment spikes.
Real-time Element: Social media data is ingested and analyzed continuously,
enabling live updates on sentiment changes and immediate actions if necessary.
Content Beyond Syllabus – Unit 5
Advanced Data Streaming Architectures
Beyond the Syllabus:
• Event-Driven Architectures: Real-time applications often rely on event-driven
architectures. Tools like Apache Kafka and AWS EventBridge can be integrated
with AWS services for more complex event-driven pipelines, enabling fine-grained
control over data streams and event processing.
• Complex Event Processing (CEP): Explore frameworks for processing complex
event patterns in real time (e.g., Apache Flink or AWS Kinesis Data Analytics for
SQL), enabling the detection of events that match predefined patterns (such as fraud
detection or anomaly detection in IoT).
Practical Application:
• A multi-source real-time data ingestion system combining Kinesis for streaming data
and Kafka for high-throughput message distribution.
• Real-time decision-making using AWS Lambda with custom CEP rules that trigger
automated actions based on event patterns.

Machine Learning Integration with Real-Time Analytics


Beyond the Syllabus:
• Real-Time Predictions with ML Models: Advanced use cases could involve running
machine learning models in real-time to process streaming data (e.g., sentiment
analysis, fraud detection, predictive maintenance). AWS SageMaker provides
capabilities for deploying models that can be invoked by AWS Lambda or Step
Functions.
• Model Retraining Pipelines: In dynamic environments (like gaming or fraud
detection), models need to be retrained with fresh data. Automating this pipeline using
AWS SageMaker Pipelines can help refresh models in real-time based on incoming
data, ensuring that predictions stay relevant.
Practical Application:
• Building a fraud detection system that continuously retrains its model with new
transaction data, improving the system's ability to detect new patterns of fraudulent
behavior.
• Creating a recommendation engine in a gaming app that personalizes user experiences
in real time based on their in-game behavior.
Assessment Schedule (Proposed Date & Actual Date)
Assessment Tool | Proposed Date | Actual Date | Course Outcome | Program Outcome (Filled Gap)
Assessment I | 28-01-2025 | | CO1, CO2 |
Assessment II | 10-03-2025 | | CO3, CO4 |
Model | 03-04-2025 | | CO1, CO2, CO3, CO4, CO5, CO6 |
Prescribed Text Books & Reference Books

Sl.No. | Book Name & Author | Type
1 | Martin Kleppman, “Data Engineering: Building Reliable Scalable Data Systems”, O'Reilly Media, 2017 | Text Book
2 | Wes McKinney, “Python for Data Analysis”, 2nd Edition, O'Reilly Media, 2017 | Text Book
3 | Martin Kleppman, “Designing Data-Intensive Applications”, O'Reilly Media, 2017 | Reference Book
4 | AWS Documentation (amazon.com) | Reference Link
5 | AWS Skill Builder | Reference Link
6 | AWS Academy Data Engineering Course - https://www.awsacademy.com/vforcesite/LMS_Login | Reference Course
Mini Project Suggestions

Mini-Project 1: Building a Data Pipeline for Gaming Analytics (CO5, K5)

Objective:
Design and implement a data pipeline that collects, processes, analyzes, and visualizes gaming
analytics data using AWS services.

Tasks:

1. Collect synthetic player data using Amazon Kinesis or an equivalent tool.

2. Store the collected data in Amazon S3.

3. Use AWS Glue to transform and clean the data.

4. Query the cleaned data using Amazon Athena.

5. Create visualizations in Amazon QuickSight to display player activity trends.

Mini-Project 2: CI/CD Pipeline for Gaming Analytics Deployment (CO5, K4)

Objective:
Design a CI/CD pipeline for deploying a gaming analytics application on AWS.

Tasks:

1. Use AWS CodePipeline, CodeBuild, and CodeDeploy to create the pipeline.

2. Deploy an application that processes and analyzes gaming data stored in Amazon S3.

3. Automate the entire deployment pipeline using AWS Step Functions.

4. Integrate error handling and logging mechanisms to ensure smooth operations.



Mini-Project 3: Tool Selection Framework for Gaming Analytics (CO5, K4)

Objective:
Develop a framework to evaluate and select AWS tools for gaming analytics based on factors like
scalability, cost, latency, and ease of use.

Tasks:

1. Research and evaluate AWS tools such as Amazon Redshift, Amazon S3, Amazon EMR, AWS
Glue, and Amazon Athena.

2. Develop a scoring model to compare the tools for gaming analytics use cases.

3. Prepare a report and presentation summarizing the findings and recommendations.

Mini-Project 4: Automating a Data Workflow with AWS Step Functions (CO5, K4)

Objective:
Design and implement an automated data workflow using AWS Step Functions for a gaming
analytics pipeline.

Tasks:

1. Create a state machine using Step Functions to orchestrate tasks like data collection,
transformation, and visualization.

2. Integrate services like Amazon S3, AWS Glue, Amazon Athena, and Amazon QuickSight.

3. Ensure error handling and monitoring are built into the workflow.

4. Demonstrate the workflow using sample data.



Mini-Project 5: Real-Time Gaming Analytics Dashboard (CO5, K5)


Objective:
Create a real-time dashboard for gaming analytics using AWS services.
Tasks:
1. Use Amazon Kinesis to collect real-time gaming data.
2. Process the data using AWS Lambda or AWS Glue.
3. Store the processed data in Amazon S3 or Amazon Redshift.
4. Build a real-time dashboard in Amazon QuickSight to visualize metrics like active players,
revenue, or game performance.
Thank you

Disclaimer:

This document is confidential and intended solely for the educational purpose of RMK Group of
Educational Institutions. If you have received this document through email in error, please notify the
system manager. This document contains proprietary information and is intended only to the
respective group / learning community as intended. If you are not the addressee you should not
disseminate, distribute or copy through e-mail. Please notify the sender immediately by e-mail if you
have received this document by mistake and delete this document from your system. If you are not
the intended recipient you are notified that disclosing, copying, distributing or taking any action in
reliance on the contents of this information is strictly prohibited.
