AWS Internship Report
Bachelor of Technology
in
ARTIFICIAL INTELLIGENCE & DATA SCIENCE
By
KESAGANI ASHA JYOTHI
Reg. No: 21H71A5404
Offered by
CERTIFICATE
This is to certify that the Internship Report entitled “AWS DATA ENGINEERING Virtual
Internship” submitted by KESAGANI ASHA JYOTHI (Reg. No: 21H71A5404) to the DVR
& Dr. HS MIC College of Technology in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology in Artificial Intelligence & Data Science.
CERTIFICATE OF INTERNSHIP
ACKNOWLEDGEMENT
The satisfaction that accompanies the successful completion of any task would be incomplete
without mentioning the people who made it possible and whose constant guidance and
encouragement crown all the efforts with success. I thank our college management and the respected
Sri D. PANDURANGA RAO, CEO, for providing the necessary infrastructure to carry out
the Internship.
I express my sincere thanks to Dr. T. Vamsee Kiran, Principal who has been a great source of
inspiration and motivation for the internship program.
I am thankful to AICTE, EduSkills and Google for Developers for providing me the
opportunity to carry out the internship in such a prestigious organization.
I take this opportunity to express my thanks to one and all who directly or indirectly helped me
in bringing this effort to its present form.
Finally, my special thanks go to my family for their continuous support, help, and
encouragement, which enabled me to complete the Internship on time.
CONTENTS
Title
Declaration
Certificate
Acknowledgement
Contents
Abstract
Chapter 1: Introduction
Chapter 2: Data-Driven Organizations
Chapter 3: The Elements of Data
Chapter 4: Design Principles & Patterns for Data Pipelines
Chapter 5: Securing & Scaling the Data Pipeline
Chapter 6: Ingesting & Preparing Data
Chapter 7: Ingesting by Batch or by Stream
Chapter 8: Storing & Organizing Data
Chapter 9: Processing Big Data
Chapter 10: Processing Data for ML
Chapter 11: Analyzing & Visualizing Data
Chapter 12: Conclusion
ABSTRACT
AWS Data Engineering is the process of designing, building, and managing data pipelines
using Amazon Web Services (AWS). It focuses on efficiently handling large volumes of
structured and unstructured data by leveraging AWS’s cloud-based services. This field is
essential for businesses looking to process, store, and analyze data at scale, enabling data-driven
decision-making and real-time insights.
A key aspect of AWS Data Engineering is data ingestion, where raw data is extracted from
multiple sources, such as databases, APIs, and streaming services. Tools like Amazon Kinesis
and AWS Data Pipeline facilitate seamless real-time and batch data ingestion. Once collected,
data undergoes transformation and processing using AWS Glue, AWS Lambda, and Apache
Spark, ensuring it is structured, cleansed, and ready for analysis.
For data storage, AWS offers various solutions, including Amazon S3 for scalable object
storage, Amazon Redshift for data warehousing, Amazon DynamoDB for NoSQL storage, and
Amazon RDS for relational databases. These services enable organizations to store and retrieve
data efficiently while ensuring high availability and security. AWS Data Engineering also
involves automating data workflows through services like AWS Step Functions and AWS Glue
Workflows, which help orchestrate complex data processes with minimal manual intervention.
CHAPTER 1
INTRODUCTION
1.1 Open Code Reference Log:
CodeWhisperer learns from open-source projects, and the code it suggests might occasionally
resemble code samples from the training data. With the reference log, you can view references
to code suggestions that are similar to the training data. When such occurrences happen,
CodeWhisperer notifies you and provides repository and licensing information. Use this
information to decide whether to use the code in your project and to attribute the source code
properly.
CodeWhisperer code generation offers many benefits for software development organizations. It
accelerates application development for faster delivery of software solutions. Automating
repetitive tasks optimizes the use of developer time, so developers can focus on more critical
aspects of the project. Additionally, code generation helps mitigate security vulnerabilities,
safeguarding the integrity of the codebase. CodeWhisperer also protects open-source intellectual
property by providing the open-source reference tracker. CodeWhisperer enhances code quality
and reliability, leading to robust and efficient applications. It supports an efficient response to
evolving software threats, keeping the codebase up to date with the latest security practices.
CodeWhisperer has the potential to increase development speed, security, and the quality of
software.
AWS Data Engineering is the process of designing, building, and managing scalable, cloud-
based data pipelines using Amazon Web Services (AWS). It involves collecting, transforming,
storing, and analyzing large volumes of data efficiently, enabling organizations to make data-
driven decisions. With AWS, data engineers can create high-performance, secure, and cost-
effective solutions for handling structured and unstructured data.
The core components of AWS Data Engineering include data ingestion, transformation, storage,
and analytics. Data ingestion is facilitated by tools like Amazon Kinesis, AWS Data Pipeline,
and AWS Glue, which help in streaming and batch data collection. Once the data is ingested, it
undergoes processing and transformation using services such as AWS Glue, AWS Lambda,
Apache Spark on Amazon EMR, and AWS Step Functions, ensuring it is clean and structured for
further analysis.
For storage and management, AWS provides multiple solutions based on data type and use case.
Amazon S3 is widely used for data lakes, while Amazon Redshift serves as a powerful data
warehouse for analytics. Amazon DynamoDB and Amazon RDS handle NoSQL and relational
database needs, respectively. These services offer scalability, high availability, and security for
managing large datasets.
AWS Data Engineering also focuses on automation, monitoring, and security. Engineers use
AWS Glue Workflows, AWS Step Functions, and AWS CloudWatch to automate and monitor
data pipelines. Security measures such as IAM roles, encryption, and AWS KMS (Key
Management Service) help protect sensitive data.
CHAPTER 2
DATA-DRIVEN ORGANIZATIONS
Filtering and retrieving data is a fundamental process in data engineering that involves selecting
specific subsets of data based on defined conditions. In AWS Data Engineering, this process is
crucial for optimizing performance, reducing storage costs, and improving data analysis
efficiency.
Data filtering allows users to extract only relevant data by applying conditions on attributes such
as time range, category, or numerical values. AWS provides multiple services to filter data
efficiently. For example, in Amazon S3, filtering can be done using S3 Select, which retrieves
only the required data from objects stored in a bucket instead of scanning the entire dataset.
Similarly, Amazon Athena allows SQL-based querying on data stored in S3, enabling users to
filter and retrieve data on demand.
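As a small illustration, S3 Select can be invoked from Python with boto3 so that only matching rows are returned from an object instead of the whole file; the bucket, key, and column names below are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/key: a CSV of orders with columns including 'category' and 'amount'.
response = s3.select_object_content(
    Bucket="example-data-bucket",
    Key="orders/2024/orders.csv",
    ExpressionType="SQL",
    # Only rows matching the condition are returned, not the entire object.
    Expression="SELECT s.order_id, s.amount FROM s3object s WHERE s.category = 'books'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The result arrives as an event stream; collect the Records payloads.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")
```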
In real-time data streaming, filtering can be performed using Amazon Kinesis Data Analytics,
which allows SQL-based transformations on streaming data before it is stored or processed
further. AWS also provides AWS Lambda for event-driven filtering, ensuring only relevant data
is processed. Efficient data filtering and retrieval help reduce query execution time, optimize
storage, and improve the overall performance of data processing pipelines. By leveraging AWS
tools, businesses can ensure faster insights and cost-effective data management.
Amazon S3 Lifecycle:
Amazon S3 Lifecycle policies help you manage your objects through two types of actions,
Transition and Expiration. In the architecture shown in Figure 1, we create an S3 Lifecycle
configuration rule that expires objects after ‘x’ days. It includes a filter on a specific object tag.
You can configure the value of ‘x’ based on your requirements.
If you are using an S3 bucket to store short-lived objects with unknown access patterns, you
might want to keep the objects that are still being accessed but delete the rest. This will let you
retain objects in your S3 bucket even after their expiry date as per the S3 lifecycle rules while
saving you costs by deleting objects that are not needed anymore. The following diagram shows
an architecture that considers the last accessed date of the object before deleting S3 objects.
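A minimal boto3 sketch of such an expiration rule, assuming ‘x’ is 30 days and using a placeholder bucket name and object tag:

```python
import boto3

s3 = boto3.client("s3")

# Expire objects carrying a specific tag after 30 days ('x' = 30 here).
# The bucket name and the tag key/value are placeholders.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-short-lived-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-tagged-objects",
                "Filter": {"Tag": {"Key": "delete-after-expiry", "Value": "true"}},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)
```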
This diagram represents an Amazon S3 data processing workflow, integrating multiple AWS
services to manage, automate, and optimize data storage and processing. The process begins with
an Amazon S3 Source Bucket, where raw data is stored. To enhance tracking and monitoring,
server access logs and inventory reports are sent to designated S3 target buckets.
Next, Amazon EventBridge is used to trigger AWS Lambda, automating data processing tasks.
Lambda then interacts with various AWS components to streamline operations. It can execute
Amazon Athena queries to analyze stored data, store processed metadata in an S3 Manifest
Bucket, or initiate large-scale modifications using S3 Batch Operations.
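As a rough sketch of the Lambda-plus-Athena step in this workflow, the handler below could be attached to an EventBridge rule; the database, table, and results location are placeholders.

```python
import boto3

athena = boto3.client("athena")

def handler(event, context):
    """Triggered by an EventBridge rule; runs an Athena query over data stored in S3."""
    # Database, table, and output location are placeholders for this sketch.
    response = athena.start_query_execution(
        QueryString="SELECT status, COUNT(*) AS total FROM access_logs GROUP BY status",
        QueryExecutionContext={"Database": "example_logs_db"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )
    return {"queryExecutionId": response["QueryExecutionId"]}
```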
CHAPTER 3
THE ELEMENTS OF DATA
Big Data refers to the vast amounts of structured, semi-structured, and unstructured data that are
generated at high velocity and volume from various sources such as social media, IoT devices,
business transactions, and sensors. Traditional data processing systems struggle to handle Big
Data due to its three primary characteristics—Volume, Velocity, and Variety. To process and
analyze this data efficiently, modern data architectures leverage cloud platforms, distributed
computing, and specialized tools like AWS, Hadoop, Apache Spark, and NoSQL databases.
Semi-Structured Schema
A semi-structured schema falls between structured and unstructured data formats. Unlike
structured data (which follows a fixed schema like relational databases), semi-structured data
does not strictly adhere to predefined schemas but still contains tags or markers to separate
elements. Examples of semi-structured data include JSON, XML, YAML, Avro, and Parquet.
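As a brief illustration, the hypothetical JSON record below carries self-describing tags and nested, optional fields rather than a fixed relational schema, which is what makes it semi-structured.

```python
import json

# A hypothetical patient-event record: fields are self-describing and may vary per record.
record = """
{
  "patient_id": "P-1001",
  "event": "vitals",
  "timestamp": "2024-05-01T10:30:00Z",
  "measurements": {"heart_rate": 72, "spo2": 98},
  "notes": null
}
"""

data = json.loads(record)
# Nested or optional fields are navigated by key rather than by a fixed column position.
print(data["measurements"].get("heart_rate"))
```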
AWS provides a scalable, secure, and HIPAA-compliant framework for healthcare data analytics,
integrating Big Data, AI, and Machine Learning (ML) to improve patient care and operational
efficiency. The framework starts with data ingestion, where healthcare data is collected from
multiple sources, such as Electronic Health Records (EHRs), IoT medical devices, medical
imaging, and clinical trials. AWS services like AWS Glue, Amazon Kinesis, and AWS Storage
Gateway enable efficient batch and real-time data ingestion.
Once collected, the data is stored in different AWS services based on its structure. Amazon S3
acts as a Data Lake, storing unstructured and semi-structured data, while Amazon RDS and
Aurora manage structured relational databases. For NoSQL-based records, Amazon DynamoDB
is used, and Amazon Redshift serves as a data warehouse for analytical queries. To ensure the
data is clean and ready for analysis, AWS Glue, AWS Lambda, and Kinesis handle data
transformation and processing in real-time or batch mode.
This image illustrates key components of Healthcare Data Analytics by showing how different
types of medical data are collected and analyzed for better patient care. The magnifying glass in
the center represents data analytics, used to process and interpret diverse healthcare data
sources. Security and compliance are critical in healthcare, and AWS ensures HIPAA, GDPR, and
HITECH compliance using AWS IAM for access control, AWS KMS for encryption, and AWS
CloudTrail for auditing. Data lifecycle management is optimized with Amazon S3 Lifecycle
Policies and Intelligent-Tiering, automatically archiving old medical records to lower-cost
storage while maintaining accessibility.
Overall, AWS provides a highly scalable, secure, and cost-effective healthcare data analytics
framework, enabling real-time insights, AI-driven diagnostics, and better patient outcomes while
ensuring regulatory compliance and operational efficiency.
This visualization emphasizes how integrating multiple healthcare data sources enables better
decision-making, predictive analytics, and personalized treatments. Healthcare organizations use
Big Data, AI, and Machine Learning to analyze these data points, detect anomalies, and improve
patient outcomes.
CHAPTER 4
DESIGN PRINCIPLES & PATTERNS FOR DATA PIPELINES
The AWS Well-Architected Framework is a set of best practices, principles, and guidelines
designed to help cloud architects build secure, high-performing, resilient, and efficient
infrastructure on AWS. It provides a structured approach to evaluate and optimize cloud
workloads based on six core pillars.
Six Pillars of the AWS Well-Architected Framework:
1. Operational Excellence
   - Focuses on running and monitoring systems efficiently to deliver business value.
   - Uses automation, continuous improvement, and well-defined operational procedures.
   - AWS services: AWS CloudFormation, AWS CloudWatch, AWS X-Ray, AWS Systems Manager.
2. Security
   - Ensures the protection of data, systems, and assets through risk assessment and mitigation.
   - Implements identity and access controls, encryption, and security monitoring.
   - AWS services: AWS IAM, AWS Shield, AWS WAF, AWS KMS, AWS CloudTrail.
3. Reliability
   - Builds resilient architectures that can withstand failures and recover quickly.
   - Includes fault tolerance, disaster recovery, and high availability strategies.
   - AWS services: Amazon S3, AWS Auto Scaling, Amazon Route 53, AWS Backup.
4. Performance Efficiency
   - Optimizes resources to ensure high performance and scalability.
   - Uses the latest technologies and continuously improves system architecture.
   - AWS services: Amazon EC2 Auto Scaling, Amazon CloudFront, AWS Lambda, Amazon RDS.
5. Cost Optimization
   - Helps organizations maximize efficiency while minimizing costs.
   - Involves right-sizing resources, using reserved instances, and monitoring expenditures.
   - AWS services: AWS Cost Explorer, AWS Budgets, Amazon S3 Intelligent-Tiering.
6. Sustainability
   - Focuses on reducing environmental impact by optimizing energy consumption.
   - Encourages resource-efficient workload design and minimal carbon footprint.
   - AWS services: AWS Graviton, AWS Auto Scaling, Amazon EC2 Spot Instances.
The AWS Well-Architected Framework provides best practices and guidelines to help
organizations design and build secure, high-performing, resilient, and efficient cloud
architectures. It is structured around six key pillars, each addressing critical aspects of cloud
system design and operation.
By following AWS Well-Architected principles, organizations can design, evaluate, and
optimize their cloud architectures for long-term success.
AWS services like Amazon Athena, Amazon QuickSight, and Amazon SageMaker allow users to
perform queries, generate reports, and apply machine learning models to gain predictive insights.
By implementing a robust data pipeline infrastructure, organizations can improve efficiency,
enhance scalability, and ensure seamless data-driven operations while leveraging the power of
cloud-based computing.
CHAPTER 5
SECURING & SCALING THE DATA PIPELINE
Data classification is the process of categorizing data based on its sensitivity, importance, and
compliance requirements. It helps organizations manage and protect data effectively by assigning
different security levels, such as public data (openly available information), internal data
(restricted to employees), and confidential or sensitive data (highly critical information like
financial records and personal data). AWS services like Amazon Macie assist in identifying and
protecting sensitive data, while AWS Glue Data Catalog helps manage metadata.
The security pillars in AWS focus on safeguarding data and cloud workloads through best
practices. Identity and Access Management (IAM) ensures that only authorized users can access
resources, using services like AWS IAM and Amazon Cognito for role-based permissions. Data
protection and encryption secure data at rest and in transit using AWS Key Management Service
(KMS) and AWS Certificate Manager. Threat detection and monitoring involve continuous
security assessments with AWS Security Hub, Amazon GuardDuty, and AWS CloudTrail to detect
anomalies and unauthorized activities. Compliance and governance ensure adherence to industry
regulations such as HIPAA, GDPR, and SOC 2, with tools like AWS Config and AWS Audit
Manager enforcing security policies. By implementing strong data classification and security
pillars, organizations can enhance data protection, prevent breaches, and ensure regulatory
compliance in the AWS cloud.
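As a minimal sketch of data protection with AWS KMS via boto3, the snippet below encrypts and decrypts a small payload; the key alias is a placeholder, and decryption succeeds only for principals whose IAM policies allow kms:Decrypt on that key.

```python
import boto3

kms = boto3.client("kms")

# The key alias is a placeholder; in practice it references a customer-managed KMS key.
ciphertext = kms.encrypt(
    KeyId="alias/example-data-key",
    Plaintext=b"patient-record: confidential",
)["CiphertextBlob"]

# Decryption is authorized through IAM and the key policy, not by sharing key material.
plaintext = kms.decrypt(CiphertextBlob=ciphertext)["Plaintext"]
print(plaintext.decode("utf-8"))
```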
How AWS CDK Works:
1. Define Infrastructure in Code – Developers use their preferred programming language to
define AWS resources.
2. Synthesize to CloudFormation – AWS CDK compiles the code into an AWS CloudFormation
template.
3. Deploy to AWS – The generated CloudFormation stack is deployed to AWS, provisioning
resources automatically.
The AWS Cloud Development Kit (AWS CDK) is an open-source Infrastructure as Code (IaC)
framework that allows developers to define and provision cloud infrastructure using familiar
programming languages like TypeScript, Python, Java, C#, and Go. Instead of manually writing
AWS CloudFormation templates in YAML or JSON, AWS CDK enables developers to define
cloud resources using high-level programming constructs, making infrastructure management
more efficient, scalable, and maintainable.
AWS CDK works by converting application code into AWS CloudFormation templates, which
AWS then uses to provision resources. It introduces the concept of constructs, which are reusable
components that encapsulate AWS resources and configurations, allowing developers to define
complex architectures with minimal code. These constructs are categorized into three levels: L1
(low-level constructs) that map directly to CloudFormation resources, L2 (higher-level
constructs) that provide sensible defaults and abstraction, and L3 (opinionated stacks) that offer
complete architectural patterns.
For example, a developer can use AWS CDK in TypeScript to define a serverless web
application with an Amazon S3 bucket, an AWS Lambda function, and an API Gateway, all
within a few lines of code. This significantly reduces development time compared to manually
configuring resources through the AWS Management Console or writing CloudFormation
templates from scratch.
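The same serverless example can be sketched in Python, which AWS CDK also supports; the construct names, the Lambda asset directory, and the runtime below are illustrative rather than taken from this internship's code.

```python
from aws_cdk import App, Stack
from aws_cdk import aws_apigateway as apigw
from aws_cdk import aws_lambda as _lambda
from aws_cdk import aws_s3 as s3
from constructs import Construct


class ServerlessWebAppStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # S3 bucket for the application's static assets (an L2 construct with sensible defaults).
        bucket = s3.Bucket(self, "AssetsBucket")

        # Lambda backend; 'lambda/' is a placeholder directory containing index.py.
        handler = _lambda.Function(
            self,
            "ApiHandler",
            runtime=_lambda.Runtime.PYTHON_3_11,
            handler="index.handler",
            code=_lambda.Code.from_asset("lambda"),
            environment={"BUCKET_NAME": bucket.bucket_name},
        )
        bucket.grant_read(handler)

        # REST API in front of the Lambda function.
        apigw.LambdaRestApi(self, "WebApi", handler=handler)


app = App()
ServerlessWebAppStack(app, "ServerlessWebAppStack")
app.synth()
```

Running `cdk synth` on such an app produces the corresponding CloudFormation template, and `cdk deploy` provisions the stack.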
One of the key benefits of AWS CDK is its ability to support multi-language development,
allowing teams to use their preferred programming languages while leveraging the power of
AWS infrastructure automation. It also integrates seamlessly with AWS CloudFormation,
ensuring that infrastructure deployments are repeatable, version-controlled, and safely managed.
Additionally, AWS CDK promotes best practices such as modularity, reusability, and automation,
making it a powerful tool for DevOps teams looking to streamline cloud deployments.
Overall, AWS CDK revolutionizes cloud infrastructure management by bringing software
development best practices into the world of cloud provisioning. By using AWS CDK,
organizations can accelerate cloud adoption, improve efficiency, and maintain a well-architected
infrastructure with minimal complexity.
CHAPTER 6
INGESTING & PREPARING DATA
ETL (Extract, Transform, Load) is typically used in traditional data warehouses, while ELT
(Extract, Load, Transform) is preferred for big data and cloud-based analytics.
Data ingestion is the process of collecting and importing raw data from various sources into a
centralized storage system for further processing and analysis. This is a critical first step in any
data pipeline, ensuring that structured, semi-structured, or unstructured data is efficiently
gathered and made available for transformation and analytics.
Data can be ingested in two main ways: batch ingestion and streaming ingestion. Batch ingestion
involves periodically collecting and processing large volumes of data at scheduled intervals,
making it suitable for traditional databases and data warehouses. In contrast, streaming ingestion
enables real-time data flow, allowing businesses to process and analyze data as it arrives, which
is essential for applications like fraud detection, IoT monitoring, and real-time analytics.
AWS provides a range of services to support data ingestion, including Amazon Kinesis for real-
time streaming, AWS Glue for batch processing, AWS DataSync for transferring large datasets,
and Amazon S3 for scalable storage. Once ingested, the data is ready for transformation,
analysis, and integration into data lakes, warehouses, or machine learning models.
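A minimal boto3 sketch of the two ingestion styles, using placeholder bucket, stream, and file names: a batch upload to Amazon S3 and a single event pushed to an Amazon Kinesis data stream.

```python
import json

import boto3

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")

# Batch ingestion: upload a file produced by a nightly export (names are placeholders).
s3.upload_file("daily_export.csv", "example-raw-zone", "sales/2024-05-01/daily_export.csv")

# Streaming ingestion: push a single event into a Kinesis data stream as it happens.
event = {"device_id": "sensor-42", "temperature": 21.7, "ts": "2024-05-01T10:30:00Z"}
kinesis.put_record(
    StreamName="example-iot-stream",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["device_id"],
)
```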
By implementing an efficient data ingestion strategy, organizations can ensure seamless data
flow, improve decision-making, and enhance business intelligence capabilities.
Preparing data:
Preparing data is a crucial step in the data pipeline that ensures raw data is cleaned, structured,
and transformed for analysis, reporting, or machine learning applications. The preparation
process typically includes data cleaning, which involves filtering out errors, correcting
inconsistencies, and dealing with incomplete records. Next, data transformation converts data
into the required format through aggregation, normalization, or encoding. Additionally, data
enrichment combines multiple datasets to provide more meaningful insights by adding relevant
contextual information.
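These preparation steps can be sketched with pandas (a common choice, though not a tool prescribed by this report); the raw dataset below is hypothetical.

```python
import pandas as pd

# Hypothetical raw extract with duplicates, a missing value, and inconsistent codes.
raw = pd.DataFrame(
    {
        "order_id": [1, 1, 2, 3],
        "amount": [100.0, 100.0, None, 250.0],
        "country": ["us", "us", "IN", "in"],
    }
)

# Cleaning: drop duplicate rows and fill missing amounts with the column median.
clean = raw.drop_duplicates().copy()
clean["amount"] = clean["amount"].fillna(clean["amount"].median())

# Transformation and enrichment: normalize categorical codes and add a derived column.
clean["country"] = clean["country"].str.upper()
clean["amount_thousands"] = clean["amount"] / 1000.0

print(clean)
```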
CHAPTER 7
INGESTING BY BATCH OR BY STREAM
AWS CloudTrail is a service that provides audit logging, governance, and compliance
monitoring by recording all API calls and user activities across AWS accounts. It captures details
such as who made the request, what actions were taken, when they occurred, and from where the
request originated. CloudTrail helps organizations track changes, detect unauthorized access,
troubleshoot issues, and ensure security best practices are followed.
CloudTrail logs events in Amazon S3, allowing users to analyze activity using Amazon Athena or
integrate with AWS CloudWatch for real-time monitoring and alerts. It also supports multi-
region logging and event history tracking, making it easier to investigate security incidents and
maintain regulatory compliance.
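As an example of working with this audit trail programmatically, recent CloudTrail events can be retrieved with boto3; the event name and time window below are illustrative.

```python
from datetime import datetime, timedelta

import boto3

cloudtrail = boto3.client("cloudtrail")

# Look up console sign-in events from the last 24 hours; the window is illustrative.
events = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "ConsoleLogin"}],
    StartTime=datetime.utcnow() - timedelta(days=1),
    EndTime=datetime.utcnow(),
    MaxResults=10,
)

for e in events["Events"]:
    # Each event records who acted, what they did, and when.
    print(e["EventTime"], e.get("Username", "unknown"), e["EventName"])
```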
2. Streaming Ingestion
Streaming ingestion involves real-time data collection and processing, making it ideal for time-
sensitive applications such as IoT monitoring, fraud detection, and real-time analytics. Unlike
batch processing, streaming ingestion enables immediate insights by continuously ingesting and
analyzing data as it arrives.
Choosing between batch and stream ingestion depends on the use case, latency requirements, and
infrastructure cost. While batch ingestion is suited for large-scale historical processing,
streaming ingestion is essential for real-time, event-driven applications.
The decision between batch and streaming ingestion depends on:
- Latency requirements – If real-time processing is needed, use streaming.
- Data volume and frequency – High-frequency data sources (IoT, financial transactions) benefit from streaming, while periodic reports work well with batch processing.
- Cost considerations – Batch ingestion is generally more cost-effective, whereas streaming requires continuous processing and higher resource utilization.
Many modern data architectures use a hybrid approach, combining both batch and streaming
ingestion to balance performance, cost, and real-time analytics.
CHAPTER 8
STORING & ORGANIZING DATA
A data warehouse is a centralized system designed for storing, managing, and analyzing large
volumes of structured and semi-structured data. It serves as a repository that consolidates data
from multiple sources, such as transactional databases, CRM systems, and log files, enabling
efficient querying and reporting for business intelligence (BI) and analytics.
Data warehouses are optimized for read-heavy workloads and complex queries, allowing
organizations to gain insights through historical analysis, trend forecasting, and decision-making.
Unlike traditional databases, which focus on real-time transactions, data warehouses are
structured for analytical processing using OLAP (Online Analytical Processing), supporting
aggregations, indexing, and multidimensional analysis.
A typical data warehouse architecture consists of multiple layers, including a data source layer
that collects raw data, an ETL (Extract, Transform, Load) process that cleans and organizes data,
a storage layer where structured data is housed, and a presentation layer that allows users to
access data through reporting and visualization tools. Data is usually modeled using star schema
or snowflake schema, where a fact table holds measurable business data and dimension tables
provide descriptive attributes for better analysis.
To enhance performance, data warehouses leverage columnar storage, indexing, and partitioning,
which allow faster query execution. AWS provides Amazon Redshift, a cloud-based data
warehousing solution that supports massively parallel processing (MPP), enabling businesses to
scale efficiently and analyze petabyte-scale datasets. Redshift integrates seamlessly with AWS
Glue for ETL, Amazon S3 for data lakes, and Amazon QuickSight for visualization, making it a
powerful tool for advanced analytics.
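A short sketch of a typical star-schema query submitted through the Amazon Redshift Data API; the cluster, database, user, and table names are placeholders.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# A typical star-schema query: join the fact table to a dimension table and aggregate.
sql = """
    SELECT d.region, SUM(f.sales_amount) AS total_sales
    FROM fact_sales f
    JOIN dim_store d ON f.store_key = d.store_key
    GROUP BY d.region
    ORDER BY total_sales DESC;
"""

# Cluster, database, and user are placeholders for an existing Redshift deployment.
response = redshift_data.execute_statement(
    ClusterIdentifier="example-redshift-cluster",
    Database="analytics",
    DbUser="analyst",
    Sql=sql,
)
print("Statement submitted:", response["Id"])
```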
Columnar storage, as used by Amazon Redshift, and columnar file formats like Apache Parquet
and ORC enhance query speed and data compression.
The Data Warehouse Architecture depicted in the diagram outlines the structured flow of data
from various sources to business intelligence tools. Initially, data is gathered from operational
systems and external sources into transaction databases, where it is temporarily stored.
The ETL (Extract, Transform, Load) process then cleans, transforms, and loads this data into the
data warehouse database, ensuring consistency and accuracy. Within the data warehouse, data
can be further segmented into data marts, which store specific subsets of data for different
business functions.
CHAPTER 9
PROCESSING BIG DATA
Apache Hadoop is an open-source framework for the distributed storage (HDFS) and processing
(MapReduce) of large datasets across clusters of commodity hardware. Hadoop's distributed
nature ensures fault tolerance, as data is replicated across nodes, minimizing
the risk of data loss. It is widely integrated with cloud platforms like AWS (Amazon EMR), Azure
HDInsight, and Google Cloud Dataproc, making it even more accessible for organizations. By
leveraging Hadoop, businesses can perform complex data analytics, log processing,
recommendation system modeling, fraud detection, and more with high efficiency and scalability.
Apache Spark is an open-source distributed computing system designed for fast data processing
and analytics. It provides a high-performance alternative to MapReduce by utilizing in-memory
computing, making it significantly faster for iterative computations and real-time analytics. Spark
is widely used for Big Data processing, machine learning, graph analytics, and stream processing.
Spark supports multiple programming languages, including Scala, Python (PySpark), Java, and R,
making it versatile for developers. It runs efficiently on Hadoop clusters, on Kubernetes, in
standalone mode, or on cloud services like Amazon EMR, Databricks, and Google Cloud Dataproc.
With its high speed, scalability, and ease of integration, Apache Spark is a preferred choice for
real-time analytics, ETL workflows, AI/ML applications, and Big Data transformations across
industries.
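A short PySpark sketch of a typical distributed transformation; the S3 paths and column names are placeholders, and on Amazon EMR the cluster configuration would normally be supplied by the environment rather than a local session.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Local SparkSession for illustration; on Amazon EMR the cluster provides the context.
spark = SparkSession.builder.appName("sales-aggregation").getOrCreate()

# Hypothetical input: CSV sales events stored in S3 (or a local path when testing).
sales = spark.read.csv("s3://example-raw-zone/sales/*.csv", header=True, inferSchema=True)

# Distributed transformation: filter, group, and aggregate across worker nodes.
summary = (
    sales.filter(F.col("amount") > 0)
    .groupBy("category")
    .agg(F.sum("amount").alias("total_amount"), F.count("*").alias("orders"))
)

# Write the curated result back to S3 in a columnar format.
summary.write.mode("overwrite").parquet("s3://example-curated-zone/sales_summary/")
spark.stop()
```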
The diagram illustrates the Apache Spark cluster architecture, which follows a Master-Slave
(Driver-Worker) model for distributed data processing. The Master Node contains the Driver
Program, which is responsible for managing the execution of tasks. Within the driver program,
the Spark Context plays a crucial role in coordinating the overall execution flow by
communicating with the cluster manager. The cluster manager can be YARN, Mesos, or a
Standalone Manager, and it is responsible for allocating resources to different worker nodes.
This architecture enables Spark to process Big Data workloads efficiently by distributing tasks
across multiple worker nodes. By leveraging in-memory computation and parallel execution,
Apache Spark provides a high-performance framework for handling large-scale data analytics
and machine learning tasks.
CHAPTER 10
PROCESSING DATA FOR ML
Data preprocessing is a fundamental step in machine learning that involves transforming raw
data into a clean, structured, and usable format for model training. Since real-world data is often
incomplete, inconsistent, and noisy, preprocessing ensures that machine learning models can
effectively learn patterns and make accurate predictions.
The preprocessing pipeline includes several essential tasks. Data cleaning handles missing
values, removes duplicates, and corrects errors, ensuring data consistency. Feature engineering
involves selecting or creating meaningful features to improve model performance. Normalization
and standardization scale numerical values to maintain consistency across different data ranges.
Encoding categorical variables converts non-numeric data into machine-readable formats, while
handling outliers prevents extreme values from negatively impacting model performance.
Finally, data splitting divides the dataset into training, validation, and test sets to assess model
generalization.
Data preprocessing is a crucial step in the machine learning pipeline that enhances data quality
and ensures models can learn effectively. Raw data collected from various sources is often
incomplete, inconsistent, or noisy, which can lead to poor model performance. Preprocessing
involves a series of steps to clean, transform, and structure the data for optimal analysis and
learning.
The first step, data cleaning, deals with handling missing values, removing duplicates, and
correcting inconsistencies. Missing values can be addressed by techniques such as imputation
(filling in with mean, median, or mode) or removing incomplete records. Noise reduction
eliminates errors and irrelevant information, often using filtering techniques.
Next, data transformation ensures data is in a suitable format for machine learning models. This
includes feature scaling, where numerical values are normalized (scaling to a fixed range) or
standardized (converting to a normal distribution). Encoding categorical variables is another
transformation step, where non-numeric categories are converted into numerical representations
(e.g., one-hot encoding or label encoding).
Feature engineering is another critical step, where relevant features are created or selected to
improve model efficiency. Feature selection techniques, such as correlation analysis or
dimensionality reduction (PCA), help eliminate redundant or irrelevant features.
Finally, data splitting ensures that the dataset is properly divided into training, validation, and
test sets to evaluate model performance accurately. The dataset is usually split in a 70-20-10 or
80-10-10 ratio, ensuring models are tested on unseen data before deployment.
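A compact scikit-learn sketch of these steps (median imputation, scaling, one-hot encoding, and a simple train/test split); the dataset and column names are hypothetical, and scikit-learn itself is an assumed tool rather than one prescribed by this report.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical dataset: numeric and categorical features plus a binary label.
df = pd.DataFrame(
    {
        "age": [34, 51, None, 29, 42, 38],
        "income": [48000, 82000, 56000, None, 61000, 73000],
        "segment": ["a", "b", "a", "c", "b", "a"],
        "label": [0, 1, 0, 0, 1, 1],
    }
)

X, y = df.drop(columns=["label"]), df["label"]

# Hold out test data before fitting any transformers, so the test set stays unseen.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

preprocess = ColumnTransformer(
    [
        # Numeric columns: impute missing values with the median, then standardize.
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), ["age", "income"]),
        # Categorical columns: one-hot encode, ignoring categories unseen at training time.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["segment"]),
    ]
)

X_train_ready = preprocess.fit_transform(X_train)
X_test_ready = preprocess.transform(X_test)
print(X_train_ready.shape, X_test_ready.shape)
```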
By performing comprehensive data preprocessing, machine learning models become more
robust, accurate, and capable of handling real-world scenarios effectively.
CHAPTER 11
ANALYZING & VISUALIZING DATA
Data visualization on AWS enables businesses to analyze and interpret large datasets using
cloud-based tools and services. AWS provides a range of services to facilitate data visualization,
helping users derive insights from raw data efficiently. These services support interactive
dashboards, real-time monitoring, and reporting functionalities for business intelligence (BI) and
data analytics.
One of the primary tools for data visualization on AWS is Amazon QuickSight, a scalable and
fully managed BI service. QuickSight allows users to create interactive dashboards, perform ad-
hoc analysis, and generate reports using various data sources, including Amazon S3, Amazon
Redshift, and AWS Glue. It also integrates with machine learning capabilities to provide
advanced data insights.
For real-time data visualization, Amazon CloudWatch helps monitor AWS resources and
applications, providing metrics, logs, and automated alerts. Similarly, Amazon Managed Grafana
provides the powerful open-source Grafana visualization tool as a managed service that integrates
with AWS services like Amazon Timestream and AWS IoT Analytics for real-time operational
monitoring.
Organizations dealing with big data can use Amazon Redshift for data warehousing and
analytical querying, paired with visualization tools such as Tableau, Power BI, or Looker. AWS
Glue assists in preparing and transforming data before visualization, ensuring clean and
structured datasets.
Amazon QuickSight also supports machine learning-powered insights, allowing users to detect
anomalies, forecast trends, and automate data analysis without requiring data science expertise.
One of QuickSight's key features is SPICE (Super-fast, Parallel, In-memory Calculation
Engine), which accelerates data retrieval and enhances performance by preloading data into
memory for quick access. This eliminates the need for continuous database queries, reducing
latency and improving dashboard responsiveness.
QuickSight also integrates with AWS services such as CloudTrail, Athena, and Glue, providing
a seamless data analytics ecosystem. It supports embedded analytics, allowing businesses to
integrate QuickSight dashboards directly into their applications or websites.
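As a sketch of that embedded-analytics capability, the QuickSight API call below generates a signed embed URL for a registered user; the account ID, user ARN, and dashboard ID are placeholders.

```python
import boto3

quicksight = boto3.client("quicksight")

# Account ID, user ARN, and dashboard ID are placeholders for an existing QuickSight setup.
response = quicksight.generate_embed_url_for_registered_user(
    AwsAccountId="123456789012",
    UserArn="arn:aws:quicksight:us-east-1:123456789012:user/default/analyst",
    ExperienceConfiguration={
        "Dashboard": {"InitialDashboardId": "example-dashboard-id"}
    },
    SessionLifetimeInMinutes=60,
)

# The signed URL can be placed in an iframe inside the organization's own web application.
print(response["EmbedUrl"])
```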
By using Amazon QuickSight, businesses can create interactive reports and real-time visualizations
and share insights across teams without the complexities of traditional BI tools. Its serverless
nature ensures automatic scaling and pay-per-session pricing, making it a cost-efficient solution
for organizations looking to leverage data-driven decision-making.
CHAPTER 12
CONCLUSION
The AWS Data Engineering Virtual Internship provides a comprehensive learning experience for
individuals looking to build expertise in cloud-based data solutions. Throughout the program,
interns gain hands-on experience with key AWS services such as Amazon S3, AWS Glue,
Amazon Redshift, AWS Lambda, and Amazon Kinesis, enabling them to design and implement
efficient data pipelines. The internship covers critical aspects of data ingestion, transformation,
storage, and visualization, ensuring a strong foundation in cloud data engineering.
By working on real-world scenarios and projects, interns develop skills in ETL processes, big
data management, and analytics, preparing them for careers in data engineering and cloud
computing. The exposure to AWS best practices, security measures, and automation tools further
enhances their ability to build scalable and reliable data architectures.
Overall, the internship serves as a stepping stone for aspiring data engineers, equipping them
with practical knowledge and industry-relevant experience. It enables participants to understand
the power of AWS services in handling large-scale data workloads and helps them build the
expertise required for modern data-driven enterprises.