
AWS DATA ENGINEERING VIRTUAL INTERNSHIP

An Internship Report submitted in


partial fulfillment of the requirements
for the award of the degree of

Bachelor of Technology
in
ARTIFICIAL INTELLIGENCE & DATA SCIENCE
By
KESAGANI ASHA JYOTHI
Reg. No: 21H71A5404

Offered by

AICTE - EDUSKILLS – AWS

Department of Artificial Intelligence


DVR & Dr. HS

MIC College of Technology


(Autonomous)

Kanchikacherla –521180, NTR Dist., Andhra Pradesh


March - 2025

CERTIFICATE

This is to certify that the Internship Report entitled “AWS DATA ENGINEERING Virtual Internship” submitted by KESAGANI ASHA JYOTHI (Reg. No: 21H71A5404) to the DVR & Dr. HS MIC College of Technology in partial fulfillment of the requirements for the award of the Degree of Bachelor of Technology in Artificial Intelligence and Data Science is a bona fide record of work.

Internship Coordinator Head of the Department

Examiner 1 Principal Examiner 2

CERTIFICATE OF INTERNSHIP
ACKNOWLEDGEMENT

The satisfaction that accompanies the successful completion of any task would be incomplete without mentioning the people who made it possible and whose constant guidance and encouragement crowned all efforts with success. I thank our college management and the respected Sri D. PANDURANGA RAO, CEO, for providing the infrastructure necessary to carry out the internship.

I express my sincere thanks to Dr. T. Vamsee Kiran, Principal, who has been a great source of inspiration and motivation throughout the internship program.

I profoundly thank Dr. G. Sai Chaitanya Kumar, Head, Department of Artificial Intelligence, for permitting me to carry out the internship.

I am thankful to AICTE, EduSkills, and AWS for providing me the opportunity to carry out the internship with such a prestigious organization.

I am thankful to our Internship Coordinator, Mr. A. Kalyan Kumar, Assistant Professor, Department of AI, for his support and professionalism, which helped me complete the internship on time.

I take this opportunity to express my thanks to one and all who directly or indirectly helped me in bringing this effort to its present form.

Finally, my special thanks go to my family for their continuous support, help, and encouragement throughout the internship and for helping me complete it on time.

CONTENTS

Title
Declaration
Certificate
Acknowledgement
Contents
Abstract
Chapter 1: Introduction
1.1 Open Code Reference Log
1.2 Overview of AWS Data Engineering
Chapter 2: Data-Driven Organizations
Chapter 3: The Elements of Data
Chapter 4: Design Principles & Patterns for Data Pipelines
Chapter 5: Securing & Scaling the Data Pipeline
Chapter 6: Ingesting & Preparing Data
Chapter 7: Ingesting by Batch or by Stream
Chapter 8: Storing & Organizing Data
Chapter 9: Processing Big Data
Chapter 10: Processing Data for ML
Chapter 11: Analyzing & Visualizing Data
Chapter 12: Conclusion

ABSTRACT

AWS Data Engineering is the process of designing, building, and managing data pipelines
using Amazon Web Services (AWS). It focuses on efficiently handling large volumes of
structured and unstructured data by leveraging AWS’s cloud-based services. This field is
essential for businesses looking to process, store, and analyze data at scale, enabling data-driven
decision-making and real-time insights.

A key aspect of AWS Data Engineering is data ingestion, where raw data is extracted from
multiple sources, such as databases, APIs, and streaming services. Tools like Amazon Kinesis
and AWS Data Pipeline facilitate seamless real-time and batch data ingestion. Once collected,
data undergoes transformation and processing using AWS Glue, AWS Lambda, and Apache
Spark, ensuring it is structured, cleansed, and ready for analysis.

For data storage, AWS offers various solutions, including Amazon S3 for scalable object
storage, Amazon Redshift for data warehousing, Amazon DynamoDB for NoSQL storage, and
Amazon RDS for relational databases. These services enable organizations to store and retrieve
data efficiently while ensuring high availability and security. AWS Data Engineering also
involves automating data workflows through services like AWS Step Functions and AWS Glue
Workflows, which help orchestrate complex data processes with minimal manual intervention.

CHAPTER 1

INTRODUCTION
1.1 Open Code Reference Log:
Amazon CodeWhisperer learns from open-source projects, and the code it suggests might occasionally resemble code samples from the training data. With the reference log, you can view references to code suggestions that are similar to the training data. When such occurrences happen, CodeWhisperer notifies you and provides repository and licensing information. Use this information to decide whether to use the code in your project and to properly attribute the source code as desired.

CodeWhisperer code generation offers many benefits for software development organizations. It accelerates application development for faster delivery of software solutions. Automating repetitive tasks optimizes the use of developer time, so developers can focus on more critical aspects of the project. Additionally, code generation helps mitigate security vulnerabilities, safeguarding the integrity of the codebase. CodeWhisperer also protects open-source intellectual property by providing the open-source reference tracker. CodeWhisperer enhances code quality and reliability, leading to robust and efficient applications. It supports an efficient response to evolving software threats, keeping the codebase up to date with the latest security practices. CodeWhisperer has the potential to increase development speed, security, and the quality of software.

1.2 OVERVIEW OF AWS DATA ENGINEERING:

AWS Data Engineering is the process of designing, building, and managing scalable, cloud-
based data pipelines using Amazon Web Services (AWS). It involves collecting, transforming,
storing, and analyzing large volumes of data efficiently, enabling organizations to make data-
driven decisions. With AWS, data engineers can create high-performance, secure, and cost-
effective solutions for handling structured and unstructured data.
The core components of AWS Data Engineering include data ingestion, transformation, storage,
and analytics. Data ingestion is facilitated by tools like Amazon Kinesis, AWS Data Pipeline,
and AWS Glue, which help in streaming and batch data collection. Once the data is ingested, it
undergoes processing and transformation using services such as AWS Glue, AWS Lambda,
Apache Spark on Amazon EMR, and AWS Step Functions, ensuring it is clean and structured for
further analysis.
For storage and management, AWS provides multiple solutions based on data type and use case.
Amazon S3 is widely used for data lakes, while Amazon Redshift serves as a powerful data
warehouse for analytics. Amazon DynamoDB and Amazon RDS handle NoSQL and relational
database needs, respectively. These services offer scalability, high availability, and security for
managing large datasets.
AWS Data Engineering also focuses on automation, monitoring, and security. Engineers use
AWS Glue Workflows, AWS Step Functions, and AWS CloudWatch to automate and monitor
data pipelines. Security measures such as IAM roles, encryption, and AWS KMS (Key
Management Service) help protect sensitive data.
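
To make the ingestion step concrete, the following Python (boto3) sketch pushes a single JSON event into a Kinesis data stream; the stream name, region, and event fields are assumptions used only for illustration.

import json

import boto3

# Hypothetical stream and region; replace with your own resources.
kinesis = boto3.client("kinesis", region_name="us-east-1")

def ingest_event(event: dict) -> None:
    """Send one JSON event to a Kinesis data stream."""
    kinesis.put_record(
        StreamName="clickstream-events",          # assumed stream name
        Data=json.dumps(event).encode("utf-8"),   # payload must be bytes
        PartitionKey=str(event["user_id"]),       # controls shard distribution
    )

ingest_event({"user_id": 42, "action": "page_view", "page": "/home"})

The partition key determines which shard receives each record, which affects both throughput distribution and per-key ordering.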

CHAPTER 2
DATA-DRIVEN ORGANIZATIONS

2.1 Filtering and Retrieving Data:

Filtering and retrieving data is a fundamental process in data engineering that involves selecting
specific subsets of data based on defined conditions. In AWS Data Engineering, this process is
crucial for optimizing performance, reducing storage costs, and improving data analysis
efficiency.
Data filtering allows users to extract only relevant data by applying conditions on attributes such
as time range, category, or numerical values. AWS provides multiple services to filter data
efficiently. For example, in Amazon S3, filtering can be done using S3 Select, which retrieves
only the required data from objects stored in a bucket instead of scanning the entire dataset.
Similarly, Amazon Athena allows SQL-based querying on data stored in S3, enabling users to
filter and retrieve data on demand.
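
A minimal S3 Select call with boto3 might look like the sketch below; the bucket, object key, and column names are assumptions, and the CSV object is expected to have a header row.

import boto3

s3 = boto3.client("s3")

response = s3.select_object_content(
    Bucket="sales-data-bucket",                 # assumed bucket
    Key="orders/2024/orders.csv",               # assumed object key
    ExpressionType="SQL",
    Expression="SELECT s.order_id, s.amount FROM s3object s WHERE s.region = 'APAC'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The response payload is an event stream; 'Records' events carry the matching rows.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))

Because only the matching rows leave S3, the client avoids downloading and scanning the full object.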

AWS Services for Filtering & Retrieving Data:

• Amazon S3 Select – Retrieves only the required data from objects instead of scanning full datasets.
• Amazon Athena – Uses SQL queries to filter and retrieve data stored in Amazon S3.
• Amazon Redshift & Amazon RDS – Allow SQL-based filtering for structured data.
• Amazon DynamoDB – Uses query and scan operations with filter expressions for NoSQL data.
• Amazon Kinesis Data Analytics – Filters and transforms real-time streaming data using SQL queries.
• AWS Lambda – Processes event-driven data filtering before storage or further processing.

Benefits of Filtering & Retrieving Data in AWS:

• Reduces storage and processing costs by eliminating unnecessary data.
• Speeds up query execution for faster insights.
• Optimizes resource usage for efficient cloud computing.
• Improves security and compliance by restricting access to sensitive data.

In real-time data streaming, filtering can be performed using Amazon Kinesis Data Analytics,
which allows SQL-based transformations on streaming data before it is stored or processed
further. AWS also provides AWS Lambda for event-driven filtering, ensuring only relevant data
is processed. Efficient data filtering and retrieval help reduce query execution time, optimize
storage, and improve the overall performance of data processing pipelines. By leveraging AWS
tools, businesses can ensure faster insights and cost-effective data management.

2.2 Amazon S3:

Amazon S3 (Simple Storage Service)


Amazon S3 (Simple Storage Service) is a scalable, highly durable, and secure cloud storage
service provided by Amazon Web Services (AWS). It enables users to store and retrieve
unlimited amounts of data from anywhere on the internet. Designed for high availability and
durability, S3 is widely used for data backup, big data analytics, machine learning, web hosting,
and content delivery.
Amazon S3 stores data as objects within buckets, eliminating the need for complex file systems
or database structures. Each object consists of data, metadata, and a unique identifier. Users can
organize and manage data efficiently with lifecycle policies, versioning, and access controls,
ensuring optimized storage and security.
One of the key advantages of S3 is its flexible storage classes, allowing users to choose cost-
effective solutions based on data access frequency. These include S3 Standard for frequently
accessed data, S3 Intelligent-Tiering for automatic cost optimization, S3 Standard-IA for
infrequent access, and S3 Glacier for long-term archival storage. This flexibility helps businesses
reduce costs while ensuring data remains accessible when needed.

Amazon S3 Lifecycle:
Amazon S3 Lifecycle policies help you manage your objects through two types of actions,
Transition and Expiration. In the architecture shown in Figure 1, we create an S3 Lifecycle
configuration rule that expires objects after ‘x’ days. It has a filter for a specific object tag, and you can configure the value of ‘x’ based on your requirements.
If you are using an S3 bucket to store short-lived objects with unknown access patterns, you might want to keep the objects that are still being accessed but delete the rest. This lets you retain objects in your S3 bucket even after their expiry date under the S3 Lifecycle rules, while saving costs by deleting objects that are no longer needed. The following diagram shows an architecture that considers the last accessed date of an object before deleting S3 objects.
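
A minimal boto3 sketch of such an expiration rule is shown below; the bucket name, tag key/value, and the retention period (‘x’ = 30 days here) are assumptions.

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="short-lived-objects-bucket",        # assumed bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-tagged-objects",
                "Status": "Enabled",
                # Only objects carrying this (hypothetical) tag are affected.
                "Filter": {"Tag": {"Key": "retention", "Value": "temporary"}},
                "Expiration": {"Days": 30},     # 'x' days
            }
        ]
    },
)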

This diagram represents an Amazon S3 data processing workflow, integrating multiple AWS
services to manage, automate, and optimize data storage and processing. The process begins with
an Amazon S3 Source Bucket, where raw data is stored. To enhance tracking and monitoring,
server access logs and inventory reports are sent to designated S3 target buckets.
Next, Amazon EventBridge is used to trigger AWS Lambda, automating data processing tasks.
Lambda then interacts with various AWS components to streamline operations. It can execute
Amazon Athena queries to analyze stored data, store processed metadata in an S3 Manifest
Bucket, or initiate large-scale modifications using S3 Batch Operations.

CHAPTER 3
THE ELEMENTS OF DATA

3.1 Big data & Semi-structured schema:

Big Data refers to the vast amounts of structured, semi-structured, and unstructured data that are
generated at high velocity and volume from various sources such as social media, IoT devices,
business transactions, and sensors. Traditional data processing systems struggle to handle Big
Data due to its three primary characteristics—Volume, Velocity, and Variety. To process and
analyze this data efficiently, modern data architectures leverage cloud platforms, distributed
computing, and specialized tools like AWS, Hadoop, Apache Spark, and NoSQL databases.

Semi-Structured Schema
A semi-structured schema falls between structured and unstructured data formats. Unlike
structured data (which follows a fixed schema like relational databases), semi-structured data
does not strictly adhere to predefined schemas but still contains tags or markers to separate
elements. Examples of semi-structured data include JSON, XML, YAML, Avro, and Parquet.

Key Characteristics of Semi-Structured Data:

• Flexible Schema: Unlike relational databases, fields can vary between records.
• Hierarchical or Nested Data: Stores relationships within the data structure.
• Easier Parsing than Unstructured Data: Readable by analytical tools without requiring strict schemas.
• Common Formats: JSON, XML, YAML, Avro, and Parquet.

Why is Semi-Structured Data Important in Big Data?

• Scalability: It easily handles large-scale data across distributed systems.
• Flexibility: Allows schema evolution without requiring database redesign.
• Efficient Storage & Querying: Optimized formats like Parquet and Avro enable fast retrieval and processing.
• Better Integration with Modern Tools: It works well with NoSQL databases, cloud data lakes (Amazon S3), and real-time analytics platforms.
Big Data solutions heavily rely on semi-structured schemas to manage dynamic, fast-growing
datasets efficiently. AWS provides tools like AWS Glue, Amazon Athena, and Amazon Redshift
Spectrum to process and analyze such data seamlessly.
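
A small Python example shows what a flexible schema means in practice: two JSON records whose fields differ are flattened into one table with pandas (the field names are hypothetical).

import pandas as pd

# Two semi-structured records: nested values, and fields that vary per record.
records = [
    {"patient_id": 1, "vitals": {"hr": 72, "bp": "120/80"}, "allergies": ["penicillin"]},
    {"patient_id": 2, "vitals": {"hr": 65}},   # no 'bp' or 'allergies' here
]

# json_normalize flattens nested keys into dotted columns; missing fields become NaN.
df = pd.json_normalize(records)
print(df.columns.tolist())   # e.g. ['patient_id', 'allergies', 'vitals.hr', 'vitals.bp']
print(df)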

3.2 Healthcare Data Analytics Framework

AWS provides a scalable, secure, and HIPAA-compliant framework for healthcare data analytics,
integrating Big Data, AI, and Machine Learning (ML) to improve patient care and operational
efficiency. The framework starts with data ingestion, where healthcare data is collected from
multiple sources, such as Electronic Health Records (EHRs), IoT medical devices, medical
imaging, and clinical trials. AWS services like AWS Glue, Amazon Kinesis, and AWS Storage
Gateway enable efficient batch and real-time data ingestion.
Once collected, the data is stored in different AWS services based on its structure. Amazon S3
acts as a Data Lake, storing unstructured and semi-structured data, while Amazon RDS and
Aurora manage structured relational databases. For NoSQL-based records, Amazon DynamoDB
is used, and Amazon Redshift serves as a data warehouse for analytical queries. To ensure the
data is clean and ready for analysis, AWS Glue, AWS Lambda, and Kinesis handle data
transformation and processing in real-time or batch mode.

This image illustrates key components of Healthcare Data Analytics by showing how different
types of medical data are collected and analyzed for better patient care. The magnifying glass in
the center represents data analytics, used to process and interpret diverse healthcare data
sources. Security and compliance are critical in healthcare, and AWS ensures HIPAA, GDPR, and
HITECH compliance using AWS IAM for access control, AWS KMS for encryption, and AWS
CloudTrail for auditing. Data lifecycle management is optimized with Amazon S3 Lifecycle
Policies and Intelligent-Tiering, automatically archiving old medical records to lower-cost
storage while maintaining accessibility.
Overall, AWS provides a highly scalable, secure, and cost-effective healthcare data analytics
framework, enabling real-time insights, AI-driven diagnostics, and better patient outcomes while
ensuring regulatory compliance and operational efficiency.
This visualization emphasizes how integrating multiple healthcare data sources enables better
decision-making, predictive analytics, and personalized treatments. Healthcare organizations use
Big Data, AI, and Machine Learning to analyze these data points, detect anomalies, and improve
patient outcomes.

CHAPTER 4
DESIGN PRINCIPLES & PATTERNS FOR DATA PIPELINES

4.1 AWS WELL-ARCHITECTED FRAMEWORK:

The AWS Well-Architected Framework is a set of best practices, principles, and guidelines
designed to help cloud architects build secure, high-performing, resilient, and efficient
infrastructure on AWS. It provides a structured approach to evaluate and optimize cloud
workloads based on six core pillars.
Six Pillars of the AWS Well-Architected Framework:
1. Operational Excellence
   • Focuses on running and monitoring systems efficiently to deliver business value.
   • Uses automation, continuous improvement, and well-defined operational procedures.
   • AWS services: AWS CloudFormation, Amazon CloudWatch, AWS X-Ray, AWS Systems Manager.
2. Security
   • Ensures the protection of data, systems, and assets through risk assessment and mitigation.
   • Implements identity and access controls, encryption, and security monitoring.
   • AWS services: AWS IAM, AWS Shield, AWS WAF, AWS KMS, AWS CloudTrail.
3. Reliability
   • Builds resilient architectures that can withstand failures and recover quickly.
   • Includes fault tolerance, disaster recovery, and high availability strategies.
   • AWS services: Amazon S3, AWS Auto Scaling, Amazon Route 53, AWS Backup.
4. Performance Efficiency
   • Optimizes resources to ensure high performance and scalability.
   • Uses the latest technologies and continuously improves system architecture.
   • AWS services: Amazon EC2 Auto Scaling, Amazon CloudFront, AWS Lambda, Amazon RDS.
5. Cost Optimization
   • Helps organizations maximize efficiency while minimizing costs.
   • Involves right-sizing resources, using reserved instances, and monitoring expenditures.
   • AWS services: AWS Cost Explorer, AWS Budgets, Amazon S3 Intelligent-Tiering.
6. Sustainability
   • Focuses on reducing environmental impact by optimizing energy consumption.
   • Encourages resource-efficient workload design and a minimal carbon footprint.
   • AWS services: AWS Graviton, AWS Auto Scaling, Amazon EC2 Spot Instances.

Each pillar addresses a critical aspect of cloud system design and operation. By following these Well-Architected principles, organizations can design, evaluate, and continuously optimize their cloud architectures for long-term success.

4.2 Data pipeline infrastructure

A Data Pipeline Infrastructure is a structured system designed to automate the movement, transformation, and processing of data from multiple sources to a target destination for storage, analysis, and decision-making. It ensures efficient data collection, validation, and enrichment, allowing organizations to derive meaningful insights in real-time or batch mode.
The data pipeline process begins with data ingestion, where raw data is collected from various
sources, including databases, APIs, IoT devices, logs, and third-party applications. AWS services
such as Amazon Kinesis, AWS Glue, and AWS DataSync facilitate seamless and scalable data
ingestion. Once ingested, data is stored in structured, semi-structured, or unstructured formats
using AWS services like Amazon S3 for Data Lakes, Amazon RDS for relational databases,
Amazon Redshift for data warehousing, and Amazon DynamoDB for NoSQL storage.

AWS services like Amazon Athena, Amazon QuickSight, and Amazon SageMaker allow users to
perform queries, generate reports, and apply machine learning models to gain predictive insights.
By implementing a robust data pipeline infrastructure, organizations can improve efficiency,
enhance scalability, and ensure seamless data-driven operations while leveraging the power of
cloud-based computing.
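
As an illustration of the analytics layer, the boto3 sketch below submits a SQL query to Amazon Athena and reads back the result rows; the database, table, and S3 output location are assumptions.

import time

import boto3

athena = boto3.client("athena")

query_id = athena.start_query_execution(
    QueryString="SELECT region, SUM(amount) AS total FROM orders GROUP BY region",
    QueryExecutionContext={"Database": "analytics_db"},               # assumed database
    ResultConfiguration={"OutputLocation": "s3://query-results-bucket/athena/"},
)["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])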

CHAPTER 5
SECURING & SCALING THE DATA PIPELINE

5.1 Data classification & security pillars:

Data classification is the process of categorizing data based on its sensitivity, importance, and
compliance requirements. It helps organizations manage and protect data effectively by assigning
different security levels, such as public data (openly available information), internal data
(restricted to employees), and confidential or sensitive data (highly critical information like
financial records and personal data). AWS services like Amazon Macie assist in identifying and
protecting sensitive data, while AWS Glue Data Catalog helps manage metadata.

The security pillars in AWS focus on safeguarding data and cloud workloads through best
practices. Identity and Access Management (IAM) ensures that only authorized users can access
resources, using services like AWS IAM and AWS Cognito for role-based permissions. Data
protection and encryption secure data at rest and in transit using AWS Key Management Service
(KMS) and AWS Certificate Manager. Threat detection and monitoring involve continuous
security assessments with AWS Security Hub, Amazon GuardDuty, and AWS CloudTrail to detect
anomalies and unauthorized activities. Compliance and governance ensure adherence to industry
regulations such as HIPAA, GDPR, and SOC 2, with tools like AWS Config and AWS Audit
Manager enforcing security policies. By implementing strong data classification and security
pillars, organizations can enhance data protection, prevent breaches, and ensure regulatory
compliance in the AWS cloud.
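
As a small example of the data protection pillar, the boto3 sketch below uploads a confidential object to S3 encrypted at rest with a customer managed KMS key; the bucket, object key, local file, and key alias are assumptions.

import boto3

s3 = boto3.client("s3")

with open("payroll.csv", "rb") as body:
    s3.put_object(
        Bucket="confidential-records",                         # assumed bucket
        Key="finance/2024/payroll.csv",                        # assumed object key
        Body=body,
        ServerSideEncryption="aws:kms",                        # SSE-KMS encryption at rest
        SSEKMSKeyId="alias/data-classification-confidential",  # assumed key alias
    )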

5.2 AWS Cloud Development Kit

How AWS CDK Works:
1. Define Infrastructure in Code – Developers use their preferred programming language to
define AWS resources.
2. Synthesize to CloudFormation – AWS CDK compiles the code into an AWS CloudFormation
template.
3. Deploy to AWS – The generated CloudFormation stack is deployed to AWS, provisioning
resources automatically.

The AWS Cloud Development Kit (AWS CDK) is an open-source Infrastructure as Code (IaC)
framework that allows developers to define and provision cloud infrastructure using familiar
programming languages like TypeScript, Python, Java, C#, and Go. Instead of manually writing
AWS CloudFormation templates in YAML or JSON, AWS CDK enables developers to define
cloud resources using high-level programming constructs, making infrastructure management
more efficient, scalable, and maintainable.
AWS CDK works by converting application code into AWS CloudFormation templates, which
AWS then uses to provision resources. It introduces the concept of constructs, which are reusable
components that encapsulate AWS resources and configurations, allowing developers to define
complex architectures with minimal code. These constructs are categorized into three levels: L1
(low-level constructs) that map directly to CloudFormation resources, L2 (higher-level
constructs) that provide sensible defaults and abstraction, and L3 (opinionated stacks) that offer
complete architectural patterns.
For example, a developer can use AWS CDK in TypeScript to define a serverless web
application with an Amazon S3 bucket, an AWS Lambda function, and an API Gateway, all
within a few lines of code. This significantly reduces development time compared to manually
configuring resources through the AWS Management Console or writing CloudFormation
templates from scratch.
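
A comparable sketch using the CDK's Python bindings is shown below; the construct IDs, the lambda/ source directory, and the app.handler module are assumptions for illustration.

from aws_cdk import App, Stack
from aws_cdk import aws_apigateway as apigw
from aws_cdk import aws_lambda as _lambda
from aws_cdk import aws_s3 as s3
from constructs import Construct

class ServerlessWebStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # L2 constructs with sensible defaults: a bucket, a function, and a REST API.
        bucket = s3.Bucket(self, "AssetsBucket", versioned=True)

        handler = _lambda.Function(
            self, "ApiHandler",
            runtime=_lambda.Runtime.PYTHON_3_11,
            handler="app.handler",                    # assumed module.function
            code=_lambda.Code.from_asset("lambda"),   # assumed local source directory
            environment={"BUCKET_NAME": bucket.bucket_name},
        )
        bucket.grant_read(handler)                    # CDK generates the IAM policy

        apigw.LambdaRestApi(self, "Endpoint", handler=handler)

app = App()
ServerlessWebStack(app, "ServerlessWebStack")
app.synth()   # emits the CloudFormation template under cdk.out/

Running cdk deploy on this app synthesizes the CloudFormation template and provisions the bucket, function, permissions, and API endpoint in a single step.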

One of the key benefits of AWS CDK is its ability to support multi-language development,
allowing teams to use their preferred programming languages while leveraging the power of
AWS infrastructure automation. It also integrates seamlessly with AWS CloudFormation,
ensuring that infrastructure deployments are repeatable, version-controlled, and safely managed.
Additionally, AWS CDK promotes best practices such as modularity, reusability, and automation,
making it a powerful tool for DevOps teams looking to streamline cloud deployments.
Overall, AWS CDK revolutionizes cloud infrastructure management by bringing software
development best practices into the world of cloud provisioning. By using AWS CDK,
organizations can accelerate cloud adoption, improve efficiency, and maintain a well-architected
infrastructure with minimal complexity.

CHAPTER 6
INGESTING & PREPARING DATA

6.1 ETL & ELT


The Extract, Transform, Load (ETL) process is a critical data pipeline workflow used to collect,
process, and store data for analysis and business intelligence. It enables organizations to
consolidate data from multiple sources, transform it into a usable format, and load it into a
centralized system such as a data warehouse, data lake, or database.
The ETL process consists of three main stages:
1. Extract – In this stage, raw data is collected from various sources, such as databases, APIs,
flat files, IoT devices, logs, and third-party applications. Data extraction can be performed in
real time or in batches, depending on the use case. AWS services like AWS Glue, AWS DataSync, and Amazon Kinesis help efficiently extract data from different sources.
2. Transform – Once extracted, the data is cleaned, standardized, and enriched to ensure
consistency and quality. This step involves data validation, deduplication, aggregation,
filtering, and format conversion to make the data suitable for analysis. AWS services like
AWS Glue, AWS Lambda, and Amazon EMR (Apache Spark, Hadoop) are commonly used
for transformation.
3. Load – The processed data is then loaded into a destination system, such as Amazon Redshift
(data warehouse), Amazon S3 (data lake), or Amazon RDS (relational database) for storage
and analysis. The loading process ensures optimized performance, scalability, and
accessibility for business intelligence tools and analytics platforms like Amazon Athena and
Amazon QuickSight.
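
The three stages above can be sketched as a small PySpark job of the kind that runs on AWS Glue or Amazon EMR; the S3 paths and column names are assumptions.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders-etl").getOrCreate()

# Extract: read raw CSV files from an assumed S3 prefix.
raw = spark.read.option("header", True).csv("s3://example-raw-zone/orders/")

# Transform: deduplicate, fix types, and drop invalid rows.
orders = (
    raw.dropDuplicates(["order_id"])
       .withColumn("amount", F.col("amount").cast("double"))
       .withColumn("order_date", F.to_date("order_date"))
       .filter(F.col("amount") > 0)
)

# Load: write curated, partitioned Parquet to the analytics zone.
orders.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3://example-curated-zone/orders/"
)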

Difference between ETL and ELT:


This image compares ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) processes.
• ETL: Extracts structured data, transforms it into a suitable format, and loads it into structured storage for analysis.
• ELT: Extracts structured or unstructured data, loads it into a data lake in raw form, and then transforms it as needed for analytics.

ETL is typically used in traditional data warehouses, while ELT is preferred for big data and
cloud-based analytics.

6.2 Ingesting & preparing data

Data ingestion is the process of collecting and importing raw data from various sources into a
centralized storage system for further processing and analysis. This is a critical first step in any
data pipeline, ensuring that structured, semi-structured, or unstructured data is efficiently
gathered and made available for transformation and analytics.
Data can be ingested in two main ways: batch ingestion and streaming ingestion. Batch ingestion
involves periodically collecting and processing large volumes of data at scheduled intervals,
making it suitable for traditional databases and data warehouses. In contrast, streaming ingestion
enables real-time data flow, allowing businesses to process and analyze data as it arrives, which
is essential for applications like fraud detection, IoT monitoring, and real-time analytics.
AWS provides a range of services to support data ingestion, including Amazon Kinesis for real-
time streaming, AWS Glue for batch processing, AWS DataSync for transferring large datasets,
and Amazon S3 for scalable storage. Once ingested, the data is ready for transformation,
analysis, and integration into data lakes, warehouses, or machine learning models.
By implementing an efficient data ingestion strategy, organizations can ensure seamless data
flow, improve decision-making, and enhance business intelligence capabilities.

Preparing data:
Preparing data is a crucial step in the data pipeline that ensures raw data is cleaned, structured,
and transformed for analysis, reporting, or machine learning applications. The preparation
process typically includes data cleaning, which involves filtering out errors, correcting
inconsistencies, and dealing with incomplete records. Next, data transformation converts data
into the required format through aggregation, normalization, or encoding. Additionally, data
enrichment combines multiple datasets to provide more meaningful insights by adding relevant
contextual information.

CHAPTER 7
INGESTING BY BATCH OR BY STREAM

7.1 AWS CloudTrail:

AWS CloudTrail is a service that provides audit logging, governance, and compliance
monitoring by recording all API calls and user activities across AWS accounts. It captures details
such as who made the request, what actions were taken, when they occurred, and from where the
request originated. CloudTrail helps organizations track changes, detect unauthorized access,
troubleshoot issues, and ensure security best practices are followed.
CloudTrail logs events in Amazon S3, allowing users to analyze activity using Amazon Athena or integrate with Amazon CloudWatch for real-time monitoring and alerts. It also supports multi-region logging and event history tracking, making it easier to investigate security incidents and maintain regulatory compliance.
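
For example, recent management events can be queried programmatically with boto3 as sketched below; the event name filter and the 24-hour window are illustrative choices.

from datetime import datetime, timedelta

import boto3

cloudtrail = boto3.client("cloudtrail")

events = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "ConsoleLogin"}],
    StartTime=datetime.utcnow() - timedelta(days=1),
    EndTime=datetime.utcnow(),
    MaxResults=50,
)

# Each event records who did what, when, and from where.
for event in events["Events"]:
    print(event["EventTime"], event.get("Username"), event["EventName"])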

7.2 Ingesting by batch or Stream:


Data ingestion is the process of collecting and importing data from various sources into a
centralized system for processing and analysis. Organizations can choose between batch
ingestion and streaming ingestion based on their use case, data volume, and processing needs.
1. Batch Ingestion
Batch ingestion involves collecting and processing large chunks of data at scheduled intervals. It
is useful for scenarios where real-time processing is not required, such as traditional data
warehouses, financial reporting, and historical data analysis. Batch ingestion is typically more
cost-effective and easier to manage.

Examples of AWS Services for Batch Ingestion:

• AWS Glue – For ETL (Extract, Transform, Load) batch jobs.
• AWS DataSync – For transferring large datasets.
• Amazon S3 – For storing and managing batch data.

2. Streaming Ingestion
Streaming ingestion involves real-time data collection and processing, making it ideal for time-
sensitive applications such as IoT monitoring, fraud detection, and real-time analytics. Unlike
batch processing, streaming ingestion enables immediate insights by continuously ingesting and
analyzing data as it arrives.

Examples of AWS Services for Streaming Ingestion:

• Amazon Kinesis – For real-time streaming and analytics.
• AWS Lambda – For event-driven processing.
• Amazon Managed Streaming for Apache Kafka (MSK) – For distributed event streaming.
• Amazon DynamoDB Streams – For capturing real-time database changes.

Choosing between batch and stream ingestion depends on the use case, latency requirements, and
infrastructure cost. While batch ingestion is suited for large-scale historical processing,
streaming ingestion is essential for real-time, event-driven applications.
The decision between batch and streaming ingestion depends on:
• Latency requirements – If real-time processing is needed, use streaming.
• Data volume and frequency – High-frequency data sources (IoT, financial transactions) benefit from streaming, while periodic reports work well with batch processing.
• Cost considerations – Batch ingestion is generally more cost-effective, whereas streaming requires continuous processing and higher resource utilization.

Many modern data architectures use a hybrid approach, combining both batch and streaming
ingestion to balance performance, cost, and real-time analytics.
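
As a small example of event-driven stream processing, the sketch below shows an AWS Lambda handler consuming a batch of Kinesis records; the payload fields and the filtering rule are assumptions.

import base64
import json

def handler(event, context):
    """Lambda handler invoked with a batch of records from a Kinesis stream."""
    for record in event["Records"]:
        # Kinesis payloads arrive base64-encoded inside the Lambda event.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))

        # Hypothetical rule: flag high-value transactions for review.
        if payload.get("amount", 0) > 1000:
            print(f"Flagged transaction {payload.get('transaction_id')} for review")

    # With ReportBatchItemFailures enabled, an empty list means the whole batch succeeded.
    return {"batchItemFailures": []}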

CHAPTER 8
STORING & ORGANIZING DATA

8.1 Data warehouse:

A data warehouse is a centralized system designed for storing, managing, and analyzing large
volumes of structured and semi-structured data. It serves as a repository that consolidates data
from multiple sources, such as transactional databases, CRM systems, and log files, enabling
efficient querying and reporting for business intelligence (BI) and analytics.
Data warehouses are optimized for read-heavy workloads and complex queries, allowing
organizations to gain insights through historical analysis, trend forecasting, and decision-making.
Unlike traditional databases, which focus on real-time transactions, data warehouses are
structured for analytical processing using OLAP (Online Analytical Processing), supporting
aggregations, indexing, and multidimensional analysis.


A typical data warehouse architecture consists of multiple layers, including a data source layer
that collects raw data, an ETL (Extract, Transform, Load) process that cleans and organizes data,
a storage layer where structured data is housed, and a presentation layer that allows users to

access data through reporting and visualization tools. Data is usually modeled using star schema
or snowflake schema, where a fact table holds measurable business data and dimension tables
provide descriptive attributes for better analysis.
To enhance performance, data warehouses leverage columnar storage, indexing, and partitioning,
which allow faster query execution. AWS provides Amazon Redshift, a cloud-based data
warehousing solution that supports massively parallel processing (MPP), enabling businesses to
scale efficiently and analyze petabyte-scale datasets. Redshift integrates seamlessly with AWS
Glue for ETL, Amazon S3 for data lakes, and Amazon QuickSight for visualization, making it a
powerful tool for advanced analytics.
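
A minimal sketch of an analytical query against such a star schema, submitted through the Amazon Redshift Data API with boto3, is shown below; the cluster, database, user, and table names are assumptions.

import boto3

redshift_data = boto3.client("redshift-data")

response = redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",   # assumed cluster
    Database="dev",                          # assumed database
    DbUser="analyst",                        # assumed database user
    Sql="""
        SELECT d.year, d.month, SUM(f.sales_amount) AS monthly_sales
        FROM fact_sales f
        JOIN dim_date d ON f.date_key = d.date_key
        GROUP BY d.year, d.month
        ORDER BY d.year, d.month;
    """,
)
print("Submitted statement:", response["Id"])   # results are fetched asynchronously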

8.2 Data warehouse system Architecture


A data warehouse system architecture is designed to efficiently store, process, and analyze large
volumes of structured and semi-structured data. It consists of multiple layers that work together
to collect, transform, store, and present data for business intelligence and decision-making. The
architecture follows a multi-tiered approach, typically comprising the following key components:
1. Data Source Layer
This layer includes various data sources from which raw data is collected. These sources can be
operational databases, CRM systems, ERP systems, web logs, IoT devices, or third-party
applications. The data can be structured (e.g., relational databases), semi-structured (e.g., JSON,
XML), or unstructured (e.g., logs, multimedia files).
2. ETL (Extract, Transform, Load) Layer
The ETL process is responsible for extracting data from multiple sources, transforming it into a
structured format, and loading it into the data warehouse. The transformation step includes data
cleaning, normalization, aggregation, and deduplication to ensure consistency and accuracy.

3. Data Storage Layer


This is the core of the data warehouse, where processed data is stored in optimized formats for
efficient querying. The data is often organized using schemas such as Star Schema or Snowflake
Schema, which improve performance by structuring data into fact and dimension tables.

Columnar file formats like Apache Parquet and ORC, together with columnar engines such as Amazon Redshift, enhance query speed and data compression.

The Data Warehouse Architecture depicted in the diagram outlines the structured flow of data
from various sources to business intelligence tools. Initially, data is gathered from operational
systems and external sources into transaction databases, where it is temporarily stored.
The ETL (Extract, Transform, Load) process then cleans, transforms, and loads this data into the
data warehouse database, ensuring consistency and accuracy. Within the data warehouse, data
can be further segmented into data marts, which store specific subsets of data for different
business functions.

CHAPTER 9
PROCESSING BIG DATA

9.1 Apache Hadoop:

Processing Big Data Using Apache Hadoop


Apache Hadoop is a powerful open-source framework designed to process and analyze massive
datasets in a distributed and scalable manner. It enables businesses to handle big data efficiently,
which is characterized by high volume, velocity, and variety. Hadoop achieves this by leveraging
a distributed storage and processing model, making it an ideal choice for batch processing of
large-scale data.
The core components of Apache Hadoop include:
1. Hadoop Distributed File System (HDFS) – A scalable, fault-tolerant storage system that
distributes data across multiple nodes.
2. MapReduce – A parallel processing framework that breaks large tasks into smaller chunks
and processes them simultaneously.
3. Yet Another Resource Negotiator (YARN) – Manages cluster resources and job scheduling
to optimize performance.
4. Apache Hive & Pig – High-level tools for querying and processing big data efficiently.
Hadoop's distributed nature ensures fault tolerance, as data is replicated across nodes,
minimizing the risk of data loss. It is widely integrated with cloud platforms like AWS (Amazon
EMR), Azure HDInsight, and Google Cloud Dataproc, making it even more accessible for
organizations. By leveraging Hadoop, businesses can perform complex data analytics, log
processing, recommendation system modeling, fraud detection, and more with high efficiency
and scalability.


9.2 Apache Spark:

Apache Spark is an open-source distributed computing system designed for fast data processing
and analytics. It provides a high-performance alternative to MapReduce by utilizing in-memory
computing, making it significantly faster for iterative computations and real-time analytics. Spark
is widely used for Big Data processing, machine learning, graph analytics, and stream processing.
Spark supports multiple programming languages, including Scala, Python (PySpark), Java, and R,
making it versatile for developers. It runs efficiently on Hadoop clusters, Kubernetes, in standalone mode, or on cloud services like Amazon EMR, Databricks, and Google Cloud Dataproc.
With its high speed, scalability, and ease of integration, Apache Spark is a preferred choice for
real-time analytics, ETL workflows, AI/ML applications, and Big Data transformations across
industries.
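
The sketch below illustrates Spark's in-memory model with PySpark: a dataset is cached once and then reused by two separate aggregations (the S3 path and column names are assumptions).

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("clickstream-analytics").getOrCreate()

# Assumed Parquet dataset of page-view events.
events = spark.read.parquet("s3://example-curated-zone/clickstream/")

# cache() keeps the dataset in executor memory, so both aggregations below
# reuse it instead of re-reading from storage.
events.cache()

views_per_page = events.groupBy("page").count()
daily_users = events.groupBy("event_date").agg(
    F.countDistinct("user_id").alias("unique_users")
)

views_per_page.show(10)
daily_users.show(10)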

The diagram illustrates the Apache Spark cluster architecture, which follows a Master-Slave
(Driver-Worker) model for distributed data processing. The Master Node contains the Driver
Program, which is responsible for managing the execution of tasks. Within the driver program,
the Spark Context plays a crucial role in coordinating the overall execution flow by
communicating with the cluster manager. The cluster manager can be YARN, Mesos, or a
Standalone Manager, and it is responsible for allocating resources to different worker nodes.

This architecture enables Spark to process Big Data workloads efficiently by distributing tasks
across multiple worker nodes. By leveraging in-memory computation and parallel execution,
Apache Spark provides a high-performance framework for handling large-scale data analytics
and machine learning tasks.

CHAPTER 10
PROCESSING DATA FOR ML

Data Preprocessing in Machine Learning:

Data preprocessing is a fundamental step in machine learning that involves transforming raw
data into a clean, structured, and usable format for model training. Since real-world data is often
incomplete, inconsistent, and noisy, preprocessing ensures that machine learning models can
effectively learn patterns and make accurate predictions.
The preprocessing pipeline includes several essential tasks. Data cleaning handles missing
values, removes duplicates, and corrects errors, ensuring data consistency. Feature engineering
involves selecting or creating meaningful features to improve model performance. Normalization
and standardization scale numerical values to maintain consistency across different data ranges.
Encoding categorical variables converts non-numeric data into machine-readable formats, while outlier handling prevents extreme values from negatively impacting model performance.
Finally, data splitting divides the dataset into training, validation, and test sets to assess model
generalization.

Data preprocessing is a crucial step in the machine learning pipeline that enhances data quality
and ensures models can learn effectively. Raw data collected from various sources is often
incomplete, inconsistent, or noisy, which can lead to poor model performance. Preprocessing
involves a series of steps to clean, transform, and structure the data for optimal analysis and
learning.

The first step, data cleaning, deals with handling missing values, removing duplicates, and
correcting inconsistencies. Missing values can be addressed by techniques such as imputation
(filling in with mean, median, or mode) or removing incomplete records. Noise reduction
eliminates errors and irrelevant information, often using filtering techniques.
Next, data transformation ensures data is in a suitable format for machine learning models. This
includes feature scaling, where numerical values are normalized (scaling to a fixed range) or
standardized (converting to a normal distribution). Encoding categorical variables is another
transformation step, where non-numeric categories are converted into numerical representations
(e.g., one-hot encoding or label encoding).

Feature engineering is another critical step, where relevant features are created or selected to
improve model efficiency. Feature selection techniques, such as correlation analysis or
dimensionality reduction (PCA), help eliminate redundant or irrelevant features.
Finally, data splitting ensures that the dataset is properly divided into training, validation, and
test sets to evaluate model performance accurately. The dataset is usually split in a 70-20-10 or
80-10-10 ratio, ensuring models are tested on unseen data before deployment.
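
A compact sketch of these preprocessing steps with scikit-learn is shown below; the dataset, column names, and split ratio are hypothetical.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical dataset with missing values and mixed column types.
df = pd.DataFrame({
    "age": [34, None, 51, 29, 46],
    "income": [42000, 58000, None, 39000, 61000],
    "city": ["Delhi", "Mumbai", "Delhi", None, "Chennai"],
    "churned": [0, 1, 0, 1, 0],
})
X, y = df.drop(columns=["churned"]), df["churned"]

preprocess = ColumnTransformer([
    # Numeric columns: impute with the median, then standardize.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["age", "income"]),
    # Categorical columns: impute the most frequent value, then one-hot encode.
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), ["city"]),
])

# Split first so the test set stays unseen while the transformers are fitted.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)
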
By performing comprehensive data preprocessing, machine learning models become more
robust, accurate, and capable of handling real-world scenarios effectively.

CHAPTER 11
ANALYZING & VISUALIZING DATA

11.1 Data visualizations on AWS

Data visualization on AWS enables businesses to analyze and interpret large datasets using
cloud-based tools and services. AWS provides a range of services to facilitate data visualization,
helping users derive insights from raw data efficiently. These services support interactive
dashboards, real-time monitoring, and reporting functionalities for business intelligence (BI) and
data analytics.
One of the primary tools for data visualization on AWS is Amazon QuickSight, a scalable and
fully managed BI service. QuickSight allows users to create interactive dashboards, perform ad-
hoc analysis, and generate reports using various data sources, including Amazon S3, Amazon
Redshift, and AWS Glue. It also integrates with machine learning capabilities to provide
advanced data insights.
For real-time data visualization, Amazon CloudWatch helps monitor AWS resources and
applications, providing metrics, logs, and automated alerts. Similarly, Amazon Managed Grafana offers the open-source Grafana visualization tool as a managed service, integrating with AWS services like Amazon Timestream and AWS IoT Analytics for real-time operational monitoring.
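
For example, the metric data behind such a monitoring dashboard can be pulled programmatically with boto3 as sketched below; the instance ID and time window are assumptions.

from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # assumed instance
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,                  # 5-minute data points
    Statistics=["Average"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2))
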
Organizations dealing with big data can use Amazon Redshift for data warehousing and
analytical querying, paired with visualization tools such as Tableau, Power BI, or Looker. AWS
Glue assists in preparing and transforming data before visualization, ensuring clean and
structured datasets.

11.2 Amazon QuickSight


Amazon QuickSight is a fully managed, cloud-powered business intelligence (BI) service provided by AWS. It enables organizations to visualize their data, create interactive dashboards, and gain actionable insights from various data sources. QuickSight is designed to be fast, scalable, and cost-effective, making it an ideal solution for businesses of all sizes.
With Amazon QuickSight, users can connect to multiple data sources, including Amazon Redshift, Amazon S3, Amazon RDS, AWS IoT, and third-party applications like Salesforce and MySQL databases.

It supports machine learning-powered insights, allowing users to detect anomalies,
forecast trends, and automate data analysis without requiring data science expertise.
One of QuickSight's key features is SPICE (Super-fast, Parallel, In-memory Calculation Engine), which accelerates data retrieval and enhances performance by preloading data into memory for quick access. This eliminates the need for continuous database queries, reducing latency and improving dashboard responsiveness.
QuickSight also integrates with AWS services such as CloudTrail, Athena, and Glue, providing a seamless data analytics ecosystem. It supports embedded analytics, allowing businesses to integrate QuickSight dashboards directly into their applications or websites.
By using Amazon QuickSight, businesses can create interactive reports and real-time visualizations and share insights across teams without the complexities of traditional BI tools. Its serverless nature ensures automatic scaling and pay-per-session pricing, making it a cost-efficient solution for organizations looking to leverage data-driven decision-making.

CHAPTER 12
CONCLUSION

The AWS Data Engineering Virtual Internship provides a comprehensive learning experience for
individuals looking to build expertise in cloud-based data solutions. Throughout the program,
interns gain hands-on experience with key AWS services such as Amazon S3, AWS Glue,
Amazon Redshift, AWS Lambda, and Amazon Kinesis, enabling them to design and implement
efficient data pipelines. The internship covers critical aspects of data ingestion, transformation,
storage, and visualization, ensuring a strong foundation in cloud data engineering.
By working on real-world scenarios and projects, interns develop skills in ETL processes, big
data management, and analytics, preparing them for careers in data engineering and cloud
computing. The exposure to AWS best practices, security measures, and automation tools further
enhances their ability to build scalable and reliable data architectures.
Overall, the internship serves as a stepping stone for aspiring data engineers, equipping them
with practical knowledge and industry-relevant experience. It enables participants to understand
the power of AWS services in handling large-scale data workloads and helps them build the
expertise required for modern data-driven enterprises.

