
AWS Data Engineering Virtual Internship

Submitted in partial fulfilment of the requirements of the CSBS

Summer Internship (CB - 451)

IV/IV B. Tech CSBS (VII Semester)

Submitted by

Jalla Geetha Harshitha(Y21CB017)

MAY 2024
R.V.R. & J.C. College of Engineering (Autonomous)
(NAAC A+ Grade) (Approved by A.I.C.T.E.) (Affiliated to
Acharya Nagarjuna University)
Chandramoulipuram : : Chowdavaram
Guntur – 522019
R. V. R. & J. C. COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND BUSINESS SYSTEMS

CERTIFICATE

This is to certify that this internship report, "AWS Data Engineering Virtual Internship", is the
bonafide work of J. Geetha Harshitha (Y21CB017), who carried out the work under my supervision and
submitted it in partial fulfillment of the requirements for the award of the Summer Internship (CB-451)
during the year 2023 - 2024.

K. Subramanyam Dr. A. Sri Nagesh


Internship Incharge Prof. & HoD, CSBS
ACKNOWLEDGEMENT

I would like to express my sincere gratitude to the dignitaries who have been with me on the
journey of my Summer Internship, "AWS Data Engineering Virtual Internship."

First and foremost, I extend my heartfelt thanks to Dr. Kolla Srinivas, Principal of R. V. R.
& J. C. College of Engineering, Guntur, for providing me with such a supportive
environment in which to undertake this internship.

I am deeply grateful to Dr. A. Sri Nagesh, Head of the Department of Computer Science
and Business Systems, for paving a path for me and assisting me in meeting the requirements
needed to complete this internship.

I extend my gratitude to the internship incharge, K. Subramanyam, for her excellent guidance and
feedback throughout the internship. Her constant support and constructive criticism have helped
me in completing the internship.

I would also like to express my sincere thanks to my friends and family for their moral
support throughout this journey.

J.Geetha Harshitha
(Y21CB017)

I
Internship Details

Title of Internship: Data Engineering Virtual Internship

Name of the student: Jalla Geetha Harshitha

Year and Semester: IV - I

Name of organization from where internship undergone: AICTE - AWS Academy Virtual Internship

Duration of Internship: 10 Weeks

From date and to date: July - September 2024

II
SUMMER INTERNSHIP CERTIFICATE
TABLE OF CONTENTS
Page no

List of Figures V
ABSTRACT VI

1. Overview of AWS Academy Data Engineering 1-2

2. Data Driven Organizations 3

2.1 Data Driven Decisions 3

2.2 The data pipeline - infrastructure 3

3. Elements of data, design principles & patterns for data pipelines 4-5

3.1 The five Vs of data - volume, velocity, variety, veracity & value 4

3.2 Variety Data Types, Modern Data Architecture pipeline 4

4. Securing & Scaling data pipeline 6

4.1 Scaling: An overview 6

4.2 Creating a Scalable Infrastructure 6

5. Ingesting & Preparing data 7

5.1 ETL & ELT Comparison 7

6. Ingesting by batch or by stream 8

6.1 Comparing Batch & Stream Ingestion 8

7. Storing & organizing data 9

7.1 Storage in the modern data architecture 9

8. Processing big data & data for ML 10-12

8.1 Big Data processing Concepts 10

9. Analyzing and visualizing data 13

10. Automating the pipeline 14

11. Conclusion 15

12. References 16
LIST OF FIGURES
Figure Description Page no.
Fig- 1.1 Code Generation 1
Fig- 1.2 Benefits of Amazon CodeWhisperer 2
Fig- 2.1 Data Pipeline 3
Fig- 3.1 Data Characteristics 4
Fig- 4.1 Types of Scaling 6
Fig- 4.2 Template Structure 6
Fig- 4.3 AWS CloudFormation 6
Fig- 5.1 ETL & ELT Comparison 7
Fig- 6.1 Batch & Streaming Ingestion 8
Fig- 6.2 Built Ingestion Tools 8
Fig-7.1 Storage in Modern Architecture 9
Fig- 8.1 Data Processing 10
Fig-8.2 Apache Hadoop 10
Fig- 8.3 ML models 11
Fig- 8.4 ML life cycle 11
Fig- 8.5 ML framing 12
Fig- 8.6 Collecting data 12
Fig- 9.1 Factors & needs 13
Fig- 9.2 QuickSight Example 13
Fig- 10.1 Automating Infrastructure 14
Fig- 10.2 Step function 14

V
ABSTRACT

Amazon Web Services (AWS) is a versatile cloud platform offering a broad range of
services tailored for data engineering, enabling organizations to efficiently manage, process, and
analyze large datasets. AWS is designed to accommodate businesses of all sizes, providing
scalable and flexible solutions to meet the complex needs of modern data engineering. Core
services such as Amazon S3, Amazon Redshift, and Amazon EMR form the foundation of AWS's
data engineering capabilities. Amazon S3 offers scalable object storage with high durability and
availability, making it ideal for storing data at various stages of the pipeline, while its integration
with other AWS services ensures seamless data flow. Amazon Redshift provides powerful data
warehousing, allowing organizations to run complex queries on large volumes of structured and
semi-structured data, making it key for business intelligence and analytics. Amazon EMR simplifies
running large-scale distributed data processing frameworks like Hadoop and Spark, enabling
efficient and cost-effective data transformation. AWS also offers advanced tools like AWS Glue,
which automates ETL (Extract, Transform, Load) processes, and Amazon Kinesis, which enables
real-time data streaming for immediate insights. AWS Lambda supports serverless computing,
allowing code execution without managing infrastructure, simplifying pipeline orchestration. The
global AWS infrastructure ensures high availability, low latency, and fault tolerance, essential for
reliable data processing. The pay-as-you-go pricing model allows businesses to scale
cost-effectively without large upfront investments. AWS's robust management and monitoring tools,
like Amazon CloudWatch and AWS CloudTrail, enhance the administration of data engineering
workloads by providing detailed monitoring and enforcing security measures. With strong security
features, including encryption and compliance certifications, AWS ensures the protection of
sensitive data, making it a powerful, flexible, and secure platform for building and managing
sophisticated data pipelines and analytics solutions.

VI
COURSE MODULES

1.OVERVIEW OF AWS ACADEMY DATA ENGINEERING


Course objectives:
This course prepares you to do the following:
• Summarize the role and value of data science in a data-driven organization.
• Recognize how the elements of data influence decisions about the infrastructure of a
data pipeline.
• Illustrate a data pipeline by using AWS services to meet a generalized use case.
• Identify the risks and approaches to secure and govern data at each step and each
transition of the data pipeline.
• Identify scaling considerations and best practices for building pipelines that
handle large-scale datasets.
• Design and build a data collection process while considering constraints such as
scalability, cost, fault tolerance, and latency.

Fig- 1.1: Code Generation

Open Code Reference Log


CodeWhisperer learns from open-source projects, and the code it suggests might occasionally
resemble code samples from the training data. With the reference log, you can view references
to code suggestions that are similar to the training data. When such occurrences happen,
CodeWhisperer notifies you and provides repository and licensing information. Use this
information to decide whether to use the code in your project and to properly
attribute the source code as desired.

1
Fig- 1.2: Benefits of Amazon CodeWhisperer

CodeWhisperer code generation offers many benefits for software development
organizations. It accelerates application development for faster delivery of software solutions.
By automating repetitive tasks, it optimizes the use of developer time, so developers can focus
on more critical aspects of the project. Additionally, code generation helps mitigate security
vulnerabilities, safeguarding the integrity of the codebase. CodeWhisperer also protects open-
source intellectual property by providing the open-source reference tracker. CodeWhisperer
enhances code quality and reliability, leading to robust and efficient applications. And it
supports an efficient response to evolving software threats, keeping the codebase up to date
with the latest security practices. CodeWhisperer has the potential to increase development
speed, security, and the quality of software.

2

2. DATA DRIVEN ORGANIZATIONS


2.1 Data Driven Decisions:

How do organizations decide...


• Which of these customer transactions should be flagged as fraud?
• Which webpage design leads to the most completed sales?
• Which patients are most likely to have a relapse?
• Which type of online activity represents a security issue?
• When is the optimum time to harvest this year's crop?
2.2 The data pipeline - infrastructure for data-driven decisions:

Fig- 2.1: Data Pipeline

Another key characteristic of deriving insights by using your data pipeline is that the process
will almost always be iterative. You have a hypothesis about what you expect to find in the
data, and you need to experiment and see where it takes you. You might develop your
hypothesis by using BI tools to do initial discovery and analysis of data that has already been
collected. You might iterate within a pipeline segment, or you might iterate across the entire
pipeline. For example, in this illustration, the initial iteration (number 1) yielded a result that
wasn't as defined as was desired. Therefore, the data scientist refined the model and
reprocessed the data to get a better result (number 2). After reviewing those results, they
determined that additional data could improve the detail available in their result, so an
additional data source was tapped and ingested through the pipeline to produce the desired
result (number 3). A pipeline often has iterations of storage and processing. For example, after
the external data is ingested into pipeline storage, iterative processing transforms the data into
different levels of refinement for different needs.

3

3. THE ELEMENTS OF DATA, DESIGN PRINCIPLES & PATTERNS FOR DATA PIPELINES

3.1 The five Vs of data - volume, velocity, variety, veracity & value:

Fig- 3.1: Data Characteristics

The evolution of data architectures:
So, which of these data stores or data architectures is the best one for your data pipeline?
The reality is that a modern architecture might include all of these elements. The key to
a modern data architecture is to apply the three-pronged strategy that you learned about
earlier. Modernize the technology that you are using. Unify your data sources to create
a single source of truth that can be accessed and used across the organization. And
innovate to get higher value analysis from the data that you have.

Variety data types, modern data architecture on AWS:

The architecture illustrates the following other AWS purpose-built services that integrate
with Amazon S3 and map to each component that was described on the previous slide:
• Amazon Redshift is a fully managed data warehouse service.
• Amazon OpenSearch Service is a purpose-built data store and search engine that is
optimized for real-time analytics, including log analytics.
• Amazon EMR provides big data processing and simplifies some of the most complex elements
of setting up big data processing.
• Amazon Aurora provides a relational database engine that was built for the cloud.
• Amazon DynamoDB is a fully managed nonrelational database that is designed to run high-
performance applications.
• Amazon SageMaker is an AI/ML service that democratizes access to ML processes.

4

3.2 Modern data architecture pipeline: Ingestion and storage:


Data being ingested into the Amazon S3 data lake arrives at the landing zone, where it
is first cleaned and stored in the raw zone for permanent storage. Because data that is
destined for the data warehouse needs to be highly trusted and conformed to a schema,
the data needs to be processed further. Additional transformations would include
applying the schema and partitioning (structuring), as well as other transformations that
are required to make the data conform to requirements that are established for the trusted
zone. Finally, the processing layer prepares the data for the curated zone by modeling
and augmenting it to be joined with other datasets (enrichment) and then stores the
transformed, validated data in the curated layer. Datasets from the curated layer are
ready to be ingested into the data warehouse to make them available for low-latency
access or complex SQL querying.
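
As a rough illustration of this processing step, the following PySpark sketch reads raw JSON from a hypothetical raw zone, applies a schema, and writes partitioned Parquet to a trusted zone. The bucket paths, column names, and schema are assumptions made for the example, not part of the course material.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("raw-to-trusted").getOrCreate()

# Apply an explicit schema while reading the raw JSON (structuring).
schema = StructType([
    StructField("customer_id", StringType()),
    StructField("order_total", DoubleType()),
    StructField("order_date", StringType()),   # e.g. "2024-05-01"
])
raw = spark.read.schema(schema).json("s3://example-datalake/raw/orders/")

# Enforce basic trust requirements, then write partitioned Parquet to the trusted zone.
(raw.dropna(subset=["customer_id"])
    .write.mode("overwrite")
    .partitionBy("order_date")
    .parquet("s3://example-datalake/trusted/orders/"))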

Streaming analytics pipeline:

Producers ingest records onto the stream. Producers are integrations that collect data from
a source and load it onto the stream. Consumers process records. Consumers read data from
the stream and perform their own processing on it. The stream itself provides a temporary
but durable storage layer for the streaming solution. In the pipeline that is depicted in this
slide, Amazon CloudWatch Events is the producer that puts CloudWatch Events event data
onto the stream. Kinesis Data Streams provides the storage. The data is then available to
multiple consumers.
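
To make the producer role concrete, here is a minimal boto3 sketch that puts a record onto a Kinesis data stream. The stream name and event payload are illustrative assumptions, and the stream is assumed to already exist.

import json
import boto3

kinesis = boto3.client("kinesis")

# An illustrative event payload; in the pipeline above, CloudWatch Events would
# produce the records instead of custom code.
event = {"source": "example-app", "detail": {"status": "ok"}}

kinesis.put_record(
    StreamName="example-events",                # assumed, pre-existing stream
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["source"],               # determines which shard stores the record
)

Consumers would then read records from the same stream and perform their own processing.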

5

4. SECURING & SCALING DATA PIPELINE
4.1 Scaling: An overview:

Fig- 4.1: Types of Scaling

4.2 Creating a scalable infrastructure:

Fig- 4.2: Template Structure

Fig- 4.3: AWS Cloud Formation


AWS CloudFormation is a fully managed service that provides a common language
for you to describe and provision all of the infrastructure resources in your cloud
environment. CloudFormation creates, updates, and deletes the resources for your
applications in environments called stacks. A stack is a collection of AWS resources
that are managed as a single unit.
CloudFormation is all about automated resource provisioning: it simplifies the
task of repeatedly and predictably creating groups of related resources that power
your applications. Resources are described in text files (templates) by using JSON or YAML
format.
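
As a small, hedged example of describing a resource in a template and provisioning it as a stack, the following Python sketch submits an inline JSON template through boto3. The bucket and stack names are placeholders.

import json
import boto3

# A minimal template describing a single resource (an S3 bucket).
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "RawDataBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {"BucketName": "example-raw-data-bucket"},
        }
    },
}

cfn = boto3.client("cloudformation")
cfn.create_stack(
    StackName="example-data-pipeline-stack",
    TemplateBody=json.dumps(template),
)
# The stack now manages the bucket as a single unit; updating or deleting the
# stack updates or deletes the resources it created.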

6

5. INGESTING & PREPARING DATA


5.1 ETL and ELT comparison:

Fig- 5.1: ETL & ELT comparison

Data wrangling:

Transforming large amounts of unstructured or structured raw data from multiple
sources with different schemas into a meaningful set of data that has value for
downstream processes or users.

Data Structuring:

For the scenario that was described previously, the structuring step includes exporting
a .json file from the customer support ticket system, loading the .json file into Excel,
and letting Excel parse the file. For the mapping step for the supp2 data, the data
engineer would modify the cust num field to match the customer id field in the data
warehouse.

For this example, you would perform additional data wrangling steps before compressing
the file for upload to the S3 bucket.

Data Cleaning:

It includes:

• Remove unwanted data.
• Fill in missing data values.
• Validate or modify data types.
• Fix outliers.
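
The following pandas sketch mirrors the structuring, mapping, and cleaning steps above for the support-ticket example. The file names, column names, and fill values are assumptions made for illustration; the actual ticket schema is not specified here.

import pandas as pd

tickets = pd.read_json("supp2_tickets.json")                      # structuring: parse the JSON export

tickets = tickets.rename(columns={"cust_num": "customer_id"})     # mapping to the warehouse field name

tickets = tickets.drop_duplicates()                               # remove unwanted data
tickets["priority"] = tickets["priority"].fillna("normal")        # fill in missing values (assumed column)
tickets["customer_id"] = tickets["customer_id"].astype(str)       # validate/modify data types

tickets.to_csv("supp2_tickets_clean.csv.gz", index=False, compression="gzip")

The compressed output file would then be uploaded to the S3 bucket.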

7

6. INGESTING BY BATCH OR BY STREAM
6.1 Comparing batch and stream ingestion:

Fig- 6.1: Batch & Streaming Ingestion


To generalize the characteristics of batch processing, batch ingestion involves running
batch jobs that query a source, move the resulting dataset or datasets to durable storage
in the pipeline, and then perform whatever transformations are required for the use case.
As noted in the Ingesting and Preparing Data module, this could be just cleaning and
minimally formatting data to put it into the lake. Or, it could be more complex
enrichment, augmentation, and processing to support complex querying or big data and
machine learning (ML) applications. Batch processing might be started on demand, run
on a schedule, or initiated by an event. Traditional extract, transform, and load (ETL)
uses batch processing, but extract, load, and transform (ELT) processing might also be
done by batch.
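
To make the batch pattern concrete, here is a minimal sketch of a batch job that queries a source, writes the result set to a file, and moves it to durable storage in the pipeline. The SQLite source, table, and bucket name are stand-ins chosen only for illustration.

import csv
import sqlite3
import boto3

# Query the source system (a local SQLite database stands in for the source here).
conn = sqlite3.connect("source.db")
rows = conn.execute("SELECT order_id, customer_id, total FROM orders").fetchall()

# Write the result set to a file.
with open("orders_batch.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["order_id", "customer_id", "total"])
    writer.writerows(rows)

# Move the batch to durable storage in the pipeline.
boto3.client("s3").upload_file("orders_batch.csv", "example-datalake", "raw/orders/orders_batch.csv")

A scheduler, such as an EventBridge rule, could run a job like this on a schedule or in response to an event.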
Purpose-Built Ingestion Tools:

Fig- 6.2: Built Ingestion Tools

Use Amazon AppFlow to ingest data from a software as a service (SaaS)
application. You can do the following with Amazon AppFlow:
• Create a connector that reads from a SaaS source and includes filters.
• Map fields in each source object to fields in the destination and perform transformations.
• Perform validation on records to be transferred.
• Securely transfer to Amazon S3 or Amazon Redshift. You can trigger an ingestion on demand, on event,
or on a schedule.
An example use case for Amazon AppFlow is to ingest customer support ticket data from the Zendesk
SaaS product.
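
Assuming a flow named zendesk-tickets-to-s3 has already been configured in AppFlow (source connector, field mapping, and S3 destination), an on-demand run could be triggered from Python roughly as follows; the flow name is a placeholder.

import boto3

appflow = boto3.client("appflow")

# Start an on-demand execution of a pre-configured flow.
response = appflow.start_flow(flowName="zendesk-tickets-to-s3")
print(response)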

8

7. STORING AND ORGANIZING DATA


7.1 Storage in the modern data architecture:

Fig- 7.1: Storage in modern Architecture

Data in cloud object storage is handled as objects. Each object is assigned a key,
which is a unique identifier. When the key is paired with metadata that is attached
to the objects, other AWS services can use the information to unlock a multitude
of capabilities. Thanks to economies of scale, cloud object storage comes at a
lower cost than traditional storage.
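
A short sketch of storing an object with a key and user-defined metadata follows; the bucket, key, and metadata values are illustrative.

import boto3

s3 = boto3.client("s3")

with open("orders.csv", "rb") as body:
    s3.put_object(
        Bucket="example-datalake",
        Key="raw/orders/2024/05/orders.csv",       # the key uniquely identifies the object
        Body=body,
        Metadata={                                 # metadata that other services can read
            "source-system": "orders-db",
            "ingest-mode": "batch",
        },
    )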
Data Warehouse Storage:
• Provide a centralized repository
• Store structured and semi-structured data
• Store data in one of two ways:
  • Frequently accessed data in fast storage
  • Infrequently accessed data in cheap storage
• Might contain multiple databases that are organized into tables and columns
• Separate analytics processing from transactional databases
• Example: Amazon Redshift
Purpose-Built Databases:
• ETL pipelines transform data in buffered memory prior to loading data into a data lake
or data warehouse for storage.
• ELT pipelines extract and load data into a data lake or data warehouse for storage
without transformation.

Here are a few key points to summarize this section.
Storage plays an integral part in ELT and ETL pipelines. Data often moves in and
out of storage numerous times, based on pipeline type and workload type.
ELT pipelines extract and load data into data lake or data warehouse storage
without transformation. The transformation of the data is part of the target
system's workload.
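
As a hedged sketch of the ELT pattern, the following Python code issues a COPY statement through the Redshift Data API so that data is loaded without prior transformation and later transformed inside the warehouse (the target system's workload). The cluster, database, table, S3 path, and IAM role are placeholders.

import boto3

rsd = boto3.client("redshift-data")
rsd.execute_statement(
    ClusterIdentifier="example-cluster",
    Database="analytics",
    DbUser="etl_user",
    Sql="""
        COPY staging.orders
        FROM 's3://example-datalake/raw/orders/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy-role'
        FORMAT AS CSV IGNOREHEADER 1;
    """,
)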
Securing Storage:

Security for a data warehouse in Amazon Redshift


• Amazon Redshift database security is distinct from the security of the service itself.
• Amazon Redshift provides additional features to manage database security.
• Due to third-party auditing, Amazon Redshift can help to support applications that are
required to meet international compliance standards.

9

8. PROCESSING BIG DATA & DATA FOR ML

8.1 Big Data processing Concepts:

Fig- 8.1: Data Processing


Apache Hadoop:

Fig- 8.2: Apache Hadoop

Apache Spark:
Apache Spark characteristics:
• Is an open-source, distributed processing framework
• Uses in-memory caching and optimized query processing
• Supports code reuse across multiple workloads
• Clusters consist of leader and worker nodes
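
A small PySpark sketch of these characteristics: a dataset is cached in memory once and reused across two different workloads. The S3 path and column names are assumptions for the example.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-caching-demo").getOrCreate()

# Cache the dataset in memory so repeated workloads do not reread it from S3.
events = spark.read.parquet("s3://example-datalake/trusted/events/").cache()

events.groupBy("event_type").count().show()                                   # workload 1
events.agg(F.countDistinct("customer_id").alias("unique_customers")).show()   # workload 2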

Amazon EMR Characteristics:


• Managed cluster platform
• Big data solution for petabyte-scale data processing, interactive analytics, and machine learning

10

• Processes data for analytics and BI workloads using big data frameworks
• Transforms and moves large amounts of data into and out of AWS data stores
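
A hedged boto3 sketch of launching a transient EMR cluster that runs one Spark step and then terminates. The instance types and counts, release label, and script location are assumptions, and the default EMR service roles are assumed to exist in the account.

import boto3

emr = boto3.client("emr")
emr.run_job_flow(
    Name="example-batch-processing",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,   # terminate when the step finishes
    },
    Steps=[{
        "Name": "transform-orders",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-datalake/scripts/transform_orders.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)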
ML Concepts:

Fig- 8.3: ML models

ML Life Cycle:

Fig- 8.4: ML life cycle

Framing the ML problem to meet Business Goals:


Working backwards from the business problem to be solved

11

• What is the business problem?
• What pain is it causing?
• Why does this problem need to be resolved?
• What will happen if you don't solve this problem?
• How will you measure success?

Fig- 8.5: ML Framing

Collecting Data:

Fig-8.6: Collecting Data

12

9. ANALYZING & VISUALIZING DATA

9.1 Factors that influence tool selection:

Fig- 9.1: Factors & needs

Data characteristics:
• How much data is there?
• At what speed and volume does it arrive?
• How frequently is it updated?
• How quickly is it processed?
• What type of data is it?

9.2 Comparing AWS tools and Services:


Data from multiple sources is put in Amazon S3, where Athena can be used
for one-time queries. Amazon EMR aggregates the data and stores the aggregates in S3. Athena
can be used to query the aggregated datasets. From S3, the data can be used in Amazon Redshift,
where QuickSight can access the data to create visualizations.

Fig- 9.2: QuickSight Example
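
The one-time-query path in this example can be sketched with boto3 as follows; the database, table, and results bucket are placeholders.

import boto3

athena = boto3.client("athena")

athena.start_query_execution(
    QueryString="SELECT event_type, COUNT(*) AS events "
                "FROM analytics.raw_events GROUP BY event_type",
    QueryExecutionContext={"Database": "analytics"},                        # assumed database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"}, # query results bucket
)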


13

10. AUTOMATING THE PIPELINE

10.1 Automating Infrastructure deployment:

Fig- 10.1: Automating Infrastructure

If you build infrastructure with code, you gain the benefits of repeatability and
reusability while you build your environments. In the example shown, a single template
is used to deploy Network Load Balancers and Auto Scaling groups that contain
Amazon Elastic Compute Cloud (Amazon EC2) instances. Network Load Balancers
distribute traffic evenly across targets.

CI/CD:
CI/CD can be pictured as a pipeline, where new code is submitted on one end, tested
over a series of stages (source, build, test, staging, and production), and then
published as production-ready code.

10.2 Automating with Step Functions:

Fig- 10.2: Step Function

• With Step Functions, you can use visual workflows to coordinate the
components of distributed applications and microservices.
• You define a workflow, which is also referred to as a state machine, as a series
of steps and transitions between each step.
• Step Functions is integrated with Athena to facilitate building workflows that
include Athena queries and data processing operations.
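
As a hedged sketch of such a workflow, the following Python code registers a one-state state machine whose task uses the Step Functions integration with Athena to run a query. The role ARN, workgroup, and query are placeholders.

import json
import boto3

# Amazon States Language definition, expressed as a Python dict.
definition = {
    "StartAt": "RunAthenaQuery",
    "States": {
        "RunAthenaQuery": {
            "Type": "Task",
            "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
            "Parameters": {
                "QueryString": "SELECT COUNT(*) FROM analytics.raw_events",
                "WorkGroup": "primary",
            },
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="example-athena-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/example-stepfunctions-role",  # placeholder role
)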

14

11. CONCLUSION

Data engineering is a critical component in the modern data landscape, playing a crucial
role in the success of data-driven decision-making and analytics. As we draw
conclusions about data engineering, several key points come to the forefront:

Foundation for Data-Driven Insights:


Data engineering serves as the foundation for extracting, transforming, and loading
(ETL) data from diverse sources into a format suitable for analysis. This process
is essential for generating meaningful insights and facilitating informed
decision-making within organizations.

Data Quality and Integrity:


Maintaining data quality and integrity is paramount in data engineering. Data engineers
are responsible for cleaning, validating, and ensuring the accuracy of data, contributing
to the reliability of downstream analyses and business processes.

Scalability and Performance:


With the increasing volume, velocity, and variety of data, data engineering solutions
must be scalable and performant. Scalability ensures that systems can handle growing
amounts of data, while performance optimization ensures timely processing and
availability of data for analytics.

Integration of Diverse Data Sources:


Data engineering enables the integration of data from various sources, whether
structured or unstructured, providing a unified view of information. This integration is
crucial for a comprehensive understanding of business operations and customer
behavior.

15

12. REFERENCES

1. https://docs.aws.amazon.com/prescriptive-guidance/latest/modern-data-centric-use-cases/data-engineering-principles.html
2. https://medium.com/@polystat.io/data-engineering-complete-reference-guide-from-a-z-2019-852c308b15ed
3. https://aws.amazon.com/big-data/datalakes-and-analytics/
4. https://www.preplaced.in/blog/an-actionable-guide-to-aws-glue-for-data-engineers
5. https://awsacademy.instructure.com/courses/68699/modules/items/6115658
6. https://aws.amazon.com/blogs/big-data/aws-serverless-data-analytics-pipeline-reference-architecture/
7. https://dev.to/tkeyo/data-engineering-pipeline-with-aws-step-functions-codebuild-and-dagster-5290

16
