AWS Data Engineering Virtual Internship
Submitted by
J. Geetha Harshitha (Y21CB017)
MAY 2024
R.V.R. & J.C. College of Engineering (Autonomous)
(NAAC A+ Grade) (Approved by A.I.C.T.E.) (Affiliated to
Acharya Nagarjuna University)
Chandramoulipuram :: Chowdavaram
Guntur – 522019
R. V. R. & J. C. COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND BUSINESS SYSTEMS
CERTIFICATE
This is to certify that this internship report, “AWS Data Engineering Virtual Internship,” is the
bonafide work of J. Geetha Harshitha, who carried out the work under my supervision and
submitted it in partial fulfillment of the requirements for the award of the Summer Internship
(CB-451) during the year 2023–2024.
I would like to express my sincere gratitude to the dignitaries who were with me on the
journey of my Summer Internship, “AWS Data Engineering Virtual Internship.”
First and foremost, I extend my heartfelt thanks to Dr. Kolla Srinivas, Principal of R. V. R.
& J. C. College of Engineering, Guntur, for providing me with such a welcoming
environment in which to undertake this internship.
I am deeply grateful to Dr. A. Sri Nagesh, Head of the Department of Computer Science
and Business Systems, for paving a path for me and helping me meet the requirements
needed to complete this internship.
I extend my gratitude to the internship in-charge, K. Subramanyam, for her enthusiastic guidance and
feedback throughout the internship. Her constant support and constructive criticism helped
me complete the internship.
I would also like to express my sincere thanks to my friends and family for their moral
support throughout this journey.
J. Geetha Harshitha
(Y21CB017)
Internship Details
SUMMER INTERNSHIP CERTIFICATE
TABLE OF CONTENTS
List of Figures
ABSTRACT
11. Conclusion
12. References
LIST OF FIGURES
Figure Description
Fig- 1.1 Code Generation
Fig- 1.2 Benefits of Amazon CodeWhisperer
Fig- 2.1 Data Pipeline
Fig- 3.1 Data Characteristics
Fig- 4.1 Types of Scaling
Fig- 4.2 Template Structure
Fig- 4.3 AWS CloudFormation
Fig- 5.1 ETL & ELT Comparison
Fig- 6.1 Batch & Streaming Ingestion
Fig- 6.2 Purpose-Built Ingestion Tools
Fig- 7.1 Storage in Modern Architecture
Fig- 8.1 Data Processing
Fig- 8.2 Apache Hadoop
Fig- 8.3 ML Models
Fig- 8.4 ML Life Cycle
Fig- 8.5 ML Framing
Fig- 8.6 Collecting Data
Fig- 9.1 Factors & Needs
Fig- 9.2 QuickSight Example
Fig- 10.1 Automating Infrastructure
Fig- 10.2 Step Functions
ABSTRACT
Amazon Web Services (AWS) is a versatile cloud platform offering a broad range of
services tailored for data engineering, enabling organizations to efficiently manage, process, and
analyze large datasets. AWS is designed to accommodate businesses of all sizes, providing
scalable and flexible solutions to meet the complex needs of modern data engineering. Core
services such as Amazon S3, Amazon Redshift, and Amazon EMR form the foundation of AWS's
data engineering capabilities. Amazon S3 offers scalable object storage with high durability and
availability, making it ideal for storing data at various stages of the pipeline, while its integration
with other AWS services ensures seamless data flow. Amazon Redshift provides powerful data
warehousing, allowing organizations to run complex queries on large volumes of structured and
semi-structured data, making it key for business intelligence and analytics. Amazon EMR simplifies
running large-scale distributed data processing frameworks like Hadoop and Spark, enabling
efficient and cost-effective data transformation. AWS also offers advanced tools like AWS Glue,
which automates ETL (Extract, Transform, Load) processes, and Amazon Kinesis, which enables
real-time data streaming for immediate insights. AWS Lambda supports serverless computing,
allowing code execution without managing infrastructure, simplifying pipeline orchestration. The
global AWS infrastructure ensures high availability, low latency, and fault tolerance, essential for
reliable data processing. The pay-as-you-go pricing model allows businesses to scale
cost-effectively without large upfront investments. AWS's management and monitoring tools,
such as Amazon CloudWatch and AWS CloudTrail, enhance the administration of data engineering
workloads by providing detailed monitoring and audit logging. With strong security
features, including encryption and compliance certifications, AWS ensures the protection of
sensitive data, making it a powerful, flexible, and secure platform for building and managing
sophisticated data pipelines and analytics solutions.
COURSE MODULES
Fig- 1.2: Benefits of Amazon CodeWhisperer
CodeWhisperer code generation offers many benefits for software development
organizations. It accelerates application development for faster delivery of software solutions.
By automating repetitive tasks, it optimizes the use of developer time, so developers can focus
on more critical aspects of the project. Additionally, code generation helps mitigate security
vulnerabilities, safeguarding the integrity of the codebase. CodeWhisperer also protects
open-source intellectual property by providing an open-source reference tracker. CodeWhisperer
enhances code quality and reliability, leading to robust and efficient applications, and it
supports an efficient response to evolving software threats, keeping the codebase up to date
with the latest security practices. CodeWhisperer has the potential to increase development
speed, security, and the quality of software.
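As an illustration of comment-driven generation, the snippet below pairs the kind of natural-language comment a developer might write with the kind of implementation a tool like CodeWhisperer can suggest. The function body is a hand-written sketch, not actual CodeWhisperer output, and the bucket and key parameters are hypothetical.

# The developer writes a descriptive comment as a prompt:
# "Function to upload a file to an S3 bucket with server-side encryption"
# An AI coding companion can suggest an implementation like this inline.
import boto3

def upload_file_encrypted(file_path: str, bucket: str, key: str) -> None:
    # Upload the file with SSE enabled on the stored object.
    s3 = boto3.client("s3")
    s3.upload_file(
        file_path,
        bucket,
        key,
        ExtraArgs={"ServerSideEncryption": "AES256"},
    )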
Another key characteristic of deriving insights by using your data pipeline is that the process
will almost always be iterative. You have a hypothesis about what you expect to find in the
data, and you need to experiment and see where it takes you. You might develop your
hypothesis by using BI tools to do initial discovery and analysis of data that has already been
collected. You might iterate within a pipeline segment, or you might iterate across the entire
pipeline. For example, in this illustration, the initial iteration (number 1) yielded a result that
wasn't as well defined as desired. Therefore, the data scientist refined the model and
reprocessed the data to get a better result (number 2). After reviewing those results, they
determined that additional data could improve the detail available in their result, so an
additional data source was tapped and ingested through the pipeline to produce the desired
result (number 3). A pipeline often has iterations of storage and processing. For example, after
the external data is ingested into pipeline storage, iterative processing transforms the data into
different levels of refinement for different needs.
The architecture illustrates the following other AWS purpose-built services that integrate
with Amazon S3 and map to each of the components described previously:
• Amazon Redshift is a fully managed data warehouse service.
• Amazon OpenSearch Service is a purpose-built data store and search engine that is
optimized for real-time analytics, including log analytics.
• Amazon EMR provides big data processing and simplifies some of the most complex
elements of setting up big data processing.
• Amazon Aurora provides a relational database engine that was built for the cloud.
• Amazon DynamoDB is a fully managed nonrelational database that is designed to run
high-performance applications.
• Amazon SageMaker is an AI/ML service that democratizes access to the ML process.
Producers ingest records onto the stream. Producers are integrations that collect data from
a source and load it onto the stream. Consumers process records: they read data from
the stream and perform their own processing on it. The stream itself provides a temporary
but durable storage layer for the streaming solution. In the pipeline depicted here,
Amazon CloudWatch Events is the producer that puts CloudWatch Events event data
onto the stream, Kinesis Data Streams provides the storage, and the data is then available to
multiple consumers.
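A minimal sketch of the producer and consumer sides using boto3's Kinesis client. The stream name, region, and record contents are hypothetical; a real consumer would page through shards and iterators continuously.

import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Producer: put one record onto the stream under a partition key.
kinesis.put_record(
    StreamName="example-events",
    Data=json.dumps({"source": "cloudwatch", "detail": "example event"}).encode("utf-8"),
    PartitionKey="event-source-1",
)

# Consumer: read records from the first shard, starting at the oldest record.
shard_id = kinesis.describe_stream(StreamName="example-events")[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName="example-events",
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]
records = kinesis.get_records(ShardIterator=iterator, Limit=100)["Records"]
for record in records:
    print(json.loads(record["Data"]))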
4. SECURING & SCALING DATA PIPELINE
4.1 Scaling: An Overview
Data Wrangling:
Data Structuring:
For the scenario described previously, the structuring step includes exporting a .json file
from the customer support ticket system, loading the .json file into Excel, and letting
Excel parse the file. For the mapping step for the supp2 data, the data engineer would
modify the cust_num field to match the customer_id field in the data warehouse.
For this example, you would perform additional data wrangling steps, sketched below,
before compressing the file for upload to the S3 bucket.
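The text describes doing the structuring and mapping in Excel; the sketch below shows an equivalent scripted approach with pandas. The file name, field spellings, and cleaning rules are assumptions based on the example above.

import pandas as pd

# Structuring: parse the support-ticket export (hypothetical file name).
tickets = pd.read_json("supp2_tickets.json")

# Mapping: rename the source field to match the warehouse schema.
tickets = tickets.rename(columns={"cust_num": "customer_id"})

# Cleaning: drop duplicate rows and rows missing a customer_id.
tickets = tickets.drop_duplicates().dropna(subset=["customer_id"])

# Compress the result, ready for upload to the S3 bucket.
tickets.to_csv("supp2_tickets.csv.gz", index=False, compression="gzip")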
Data Cleaning:
6. INGESTING BY BATCH OR BY STREAM
6.1 Comparing batch and stream ingestion:
Use Amazon AppFlow to ingest data from a software as a service (SaaS)
application. You can do the following with Amazon AppFlow:
• Create a connector that reads from a SaaS source and includes filters.
• Map fields in each source object to fields in the destination and perform transformations.
• Perform validation on records to be transferred.
• Securely transfer to Amazon S3 or Amazon Redshift.
You can trigger an ingestion on demand, on an event, or on a schedule.
An example use case for Amazon AppFlow is to ingest customer support ticket data from the
Zendesk SaaS product.
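As a sketch, a pre-configured flow can be triggered on demand with boto3. The flow name is hypothetical, and the flow itself — the Zendesk source connector, field mappings, and S3 or Redshift destination — would be defined beforehand in the AppFlow console or with create_flow.

import boto3

appflow = boto3.client("appflow")

# Trigger an on-demand run of an existing flow (hypothetical name).
response = appflow.start_flow(flowName="zendesk-tickets-to-s3")
print(response["flowStatus"], response.get("executionId"))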
Data in cloud object storage is handled as objects. Each object is assigned a key,
which is a unique identifier. When the key is paired with metadata that is attached
to the objects, other AWS services can use the information to unlock a multitude
of capabilities. Thanks to economies of scale, cloud object storage comes at a
lower cost than traditional storage.
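A short boto3 sketch of storing an object under a unique key with attached metadata that downstream services can read back. The bucket, key, and metadata values are hypothetical.

import boto3

s3 = boto3.client("s3")

# Store an object under a unique key, with metadata attached.
with open("supp2_tickets.csv.gz", "rb") as f:
    s3.put_object(
        Bucket="example-data-lake",
        Key="raw/tickets/2024/05/supp2_tickets.csv.gz",
        Body=f,
        Metadata={"source": "zendesk", "ingest-date": "2024-05-01"},
    )

# The key plus metadata can be inspected by downstream consumers.
head = s3.head_object(
    Bucket="example-data-lake",
    Key="raw/tickets/2024/05/supp2_tickets.csv.gz",
)
print(head["Metadata"])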
Data Warehouse Storage:
• Provides a centralized repository
• Stores structured and semi-structured data
• Stores data in one of two ways: frequently accessed data in fast storage, and
infrequently accessed data in cheap storage
• Might contain multiple databases that are organized into tables and columns
• Separates analytics processing from transactional databases
• Example: Amazon Redshift
Purpose-Built Databases:
• ETL pipelines transform data in buffered memory prior to loading the data into a data lake
or data warehouse for storage.
• ELT pipelines extract and load data into a data lake or data warehouse for storage
without transformation.
Here are a few key points to summarize this section. Storage plays an integral part in
ELT and ETL pipelines. Data often moves in and out of storage numerous times, based on
pipeline type and workload type. ELT pipelines extract and load data into data lake or
data warehouse storage without transformation; the transformation of the data is part of
the target system's workload, as in the sketch below.
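A hedged sketch of the ELT pattern using the Amazon Redshift Data API as one possible mechanism: raw data is loaded into the warehouse first, then the transformation runs as SQL inside the target system. The cluster, database, table, and IAM role identifiers are hypothetical.

import boto3

rsd = boto3.client("redshift-data")

# Load step: COPY raw data from S3 into a staging table, untransformed.
copy_sql = """
    COPY raw_tickets
    FROM 's3://example-data-lake/raw/tickets/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV GZIP IGNOREHEADER 1;
"""

# Transform step: the target system does the transformation work in SQL.
transform_sql = """
    INSERT INTO analytics.tickets_clean
    SELECT DISTINCT customer_id, opened_at, status
    FROM raw_tickets
    WHERE customer_id IS NOT NULL;
"""

for sql in (copy_sql, transform_sql):
    rsd.execute_statement(
        ClusterIdentifier="example-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=sql,
    )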
Securing Storage:
Apache Spark:
Apache Spark has the following characteristics:
•Is an open-source, distributed processing framework
•Uses in-memory caching and optimized query processing
•Supports code reuse across multiple workloads
•Clusters consist of leader and worker nodes
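A minimal PySpark sketch illustrating these characteristics: the driver on the leader node coordinates the worker nodes, and a cached DataFrame is reused across queries. The S3 path and column names are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The driver (leader) coordinates distributed work across worker nodes.
spark = SparkSession.builder.appName("ticket-counts").getOrCreate()

tickets = spark.read.json("s3://example-data-lake/raw/tickets/")
tickets.cache()  # in-memory caching for reuse across multiple queries

# Optimized, distributed aggregation executed on the workers.
counts = tickets.groupBy("status").agg(F.count("*").alias("n"))
counts.show()

spark.stop()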
• Processes data for analytics and BI workloads by using big data frameworks
• Transforms and moves large amounts of data into and out of AWS data stores
ML Concepts:
ML Life Cycle:
Collecting Data:
Data characteristics:
•How much data is there?
•At what speed and volume does it arrive?
•How frequently is it updated?
•How quickly is it processed?
•What type of data is it?
If you build infrastructure with code, you gain the benefits of repeatability and
reusability while you build your environments. In the example shown, a single template
is used to deploy Network Load Balancers and Auto Scaling groups that contain
Amazon Elastic Compute Cloud (Amazon EC2) instances. Network Load Balancers
distribute traffic evenly across targets.
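A short boto3 sketch of deploying such a template repeatedly. The stack name, template file, and parameter are hypothetical; the template body itself (not shown) would declare the Network Load Balancer, Auto Scaling group, and EC2 instances.

import boto3

cfn = boto3.client("cloudformation")

# Deploy the same template again and again to build identical environments.
with open("nlb-asg-template.yaml") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="data-pipeline-network",
    TemplateBody=template_body,
    Parameters=[{"ParameterKey": "InstanceType", "ParameterValue": "t3.medium"}],
)

# Block until all declared resources have been created.
cfn.get_waiter("stack_create_complete").wait(StackName="data-pipeline-network")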
CI/CD:
CI/CD can be pictured as a pipeline in which new code is submitted at one end, tested
through a series of stages (source, build, test, staging, and production), and then
published as production-ready code.
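As a sketch, an existing AWS CodePipeline pipeline can be started and its stages inspected with boto3; the pipeline name is hypothetical, and the pipeline with its source, build, test, staging, and production stages would be defined beforehand.

import boto3

codepipeline = boto3.client("codepipeline")

# Kick off a run of an existing pipeline (hypothetical name).
execution = codepipeline.start_pipeline_execution(name="data-etl-pipeline")
print(execution["pipelineExecutionId"])

# Inspect where the new revision is across the stages.
state = codepipeline.get_pipeline_state(name="data-etl-pipeline")
for stage in state["stageStates"]:
    print(stage["stageName"], stage.get("latestExecution", {}).get("status"))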
With Step Functions, you can use visual workflows to coordinate the
components of distributed applications and microservices.
You define a workflow, which is also referred to as a state machine, as a series
of steps and transitions between each step.
Step Functions is integrated with Athena to facilitate building workflows that
include Athena queries and data processing operations.
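A minimal sketch of such a workflow: an Amazon States Language state machine with an Athena query task followed by a processing step, registered with boto3. The ARNs, names, and query are hypothetical.

import json
import boto3

sfn = boto3.client("stepfunctions")

# State machine: run an Athena query, then process the results with Lambda.
definition = {
    "StartAt": "RunAthenaQuery",
    "States": {
        "RunAthenaQuery": {
            "Type": "Task",
            # Service integration that runs the query and waits for completion.
            "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
            "Parameters": {
                "QueryString": "SELECT status, COUNT(*) FROM tickets GROUP BY status",
                "WorkGroup": "primary",
            },
            "Next": "ProcessResults",
        },
        "ProcessResults": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-results",
            "End": True,
        },
    },
}

sfn.create_state_machine(
    name="athena-query-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsRole",
)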
CONCLUSION
Data engineering is a critical component in the modern data landscape, playing a crucial
role in the success of data-driven decision-making and analytics. As we draw
conclusions about data engineering, several key points come to the forefront:
REFERENCES
1. https://docs.aws.amazon.com/prescriptive-guidance/latest/modern-data-centric-use-cases/data-engineering-principles.html
2. https://medium.com/@polystat.io/data-engineering-complete-reference-guide-from-a-z-2019-852c308b15ed
3. https://aws.amazon.com/big-data/datalakes-and-analytics/
4. https://www.preplaced.in/blog/an-actionable-guide-to-aws-glue-for-data-engineers
5. https://awsacademy.instructure.com/courses/68699/modules/items/6115658
6. https://aws.amazon.com/blogs/big-data/aws-serverless-data-analytics-pipeline-reference-architecture/
7. https://dev.to/tkeyo/data-engineering-pipeline-with-aws-step-functions-codebuild-and-dagster-5290