Summer Internship Report on AWS Data Engineering
Submitted by
Student Name: VASAMSETTI KOWSHIK
Roll No: 216N1A05C1
(2024-2025)
SRINIVASA INSTITUTE OF ENGINEERING AND TECHNOLOGY
(Autonomous)
(Approved by AICTE, New Delhi; Permanently affiliated to JNTUK, Kakinada)
(An ISO 9001:2015 Certified Institute, Accredited by NAAC with ‘A’ Grade) NH-216,
CERTIFICATE
This is to certify that V. KOWSHIK, Reg. No. 216N1A05C1, has completed his/her Internship with AICTE
Eduskills on AWS Data Engineering in partial fulfillment of the requirements for the Degree of
Bachelor of Technology in the Department of Computer Science and Engineering for the academic
year 2024-2025.
Mr. P. Chaitanya, Assistant Professor, T&P Cell
Dr. V. Sai Priya, Professor, Department of CSE
ACKNOWLEDGEMENT
I would like to extend my sincere gratitude to Shri Buddha Chandrasekhar, Chief Coordinating Officer, NEAT
Cell, AICTE, and Dr. Satya Ranjan Biswal, Chief Technology Officer, Eduskills, for their invaluable support
throughout my AWS Data Engineering Virtual Internship. This opportunity provided me with practical insights into
cloud technologies and data engineering, significantly enhancing my technical skills and professional growth. Their
leadership was key to making this learning experience truly impactful.
I sincerely appreciate AWS Academy for the comprehensive curriculum in my AWS Data Engineering Virtual
Internship. It provided invaluable insights into cloud technologies and significantly enhanced my technical capabilities
for future challenges.
Our sincere gratitude goes to Mr. P. Chaitanya, our Internship Coordinator, whose constant support, valuable feedback, and
motivating presence steered us through the challenges we encountered during the project. His leadership played a
critical role in the successful completion of our internship.
I am deeply indebted to Dr. V. Sai Priya, Head of the Department, for her guidance and for ensuring we had access to
the necessary resources and support throughout the internship. Her encouragement has been a driving force in our
progress.
My sincere thanks also go to Dr. B. Rathna Raju, Principal, for providing us with the opportunity to embark on this
journey, as well as for the continuous support extended during this period.
Finally, I would like to express my appreciation to our College Management, faculty, lab technicians, non-teaching
staff, and friends, who have played an essential role in helping us complete the internship. Their timely support, both
direct and indirect, contributed greatly to our success.
ABSTRACT
This whitepaper helps architects, data scientists, and developers understand the big data analytics options
available in the AWS Cloud by providing an overview of the relevant services. It concludes with scenarios
that showcase these analytics options in use, as well as additional resources for getting started with big
data analytics on AWS.
Objectives:
1. Data Integration and Centralization:
Unify diverse datasets from educational institutions, encompassing student records,
faculty information, academic performance, and administrative data.
2. Real-time Data Processing:
Enable near-real-time processing and analysis of data, facilitating quick decision-making for
educational administrators and policymakers.
3. Security and Compliance:
Implement stringent security measures to safeguard sensitive student and faculty
information, adhering to regulatory standards set forth by AICTE.
4. Scalable Infrastructure:
Design and deploy a scalable data infrastructure on AWS, ensuring it can adapt to the
growing data volumes from an expanding network of technical institutions.
5. Analytics and Reporting:
Establish a comprehensive analytics framework for generating actionable insights,
supporting informed decision-making at both institutional and regulatory levels.
6. Collaborative Data Ecosystem:
Promote collaboration between educational institutions by facilitating secure data
exchange and interoperability within the AWS environment.
INTERNSHIP ACTIVITIES
WEEK – 1:
• Identify the risks and approaches to secure and govern data at each step and each
transition of the data pipeline.
• Identify scaling considerations and best practices for building pipelines that
handle large-scale datasets.
• Design and build a data collection process while considering constraints such as
scalability, cost, fault tolerance, and latency.
CodeWhisperer code generation offers many benefits for software development organizations. It
accelerates application development for faster delivery of software solutions. By automating repetitive tasks,
it optimizes the use of developer time, so developers can focus on more critical aspects of the project.
Additionally, code generation helps mitigate security vulnerabilities, safeguarding the integrity of the
codebase. CodeWhisperer also protects open source intellectual property by providing an open source
reference tracker. CodeWhisperer enhances code quality and reliability, leading to robust and efficient
applications, and it supports an efficient response to evolving software threats, keeping the codebase up to
date with the latest security practices. Overall, CodeWhisperer has the potential to increase development
speed, security, and the quality of software.
DATA DRIVEN ORGANIZATIONS
Data Driven Decisions:
Another key characteristic of deriving insights by using your data pipeline is that the process will
almost always be iterative. You have a hypothesis about what you expect to find in the data, and
you need to experiment and see where it takes you. You might develop your hypothesis by using
BI tools to do initial discovery and analysis of data that has already been collected. You might
iterate within a pipeline segment, or you might iterate across the entire pipeline. For example, an
initial iteration (1) might yield a result that is less well defined than desired, so the data scientist
refines the model and reprocesses the data to get a better result (2). After reviewing those results,
they might determine that additional data could improve the detail of the result, so an additional
data source is tapped and ingested through the pipeline to produce the desired result (3). A
pipeline often has iterations of
storage and processing. For example, after the external data is ingested into pipeline storage,
iterative processing transforms the data into different levels of refinement for different needs.
WEEK – 2:
The reality is that a modern architecture might include all of these elements. The key to a modern
data architecture is to apply the three-pronged strategy that you learned about earlier. Modernize
the technology that you are using. Unify your data sources to create a single source of truth that can
be accessed and used across the organization. And innovate to get higher value analysis from the
data that you have.
The architecture includes the following AWS purpose-built services that integrate with
Amazon S3 and map to each of the components described previously:
• Amazon Redshift is a fully managed data warehouse service.
• Amazon OpenSearch Service is a purpose-built data store and search engine that is
optimized for real-time analytics, including log analytics.
• Amazon EMR provides big data processing and simplifies some of the most complex
elements of setting up big data processing.
• Amazon Aurora provides a relational database engine that was built for the cloud.
• Amazon DynamoDB is a fully managed nonrelational database that is designed to run
high-performance applications.
• Amazon SageMaker is an AI/ML service that democratizes access to the ML process.
Modern data architecture pipeline: Ingestion and storage:
Data being ingested into the Amazon S3 data lake arrives at the landing zone, where it is first
cleaned and then stored in the raw zone for permanent storage. Because data that is destined for the
data warehouse needs to be highly trusted and conformed to a schema, the data needs to be
processed further. Additional transformations include applying the schema and partitioning
(structuring), as well as any other transformations required to make the data conform to the
requirements established for the trusted zone. Finally, the processing layer prepares the data
for the curated zone by modeling and augmenting it to be joined with other datasets (enrichment)
and then stores the transformed, validated data in the curated layer. Datasets from the curated
layer are ready to be ingested into the data warehouse to make them available for low-latency
access or complex SQL querying.
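To make the trusted-zone processing step concrete, the following is a minimal PySpark sketch of promoting raw JSON data to partitioned Parquet in the trusted zone. The bucket names, paths, and the simple orders schema are illustrative assumptions, and such a job would normally run on Amazon EMR or AWS Glue with the appropriate S3 connectors configured.

# Sketch: promote raw JSON data to the trusted zone as partitioned Parquet.
# Bucket names, paths, and the schema below are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

spark = SparkSession.builder.appName("raw-to-trusted").getOrCreate()

schema = StructType([
    StructField("order_id", StringType(), False),
    StructField("region", StringType(), True),
    StructField("amount", DoubleType(), True),
    StructField("order_date", DateType(), True),
])

# Read raw JSON from the raw zone, enforcing the agreed schema.
raw_df = spark.read.schema(schema).json("s3://example-data-lake/raw/orders/")

# Drop records that fail basic quality checks before promotion.
trusted_df = raw_df.dropna(subset=["order_id"]).dropDuplicates(["order_id"])

# Write partitioned Parquet to the trusted zone for efficient querying.
trusted_df.write.mode("overwrite").partitionBy("region").parquet(
    "s3://example-data-lake/trusted/orders/"
)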
Producers ingest records onto the stream. Producers are integrations that collect data from
a source and load it onto the stream. Consumers process records. Consumers read data from the
stream and perform their own processing on it. The stream itself provides a temporary but durable
storage layer for the streaming solution. In the pipeline depicted here, Amazon CloudWatch
Events is the producer that puts event data onto the stream.
Kinesis Data Streams provides the storage. The data is then available to multiple consumers.
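As a rough illustration of the producer and consumer roles described above, the following boto3 sketch writes a record to a Kinesis data stream and reads it back from one shard; the stream name and record contents are assumptions made for the example.

# Sketch of a Kinesis Data Streams producer and consumer using boto3.
# The stream name "pipeline-events" and the record payload are assumptions.
import json
import boto3

kinesis = boto3.client("kinesis")

# Producer: put a record onto the stream.
kinesis.put_record(
    StreamName="pipeline-events",
    Data=json.dumps({"event": "object_created", "key": "raw/orders/1.json"}).encode("utf-8"),
    PartitionKey="orders",
)

# Consumer: read records from the first shard of the stream.
shard_id = kinesis.describe_stream(StreamName="pipeline-events")["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName="pipeline-events",
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]
for record in kinesis.get_records(ShardIterator=iterator, Limit=10)["Records"]:
    print(json.loads(record["Data"]))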
WEEK – 3:
AWS CloudFormation is a fully managed service that provides a common language for you to
describe and provision all of the infrastructure resources in your cloud environment.
CloudFormation creates, updates, and deletes the resources for your applications in environments
called stacks. A stack is a collection of AWS resources that are managed as a single unit.
CloudFormation is all about automated resource provisioning: it simplifies the task of repeatedly and
predictably creating groups of related resources that power your applications. Resources are described in
text files (templates) by using JSON or YAML format.
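As a minimal sketch of stack-based provisioning, the following boto3 example creates a stack from an inline YAML template that declares a single S3 bucket; the stack and bucket names are illustrative assumptions.

# Sketch: provision a CloudFormation stack from an inline YAML template.
# The stack name and bucket name are illustrative assumptions.
import boto3

template_body = """
AWSTemplateFormatVersion: '2010-09-09'
Description: Minimal example stack with a single S3 bucket
Resources:
  ExampleDataBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: example-internship-data-bucket
"""

cloudformation = boto3.client("cloudformation")
cloudformation.create_stack(StackName="example-data-stack", TemplateBody=template_body)

# Wait until the stack (the unit of managed resources) reaches CREATE_COMPLETE.
cloudformation.get_waiter("stack_create_complete").wait(StackName="example-data-stack")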
INGESTING & PREPARING DATA
Data wrangling:
Transforming large amounts of unstructured or structured raw data from multiple sources with different
schemas into a meaningful set of data that has value for downstream processes or users.
Data Structuring:
For the scenario that was described previously, the structuring step includes exporting a .json file from the
customer support ticket system, loading the .json file into Excel, and letting Excel parse the file. For the
mapping step for the supp2 data, the data engineer would modify the cust num field to match the customer
id field in the data warehouse.
For this example, you would perform additional data wrangling steps before compressing the file for
upload to the S3 bucket.
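The same structuring, mapping, and compression steps could also be scripted rather than done in Excel. The following pandas and boto3 sketch is one possible version, assuming hypothetical file, field, and bucket names.

# Sketch of the structuring, mapping, and compression steps in Python.
# File names, field names, and the bucket are illustrative assumptions.
import boto3
import pandas as pd

# Structuring: parse the exported support-ticket JSON file.
tickets = pd.read_json("supp2_tickets.json")

# Mapping: rename the source field to match the warehouse schema.
tickets = tickets.rename(columns={"cust num": "customer_id"})

# Compress the wrangled dataset and upload it to the S3 bucket.
tickets.to_csv("supp2_tickets.csv.gz", index=False, compression="gzip")
boto3.client("s3").upload_file(
    "supp2_tickets.csv.gz", "example-data-lake", "landing/supp2/supp2_tickets.csv.gz"
)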
Data Cleaning:
Cleaning addresses issues such as missing values, duplicate records, and inconsistent formats before the
data moves further along the pipeline.
To generalize the characteristics of batch processing, batch ingestion involves running batch jobs
that query a source, move the resulting dataset or datasets to durable storage in the pipeline, and
then perform whatever transformations are required for the use case. As noted in the Ingesting and
Preparing Data module, this could be just cleaning and minimally formatting data to put it into the
lake. Or, it could be more complex enrichment, augmentation, and processing to support complex
querying or big data and machine learning (ML) applications. Batch processing might be started
on demand, run on a schedule, or initiated by an event. Traditional extract, transform, and load
(ETL) uses batch processing, but extract, load, and transform (ELT) processing might also be done
by batch.
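To illustrate starting a batch job on demand, the following boto3 sketch triggers an AWS Glue job run and polls its status; the job name is an assumption, and the Glue job itself (for example, a Spark ETL script) would be defined separately.

# Sketch: start a batch ETL job on demand with AWS Glue and check its status.
# The job name "orders-batch-etl" is an illustrative assumption.
import time
import boto3

glue = boto3.client("glue")
run_id = glue.start_job_run(JobName="orders-batch-etl")["JobRunId"]

# Poll until the batch run finishes; scheduled or event-driven starts would
# typically be configured through EventBridge rather than ad hoc polling.
while True:
    state = glue.get_job_run(JobName="orders-batch-etl", RunId=run_id)["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        print("Batch run finished with state:", state)
        break
    time.sleep(30)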
Data ingestion is the process of transporting data from one or more sources to a target site for further
processing and analysis. This data can originate from a range of sources, including data lakes, IoT devices,
on-premises databases, and SaaS apps, and can end up in different target environments, such as cloud
data warehouses or data marts.
Use Amazon AppFlow to ingest data from a software as a service (SaaS) application. You can
do the following with Amazon AppFlow:
• Create a connector that reads from a SaaS source and includes filters.
• Map fields in each source object to fields in the destination and perform
transformations.
• Perform validation on records to be transferred.
• Securely transfer to Amazon S3 or Amazon Redshift. You can trigger an ingestion on
demand, on event, or on a schedule.
An example use case for Amazon AppFlow is to ingest customer support ticket data from the Zendesk
SaaS product.
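The sketch below shows how such a flow could be run on demand with boto3, assuming a flow named zendesk-tickets-to-s3 has already been configured in AppFlow with its Zendesk connector, field mappings, validations, and S3 destination.

# Sketch: trigger an on-demand run of an existing Amazon AppFlow flow.
# The flow name "zendesk-tickets-to-s3" is an illustrative assumption.
import boto3

appflow = boto3.client("appflow")

# Start the flow on demand; flows can also run on a schedule or on an event.
execution_id = appflow.start_flow(flowName="zendesk-tickets-to-s3")["executionId"]
print("Started flow execution:", execution_id)

# Inspect recent runs of the flow, including their status.
runs = appflow.describe_flow_execution_records(flowName="zendesk-tickets-to-s3")
for run in runs["flowExecutions"]:
    print(run["executionId"], run["executionStatus"])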
WEEK – 5:
Data in cloud object storage is handled as objects. Each object is assigned a key, which is a
unique identifier. When the key is paired with metadata that is attached to the object, other AWS
services can use the information to unlock a multitude of capabilities. Thanks to economies
of scale, cloud object storage comes at a lower cost than traditional storage.
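A small boto3 sketch of the key-plus-metadata idea follows; the bucket, key, and metadata values are assumptions chosen for illustration.

# Sketch: store an object under a unique key with user-defined metadata,
# then read that metadata back without downloading the object body.
# Bucket, key, and metadata values are illustrative assumptions.
import boto3

s3 = boto3.client("s3")

s3.put_object(
    Bucket="example-data-lake",
    Key="raw/orders/2024/06/orders.json",
    Body=b'{"order_id": "1001", "amount": 25.0}',
    Metadata={"source-system": "orders-api", "ingest-date": "2024-06-01"},
)

# Other services or later pipeline steps can use the key and metadata.
head = s3.head_object(Bucket="example-data-lake", Key="raw/orders/2024/06/orders.json")
print(head["Metadata"])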
Here are a few key points to summarize this section:
• Storage plays an integral part in ELT and ETL pipelines. Data often moves in and out of storage
numerous times, based on pipeline type and workload type.
• ETL pipelines transform data in buffered memory prior to loading it into a data lake or
data warehouse for storage. Levels of buffered memory vary by service.
• ELT pipelines extract and load data into data lake or data warehouse storage without
transformation. The transformation of the data is part of the target system’s workload.
Securing Storage:
Security for a data warehouse in Amazon Redshift:
• Amazon Redshift database security is distinct from the security of the service itself.
• Amazon Redshift provides additional features to manage database security.
• Due to third-party auditing, Amazon Redshift can help to support applications that are
required to meet international compliance standards.
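To illustrate that database security is managed separately from the security of the service itself, the following sketch uses the Redshift Data API to grant a database user read access to a table; the cluster, database, user, and table names are illustrative assumptions.

# Sketch: manage database-level permissions in Amazon Redshift with the
# Redshift Data API. Cluster, database, user, and table names are assumptions.
import boto3

redshift_data = boto3.client("redshift-data")

# Grant read-only access on a reporting table to an analyst database user.
redshift_data.execute_statement(
    ClusterIdentifier="example-redshift-cluster",
    Database="analytics",
    DbUser="admin_user",
    Sql="GRANT SELECT ON TABLE reporting.student_performance TO analyst_user;",
)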
WEEK – 6:
Apache Hadoop:
Apache Spark:
ML Concepts:
ML Life Cycle:
Fig. 8.4: ML life cycle
Framing the ML Problem to Meet Business Goals: Working backwards from the business problem to be
solved:
• What is the business problem?
• What pain is it causing?
• Why does this problem need to be resolved?
• What will happen if you don't solve this problem?
• How will you measure success?
Collecting Data:
Data characteristics:
• How much data is there?
• At what speed and volume does it arrive?
• How frequently is it updated?
• How quickly is it processed?
• What type of data is it?
Data from multiple sources is put in Amazon S3, where Athena can be used for one-time queries.
Amazon EMR aggregates the data and stores the aggregates in S3, and Athena can also be used to
query the aggregated datasets. From S3, the data can be loaded into Amazon Redshift, where
QuickSight can access the data to create visualizations.
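A one-time Athena query over the data in S3 could look like the following boto3 sketch; the database, table, and results location are illustrative assumptions.

# Sketch: run a one-time Athena query against data stored in Amazon S3.
# The database, table, and output location are illustrative assumptions.
import time
import boto3

athena = boto3.client("athena")

query_id = athena.start_query_execution(
    QueryString="SELECT region, SUM(amount) AS total FROM orders_aggregated GROUP BY region",
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)["QueryExecutionId"]

# Poll for completion, then print the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])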
If you build infrastructure with code, you gain the benefits of repeatability and reusability while you build
your environments. For example, a single template can be used to deploy Network Load Balancers and
Auto Scaling groups that contain Amazon Elastic Compute Cloud (Amazon EC2) instances. Network Load
Balancers distribute traffic evenly across targets.
CI/CD:
CI/CD can be pictured as a pipeline, where new code is submitted on one end, tested over a series of
stages (source, build, test, staging, and production), and then published as production-ready code.
With Step Functions, you can use visual workflows to coordinate the components of
distributed applications and microservices.
You define a workflow, which is also referred to as a state machine, as a series of steps
and transitions between each step.
Step Functions is integrated with Athena to facilitate building workflows that include
Athena queries and data processing operations.
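As a minimal sketch of this integration, the following example defines a one-step state machine that runs an Athena query through the built-in Athena service integration; the role ARN, workgroup, and query are illustrative assumptions.

# Sketch: create a Step Functions state machine (a workflow of steps and
# transitions) that runs an Athena query via the Athena service integration.
# The role ARN, workgroup, and query are illustrative assumptions.
import json
import boto3

definition = {
    "Comment": "Run an Athena query, then finish",
    "StartAt": "RunAthenaQuery",
    "States": {
        "RunAthenaQuery": {
            "Type": "Task",
            "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
            "Parameters": {
                "QueryString": "SELECT COUNT(*) FROM analytics_db.orders_aggregated",
                "WorkGroup": "primary",
            },
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="athena-query-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/example-stepfunctions-role",
)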
CONCLUSION
Data engineering is a critical component in the modern data landscape, playing a crucial role in the
success of data-driven decision-making and analytics. As we draw conclusions about data
engineering, several key points come to the forefront: