Summer Internship Report on AWS Data Engineering
Submitted by
Student Name: VASAMSETTI KOWSHIK
Roll No: 216N1A05C1
(2024-2025)
SRINIVASA INSTITUTE OF ENGINEERING AND TECHNOLOGY
(Autonomous)
(Approved by AICTE, New Delhi; Permanently affiliated to JNTUK, Kakinada)
(An ISO 9001:2015 Certified Institute, Accredited by NAAC with ‘A’ Grade) NH-216,
CERTIFICATE
This is to certify that V. KOWSHIK, Reg. No. 216N1A05C1, has completed his/her Internship with AICTE
Eduskills on AWS Data Engineering in partial fulfillment of the requirements for the Degree of
Bachelor of Technology in the Department of Computer Science and Engineering for the academic
year 2024-2025.
Mr. P. Chaitanya, Assistant Professor, T&P Cell
Dr. V. Sai Priya, Professor, Department of CSE
ACKNOWLEDGEMENT
I would like to extend my sincere gratitude to Shri Buddha Chandrasekhar, Chief Coordinating Officer, NEAT
Cell, AICTE, and Dr. Satya Ranjan Biswal, Chief Technology Officer, Eduskills, for their invaluable support
throughout my AWS Data Engineering Virtual Internship. This opportunity provided me with practical insights into
cloud technologies and data engineering, significantly enhancing my technical skills and professional growth. Their
leadership was key to making this learning experience truly impactful.
I sincerely appreciate AWS Academy for the comprehensive curriculum in my AWS Data Engineering Virtual
Internship. It provided invaluable insights into cloud technologies and significantly enhanced my technical capabilities
for future challenges.
Our sincere gratitude goes to Mr. P. Chaitanya, our Internship Coordinator, whose constant support, valuable feedback, and
motivating presence steered us through the challenges we encountered during the project. His leadership played a
critical role in the successful completion of our internship.
I am deeply indebted to Dr. V. Sai Priya, Head of the Department, for her guidance and for ensuring we had access to
the necessary resources and support throughout the internship. Her encouragement has been a driving force in our
progress.
My sincere thanks also go to Dr. B. Rathna Raju, Principal, for providing us with the opportunity to embark on this
journey, as well as for the continuous support extended during this period.
Finally, I would like to express my appreciation to our College Management, faculty, lab technicians, non-teaching
staff, and friends, who have played an essential role in helping us complete the internship. Their timely support, both
direct and indirect, contributed greatly to our success.
ABSTRACT
This whitepaper helps architects, data scientists, and developers understand the big data analytics options
available in the AWS Cloud by providing an overview of the relevant services. It concludes with scenarios
that showcase these analytics options in use, as well as additional resources for getting started with big
data analytics on AWS.
Objectives:
1. Data Integration and Centralization:
Unify diverse datasets from educational institutions, encompassing student records,
faculty information, academic performance, and administrative data.
2. Real-time Data Processing:
Enable near-real-time processing and analysis of data, facilitating quick decision-making for
educational administrators and policymakers.
3. Security and Compliance:
Implement stringent security measures to safeguard sensitive student and faculty
information, adhering to regulatory standards set forth by AICTE.
4. Scalable Infrastructure:
Design and deploy a scalable data infrastructure on AWS, ensuring it can adapt to the
growing data volumes from an expanding network of technical institutions.
5. Analytics and Reporting:
Establish a comprehensive analytics framework for generating actionable insights,
supporting informed decision-making at both institutional and regulatory levels.
6. Collaborative Data Ecosystem:
Promote collaboration between educational institutions by facilitating secure data
exchange and interoperability within the AWS environment.
INTERNSHIP ACTIVITIES
WEEK – 1:
• Identify the risks and approaches to secure and govern data at each step and each
transition of the data pipeline.
• Identify scaling considerations and best practices for building pipelines that
handle large-scale datasets.
• Design and build a data collection process while considering constraints such as
scalability, cost, fault tolerance, and latency.
CodeWhisperer code generation offers many benefits for software development organizations. It
accelerates application development for faster delivery of software solutions. By automating repetitive tasks,
it optimizes the use of developer time, so developers can focus on more critical aspects of the project.
Additionally, code generation helps mitigate security vulnerabilities, safeguarding the integrity of the
codebase. CodeWhisperer also protects open source intellectual property by providing an open source
reference tracker. CodeWhisperer enhances code quality and reliability, leading to robust and efficient
applications, and it supports an efficient response to evolving software threats, keeping the codebase up to
date with the latest security practices. Overall, CodeWhisperer has the potential to increase development
speed, security, and the quality of software.
DATA DRIVEN ORGANIZATIONS
Data Driven Decisions:
Another key characteristic of deriving insights by using your data pipeline is that the process will
almost always be iterative. You have a hypothesis about what you expect to find in the data, and
you need to experiment and see where it takes you. You might develop your hypothesis by using
BI tools to do initial discovery and analysis of data that has already been collected. You might
iterate within a pipeline segment, or you might iterate across the entire pipeline. For example, an
initial iteration (1) might yield a result that is less well defined than desired, so the data scientist
refines the model and reprocesses the data to get a better result (2). After reviewing those results,
they might determine that additional data could improve the detail of the result, so an additional
data source is tapped and ingested through the pipeline to produce the desired result (3). A
pipeline often has iterations of
storage and processing. For example, after the external data is ingested into pipeline storage,
iterative processing transforms the data into different levels of refinement for different needs.
WEEK – 2:
The reality is that a modern architecture might include all of these elements. The key to a modern
data architecture is to apply the three-pronged strategy that you learned about earlier. Modernize
the technology that you are using. Unify your data sources to create a single source of truth that can
be accessed and used across the organization. And innovate to get higher value analysis from the
data that you have.
The architecture includes the following AWS purpose-built services that integrate with
Amazon S3 and map to each of the components described previously:
• Amazon Redshift is a fully managed data warehouse service.
• Amazon OpenSearch Service is a purpose-built data store and search engine that is
optimized for real-time analytics, including log analytics.
• Amazon EMR provides big data processing and simplifies some of the most complex
elements of setting up big data processing.
• Amazon Aurora provides a relational database engine that was built for the cloud.
• Amazon DynamoDB is a fully managed nonrelational database that is designed to run
high-performance applications.
• Amazon SageMaker is an AI/ML service that democratizes access to the ML process.
Modern data architecture pipeline: Ingestion and storage:
Data being ingested into the Amazon S3 data lake arrives at the landing zone, where it is first
cleaned and then stored in the raw zone for permanent storage. Because data that is destined for the
data warehouse needs to be highly trusted and conformed to a schema, the data needs to be
processed further. Additional transformations include applying the schema and partitioning
(structuring), as well as any other transformations required to make the data conform to the
requirements established for the trusted zone. Finally, the processing layer prepares the data
for the curated zone by modeling and augmenting it to be joined with other datasets (enrichment)
and then stores the transformed, validated data in the curated layer. Datasets from the curated
layer are ready to be ingested into the data warehouse to make them available for low-latency
access or complex SQL querying.
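To make the trusted-zone processing step concrete, the following is a minimal PySpark sketch of promoting raw JSON data to partitioned Parquet in the trusted zone. The bucket names, paths, and the simple orders schema are illustrative assumptions, and such a job would normally run on Amazon EMR or AWS Glue with the appropriate S3 connectors configured.

# Sketch: promote raw JSON data to the trusted zone as partitioned Parquet.
# Bucket names, paths, and the schema below are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

spark = SparkSession.builder.appName("raw-to-trusted").getOrCreate()

schema = StructType([
    StructField("order_id", StringType(), False),
    StructField("region", StringType(), True),
    StructField("amount", DoubleType(), True),
    StructField("order_date", DateType(), True),
])

# Read raw JSON from the raw zone, enforcing the agreed schema.
raw_df = spark.read.schema(schema).json("s3://example-data-lake/raw/orders/")

# Drop records that fail basic quality checks before promotion.
trusted_df = raw_df.dropna(subset=["order_id"]).dropDuplicates(["order_id"])

# Write partitioned Parquet to the trusted zone for efficient querying.
trusted_df.write.mode("overwrite").partitionBy("region").parquet(
    "s3://example-data-lake/trusted/orders/"
)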
Producers ingest records onto the stream. Producers are integrations that collect data from
a source and load it onto the stream. Consumers process records. Consumers read data from the
stream and perform their own processing on it. The stream itself provides a temporary but durable
storage layer for the streaming solution. In the pipeline depicted here, Amazon CloudWatch
Events is the producer that puts event data onto the stream.
Kinesis Data Streams provides the storage. The data is then available to multiple consumers.
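As a rough illustration of the producer and consumer roles described above, the following boto3 sketch writes a record to a Kinesis data stream and reads it back from one shard; the stream name and record contents are assumptions made for the example.

# Sketch of a Kinesis Data Streams producer and consumer using boto3.
# The stream name "pipeline-events" and the record payload are assumptions.
import json
import boto3

kinesis = boto3.client("kinesis")

# Producer: put a record onto the stream.
kinesis.put_record(
    StreamName="pipeline-events",
    Data=json.dumps({"event": "object_created", "key": "raw/orders/1.json"}).encode("utf-8"),
    PartitionKey="orders",
)

# Consumer: read records from the first shard of the stream.
shard_id = kinesis.describe_stream(StreamName="pipeline-events")["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName="pipeline-events",
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]
for record in kinesis.get_records(ShardIterator=iterator, Limit=10)["Records"]:
    print(json.loads(record["Data"]))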
WEEK – 3:
AWS CloudFormation is a fully managed service that provides a common language for you to
describe and provision all of the infrastructure resources in your cloud environment.
CloudFormation creates, updates, and deletes the resources for your applications in environments
called stacks. A stack is a collection of AWS resources that are managed as a single unit.
CloudFormation is all about automated resource provisioning: it simplifies the task of repeatedly and
predictably creating groups of related resources that power your applications. Resources are described in
text files (templates) by using JSON or YAML format.
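As a minimal sketch of stack-based provisioning, the following boto3 example creates a stack from an inline YAML template that declares a single S3 bucket; the stack and bucket names are illustrative assumptions.

# Sketch: provision a CloudFormation stack from an inline YAML template.
# The stack name and bucket name are illustrative assumptions.
import boto3

template_body = """
AWSTemplateFormatVersion: '2010-09-09'
Description: Minimal example stack with a single S3 bucket
Resources:
  ExampleDataBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: example-internship-data-bucket
"""

cloudformation = boto3.client("cloudformation")
cloudformation.create_stack(StackName="example-data-stack", TemplateBody=template_body)

# Wait until the stack (the unit of managed resources) reaches CREATE_COMPLETE.
cloudformation.get_waiter("stack_create_complete").wait(StackName="example-data-stack")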
INGESTING & PREPARING DATA
Data wrangling:
Transforming large amounts of unstructured or structured raw data from multiple sources with different
schemas into a meaningful set of data that has value for downstream processes or users.
Data Structuring:
For the scenario that was described previously, the structuring step includes exporting a .json file from the
customer support ticket system, loading the .json file into Excel, and letting Excel parse the file. For the
mapping step for the supp2 data, the data engineer would modify the cust num field to match the customer
id field in the data warehouse.
For this example, you would perform additional data wrangling steps before compressing the file for
upload to the S3 bucket.
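The same structuring, mapping, and compression steps could also be scripted rather than done in Excel. The following pandas and boto3 sketch is one possible version, assuming hypothetical file, field, and bucket names.

# Sketch of the structuring, mapping, and compression steps in Python.
# File names, field names, and the bucket are illustrative assumptions.
import boto3
import pandas as pd

# Structuring: parse the exported support-ticket JSON file.
tickets = pd.read_json("supp2_tickets.json")

# Mapping: rename the source field to match the warehouse schema.
tickets = tickets.rename(columns={"cust num": "customer_id"})

# Compress the wrangled dataset and upload it to the S3 bucket.
tickets.to_csv("supp2_tickets.csv.gz", index=False, compression="gzip")
boto3.client("s3").upload_file(
    "supp2_tickets.csv.gz", "example-data-lake", "landing/supp2/supp2_tickets.csv.gz"
)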
Data Cleaning:
Cleaning addresses issues such as missing values, duplicate records, and inconsistent formats before the
data moves further along the pipeline.
To generalize the characteristics of batch processing, batch ingestion involves running batch jobs
that query a source, move the resulting dataset or datasets to durable storage in the pipeline, and
then perform whatever transformations are required for the use case. As noted in the Ingesting and
Preparing Data module, this could be just cleaning and minimally formatting data to put it into the
lake. Or, it could be more complex enrichment, augmentation, and processing to support complex
querying or big data and machine learning (ML) applications. Batch processing might be started
on demand, run on a schedule, or initiated by an event. Traditional extract, transform, and load
(ETL) uses batch processing, but extract, load, and transform (ELT) processing might also be done
by batch.
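To illustrate starting a batch job on demand, the following boto3 sketch triggers an AWS Glue job run and polls its status; the job name is an assumption, and the Glue job itself (for example, a Spark ETL script) would be defined separately.

# Sketch: start a batch ETL job on demand with AWS Glue and check its status.
# The job name "orders-batch-etl" is an illustrative assumption.
import time
import boto3

glue = boto3.client("glue")
run_id = glue.start_job_run(JobName="orders-batch-etl")["JobRunId"]

# Poll until the batch run finishes; scheduled or event-driven starts would
# typically be configured through EventBridge rather than ad hoc polling.
while True:
    state = glue.get_job_run(JobName="orders-batch-etl", RunId=run_id)["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        print("Batch run finished with state:", state)
        break
    time.sleep(30)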
Data ingestion is the process of transporting data from one or more sources to a target site for further
processing and analysis. This data can originate from a range of sources, including data lakes, IoT devices,
on-premises databases, and SaaS apps, and can end up in different target environments, such as cloud
data warehouses or data marts.
Use Amazon AppFlow to ingest data from a software as a service (SaaS) application. You can
do the following with Amazon AppFlow:
• Create a connector that reads from a SaaS source and includes filters.
• Map fields in each source object to fields in the destination and perform
transformations.
• Perform validation on records to be transferred.
• Securely transfer to Amazon S3 or Amazon Redshift. You can trigger an ingestion on
demand, on event, or on a schedule.
An example use case for Amazon AppFlow is to ingest customer support ticket data from the Zendesk
SaaS product.
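The sketch below shows how such a flow could be run on demand with boto3, assuming a flow named zendesk-tickets-to-s3 has already been configured in AppFlow with its Zendesk connector, field mappings, validations, and S3 destination.

# Sketch: trigger an on-demand run of an existing Amazon AppFlow flow.
# The flow name "zendesk-tickets-to-s3" is an illustrative assumption.
import boto3

appflow = boto3.client("appflow")

# Start the flow on demand; flows can also run on a schedule or on an event.
execution_id = appflow.start_flow(flowName="zendesk-tickets-to-s3")["executionId"]
print("Started flow execution:", execution_id)

# Inspect recent runs of the flow, including their status.
runs = appflow.describe_flow_execution_records(flowName="zendesk-tickets-to-s3")
for run in runs["flowExecutions"]:
    print(run["executionId"], run["executionStatus"])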
WEEK – 5:
Data in cloud object storage is handled as objects. Each object is assigned a key, which is a
unique identifier. When the key is paired with metadata that is attached to the object, other AWS
services can use the information to unlock a multitude of capabilities. Thanks to economies
of scale, cloud object storage comes at a lower cost than traditional storage.
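A small boto3 sketch of the key-plus-metadata idea follows; the bucket, key, and metadata values are assumptions chosen for illustration.

# Sketch: store an object under a unique key with user-defined metadata,
# then read that metadata back without downloading the object body.
# Bucket, key, and metadata values are illustrative assumptions.
import boto3

s3 = boto3.client("s3")

s3.put_object(
    Bucket="example-data-lake",
    Key="raw/orders/2024/06/orders.json",
    Body=b'{"order_id": "1001", "amount": 25.0}',
    Metadata={"source-system": "orders-api", "ingest-date": "2024-06-01"},
)

# Other services or later pipeline steps can use the key and metadata.
head = s3.head_object(Bucket="example-data-lake", Key="raw/orders/2024/06/orders.json")
print(head["Metadata"])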
Here are a few key points to summarize this section:
• Storage plays an integral part in ELT and ETL pipelines. Data often moves in and out of storage
numerous times, based on pipeline type and workload type.
• ETL pipelines transform data in buffered memory prior to loading it into a data lake or
data warehouse for storage. Levels of buffered memory vary by service.
• ELT pipelines extract and load data into data lake or data warehouse storage without
transformation. The transformation of the data is part of the target system’s workload.
Securing Storage:
Security for a data warehouse in Amazon Redshift:
• Amazon Redshift database security is distinct from the security of the service itself.
• Amazon Redshift provides additional features to manage database security.
• Due to third-party auditing, Amazon Redshift can help to support applications that are
required to meet international compliance standards.
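To illustrate that database security is managed separately from the security of the service itself, the following sketch uses the Redshift Data API to grant a database user read access to a table; the cluster, database, user, and table names are illustrative assumptions.

# Sketch: manage database-level permissions in Amazon Redshift with the
# Redshift Data API. Cluster, database, user, and table names are assumptions.
import boto3

redshift_data = boto3.client("redshift-data")

# Grant read-only access on a reporting table to an analyst database user.
redshift_data.execute_statement(
    ClusterIdentifier="example-redshift-cluster",
    Database="analytics",
    DbUser="admin_user",
    Sql="GRANT SELECT ON TABLE reporting.student_performance TO analyst_user;",
)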
WEEK – 6:
Apache Hadoop:
Apache Spark:
ML Concepts:
ML Life Cycle:
Fig. 8.4: ML life cycle
Framing the ML Problem to Meet Business Goals: Working backwards from the business problem to be
solved:
• What is the business problem?
• What pain is it causing?
• Why does this problem need to be resolved?
• What will happen if you don't solve this problem?
• How will you measure success?
Collecting Data:
Data characteristics:
• How much data is there?
• At what speed and volume does it arrive?
• How frequently is it updated?
• How quickly is it processed?
• What type of data is it?
Data from multiple sources is put in Amazon S3, where Athena can be used for one-time queries.
Amazon EMR aggregates the data and stores the aggregates in S3, and Athena can also be used to
query the aggregated datasets. From S3, the data can be loaded into Amazon Redshift, where
QuickSight can access the data to create visualizations.
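A one-time Athena query over the data in S3 could look like the following boto3 sketch; the database, table, and results location are illustrative assumptions.

# Sketch: run a one-time Athena query against data stored in Amazon S3.
# The database, table, and output location are illustrative assumptions.
import time
import boto3

athena = boto3.client("athena")

query_id = athena.start_query_execution(
    QueryString="SELECT region, SUM(amount) AS total FROM orders_aggregated GROUP BY region",
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)["QueryExecutionId"]

# Poll for completion, then print the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])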
If you build infrastructure with code, you gain the benefits of repeatability and reusability while you build
your environments. For example, a single template can be used to deploy Network Load Balancers and
Auto Scaling groups that contain Amazon Elastic Compute Cloud (Amazon EC2) instances. Network Load
Balancers distribute traffic evenly across targets.
CI/CD:
CI/CD can be pictured as a pipeline, where new code is submitted on one end, tested over a series of
stages (source, build, test, staging, and production), and then published as production-ready code.
With Step Functions, you can use visual workflows to coordinate the components of
distributed applications and microservices.
You define a workflow, which is also referred to as a state machine, as a series of steps
and transitions between each step.
Step Functions is integrated with Athena to facilitate building workflows that include
Athena queries and data processing operations.
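As a minimal sketch of this integration, the following example defines a one-step state machine that runs an Athena query through the built-in Athena service integration; the role ARN, workgroup, and query are illustrative assumptions.

# Sketch: create a Step Functions state machine (a workflow of steps and
# transitions) that runs an Athena query via the Athena service integration.
# The role ARN, workgroup, and query are illustrative assumptions.
import json
import boto3

definition = {
    "Comment": "Run an Athena query, then finish",
    "StartAt": "RunAthenaQuery",
    "States": {
        "RunAthenaQuery": {
            "Type": "Task",
            "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
            "Parameters": {
                "QueryString": "SELECT COUNT(*) FROM analytics_db.orders_aggregated",
                "WorkGroup": "primary",
            },
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="athena-query-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/example-stepfunctions-role",
)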
CONCLUSION
Data engineering is a critical component in the modern data landscape, playing a crucial role in the
success of data-driven decision-making and analytics. As we draw conclusions about data
engineering, several key points come to the forefront: