Data Split
by
K.SAI ADITYA 21131A4213
K.SOWMYA 21131A4215
P.AMRUTHA 21131A4230
P.YUVA SRI 22135A4205
CERTIFICATE
This report on
“AWS DATA ENGINEERING VIRTUAL INTERNSHIP”
is a bonafide record of the Internship work submitted
by
K.SAI ADITYA 21131A4213
K.SOWMYA 21131A4215
P.AMRUTHA 21131A4230
P.YUVA SRI 22135A4205
We would like to express our deep sense of gratitude to our esteemed institute
Gayatri Vidya Parishad College of Engineering (Autonomous), which has provided us
an opportunity to fulfil our cherished desire.
We are highly indebted to Dr. D. UMA DEVI, Associate Professor and Head of the
Department of Computer Science and Engineering and the Department of Computer
Science and Engineering (AI-ML), Gayatri Vidya Parishad College of Engineering
(Autonomous), for giving us an opportunity to do the internship in college.
Finally, we are indebted to the teaching and non-teaching staff of the Computer
Science and Engineering Department for all their support in the completion of our
project.
Data engineering is the science of analysing raw data to make conclusions about
that information. Data engineering relies on a variety of software tools, ranging
from spreadsheets, data visualization and reporting tools, and data mining programs
to open-source languages, for data manipulation. Data analysis mainly deals with
data collection, data storage, data preprocessing and data visualisation.
In course-1 we learnt about cloud computing. Cloud computing is the on-demand
delivery of compute power, database, storage, applications, and other IT resources
via the internet with pay-as-you-go pricing. These resources run on server computers
that are located in large data centers in different locations around the world. When
you use a cloud service provider like AWS, that service provider owns the computers
that you are using. This course deals with the following main concepts: compute
services, storage services, management services, database services, compliance
services, and AWS cost management services.
As part of course-2 we learnt data engineering, which deals with turning raw data
into solutions. In this course we learnt about big data, which is the first and most
important input for data analysis and is therefore central to data engineering. Big
data also brings the problem of storing the data, so we learnt about the different
tools used for data storage and how to analyse and preprocess big data. The main
concepts include storage and analytics with AWS services such as Amazon S3, Amazon
Athena, Amazon Redshift, AWS Glue, Amazon SageMaker and AWS IoT Analytics.
1. INTRODUCTION TO CLOUD CONCEPTS
The three cloud computing deployment models:
I. Public
II. Private
III. Hybrid
The advantages of cloud computing:
2. CLOUD ECONOMICS AND BILLING
WHAT IS AWS?
AWS is designed to allow application providers, ISVs, and vendors to quickly
and securely host your applications – whether an existing application or a new SaaS-based
application. You can use the AWS Management Console or well-documented web services
APIs to access AWS's application hosting platform.
HOW DO WE PAY FOR THE RESOURCES USED IN AWS?
1. Pay for what you use
2. Pay less when you reserve
3. Pay less when you use more
4. Pay even less as AWS grows
AWS realizes that every customer has different needs. If none of the AWS pricing models
work for your project, custom pricing is available for high-volume projects with unique
requirements. There are some rules associated with how you pay AWS:
• There is no charge (with some exceptions) for:
o Inbound data transfer.
o Data transfer between services within the same AWS Region.
• Pay for what you use.
• Start and stop anytime.
• No long-term contracts are required.
Some services are free, but the other AWS services that they provision might not be free.
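As an informal illustration of these pricing principles, the short Python sketch below compares a hypothetical on-demand rate with a hypothetical reserved rate for the same number of instance-hours. The numbers are invented; real AWS prices vary by service, Region, and usage tier.

# Hypothetical pay-as-you-go comparison; the rates below are NOT real AWS prices.
ON_DEMAND_RATE = 0.10   # assumed $ per instance-hour, on demand
RESERVED_RATE = 0.06    # assumed $ per instance-hour with a reservation

hours_used = 300        # hours the instance actually ran this month

print(f"On-demand cost: ${hours_used * ON_DEMAND_RATE:.2f}")
print(f"Reserved cost:  ${hours_used * RESERVED_RATE:.2f}  (pay less when you reserve)")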
WHAT IS TCO?
Total Cost of Ownership (TCO) is a financial estimate that helps identify the direct
and indirect costs of a system.
4. AWS CLOUD SECURITY
This shared model can help relieve the customer's operational burden, as AWS operates,
manages and controls the components from the host operating system and virtualization
layer down to the physical security of the facilities in which the services operate.
AWS's responsibility under this model is protecting the infrastructure that runs all the
services offered in the AWS Cloud. This infrastructure is composed of the hardware,
software, networking and facilities that run AWS Cloud services.
IAM:
With AWS Identity and Access Management (IAM) you can specify who or what can access
services and resources in AWS. IAM is a web service that helps you securely control access
to AWS resources. We use IAM to control who is authenticated and authorized to use resources.
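As a small illustration of working with IAM programmatically, the boto3 sketch below lists the users in an account and attaches the AWS managed AmazonS3ReadOnlyAccess policy to one of them. The user name is a placeholder, and the snippet assumes AWS credentials are already configured locally.

# Sketch: basic IAM operations with boto3 (credentials must be configured).
import boto3

iam = boto3.client("iam")

# Enumerate the IAM users in the account
for user in iam.list_users()["Users"]:
    print(user["UserName"])

# Grant a (hypothetical) user read-only access to Amazon S3
iam.attach_user_policy(
    UserName="example-user",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)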
Make your account more secure by –
5. NETWORKING AND CONTENT DELIVERY
This module includes some activities that challenge you to label a network diagram and
design a basic VPC architecture.
NETWORKING BASICS:
A computer network is two or more client machines that are connected
together to share resources. A network can be logically partitioned into subnets. Networking
requires a networking device (such as a router or switch) to connect all the clients together
and enable communication between them.
Each client machine in a network has a unique Internet Protocol (IP) address
that identifies it. A 32-bit IP address is called an IPv4 address. A 128-bit IP address is called
an IPv6 address. IPv6 addresses can accommodate more user devices. A common method to
describe networks is Classless Inter-Domain Routing (CIDR). CIDR is a way to express a
group of IP addresses that are consecutive to each other: a fixed prefix of bits identifies
the network, and the bits that are not fixed are allowed to change to identify the
individual addresses in the group.
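The CIDR idea can be explored with Python's standard ipaddress module, as in the sketch below; the 10.0.0.0/24 block is just an example.

# Sketch: inspecting a CIDR block with the standard library.
import ipaddress

network = ipaddress.ip_network("10.0.0.0/24")

print(network.num_addresses)      # 256 addresses in a /24 block
print(network.network_address)    # 10.0.0.0
print(network.broadcast_address)  # 10.0.0.255
print(list(network.hosts())[:3])  # first few usable host addresses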
AMAZON VPC:
Amazon Virtual Private Cloud (Amazon VPC) is a service that lets you
provision a logically isolated section of the AWS Cloud (called a virtual private cloud, or
VPC) where you can launch your AWS resources.
VPC NETWORKING:
An internet gateway is a scalable, redundant, and highly available VPC
component that allows communication between instances in your VPC and the internet. A
network address translation (NAT) gateway enables instances in a private subnet to connect
to the internet or other AWS services, but prevents the internet from initiating a connection
with those instances. There are several VPC networking options, which include: internet
gateway, NAT gateway, VPC endpoint, VPC peering, VPC sharing, AWS Site-to-Site VPN,
AWS Direct Connect, AWS Transit Gateway, etc.
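A minimal sketch of these building blocks with boto3 is shown below: it creates a VPC, one subnet, and an internet gateway, then routes internet-bound traffic from the subnet through the gateway. The CIDR ranges are illustrative, and the snippet assumes AWS credentials are configured.

# Sketch: a small VPC with one public subnet and an internet gateway.
import boto3

ec2 = boto3.client("ec2")

vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]
subnet_id = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")["Subnet"]["SubnetId"]

igw_id = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)

# Route internet-bound traffic (0.0.0.0/0) from the subnet through the gateway
rtb_id = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
ec2.create_route(RouteTableId=rtb_id, DestinationCidrBlock="0.0.0.0/0", GatewayId=igw_id)
ec2.associate_route_table(RouteTableId=rtb_id, SubnetId=subnet_id)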
AMAZON ROUTE 53:
Amazon Route 53 is a highly available and scalable cloud Domain Name
System (DNS) web service. It is designed to give developers and businesses a reliable and
cost-effective way to route users to internet applications by translating names (like
www.example.com) into the numeric IP addresses (like 192.0.2.1) that computers use to
connect to each other.
Amazon Route 53 supports several types of routing policies, which determine
how Amazon Route 53 responds to queries:
• Simple routing (round robin)
• Weighted round robin routing
• Latency routing (LBR)
• Geolocation routing
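As an example of one of these policies, the boto3 sketch below upserts a weighted A record; the hosted zone ID, record name, and IP address are placeholders.

# Sketch: a weighted round robin record in Route 53.
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z1EXAMPLE",             # placeholder hosted zone
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "www.example.com",
                "Type": "A",
                "SetIdentifier": "primary",   # distinguishes weighted records
                "Weight": 70,                 # roughly 70% of matching queries
                "TTL": 300,
                "ResourceRecords": [{"Value": "192.0.2.1"}],
            },
        }],
    },
)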
AMAZON CLOUDFRONT:
Amazon CloudFront is a fast CDN service that securely delivers data, videos, applications,
and application programming interfaces (APIs) to customers globally with low latency and
high transfer speeds. Amazon CloudFront is a self-service offering with pay-as-you-go
pricing.
Amazon CloudFront benefits:
• Fast and global
• Security at the edge
• Highly programmable
• Deeply integrated with AWS
• Cost-effective
6. COMPUTE
7. STORAGE
AMAZON EBS:
Amazon EBS provides persistent block storage volumes for use with Amazon
EC2 instances. Persistent storage is any data storage device that retains data after power to
that device is shut off; it is also sometimes called non-volatile storage. Each Amazon EBS
volume is automatically replicated within its Availability Zone to protect you from
component failure. It is designed for high availability and durability. Amazon EBS volumes
provide the consistent and low-latency performance that is needed to run your workloads.
Amazon EBS enables you to create individual storage volumes and attach them to an Amazon
EC2 instance, for use cases such as:
• Database hosts
• Enterprise applications
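A brief boto3 sketch of the create-and-attach workflow is shown below; the Availability Zone, volume size, and instance ID are placeholders.

# Sketch: create an EBS volume and attach it to an EC2 instance.
import boto3

ec2 = boto3.client("ec2")

volume = ec2.create_volume(AvailabilityZone="us-east-1a", Size=20, VolumeType="gp3")

# Wait until the volume is available before attaching it
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",   # hypothetical instance ID
    Device="/dev/sdf",
)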
AMAZON S3:
Amazon S3 is object storage that is built to store and retrieve any amount of data from
anywhere: websites and mobile apps, corporate applications, and data from Internet of Things
(IoT) sensors or devices. Amazon S3 is object-level storage, which means that if you want
to change a part of a file, you must make the change and then re-upload the entire modified
file. Amazon S3 stores data as objects within resources that are called buckets. The data that
you store in Amazon S3 is not associated with any particular server, and you do not need to
manage any infrastructure yourself. You can put as many objects into Amazon S3 as you want.
Amazon S3 holds trillions of objects and regularly peaks at millions of requests per second.
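The object-level model described above is visible in the boto3 sketch below, which uploads, downloads, and lists objects; the bucket name is a placeholder and must be globally unique in practice.

# Sketch: storing and retrieving objects in Amazon S3.
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-bucket"   # placeholder bucket name

# Re-uploading replaces the whole object, reflecting S3's object-level storage model
s3.upload_file("sales.csv", BUCKET, "raw/sales.csv")
s3.download_file(BUCKET, "raw/sales.csv", "sales_copy.csv")

for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix="raw/").get("Contents", []):
    print(obj["Key"], obj["Size"])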
AMAZON EFS:
Amazon EFS implements storage for EC2 instances that multiple virtual machines can access
at the same time. It is implemented as a shared file system that uses the Network File System
(NFS) protocol. Amazon Elastic File System (Amazon EFS) provides simple, scalable,
elastic file storage for use with AWS services and on - premises resources. It offers a simple
interface that enables you to create and configure file systems quickly and easily. Amazon
EFS is built to dynamically scale on demand without disrupting applications—it will grow
and shrink automatically as you add and remove files. It is designed so that your applications
have the storage they need, when they need it.
AMAZON S3 GLACIER:
Amazon S3 Glacier is a secure, durable, and extremely low-cost cloud storage
service for data archiving and long-term backup. Amazon S3 Glacier is designed for
security and durability, with data retrieval options that range from a few minutes to
several hours.
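Archiving is often automated with an S3 lifecycle rule rather than by uploading archives directly; the boto3 sketch below, with a placeholder bucket and prefix, transitions objects to the GLACIER storage class after 90 days.

# Sketch: lifecycle rule that archives old objects to S3 Glacier.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-bucket",       # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-old-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }],
    },
)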
8. DATABASES
Fig:8-AWS Databases
Challenges of relational databases:
• Server maintenance and energy footprint
• Software installation and patches
• Database backups and high availability
• Limits on scalability
• Data security
• Operating system (OS) installation and patches
There are other AWS databases along with this; they are:
• Amazon DynamoDB
• Amazon Redshift
9. CLOUD ARCHITECTURE
Cloud architects:
• Engage with decision makers to identify the business goal and the capabilities that
need improvement.
• Ensure alignment between technology deliverables of a solution and the business
goals.
• Work with delivery teams that are implementing the solution to ensure that the
technology features are appropriate.
The AWS Well-Architected Framework is:
• A way to provide best practices that were developed through lessons learned by
reviewing customer architectures.
Fig:9-Cloud Architecture
AWS TRUSTED ADVISOR:
AWS Trusted Advisor is an online tool that provides real time guidance to help you
provision your resources following AWS best practices.
• Cost Optimization: AWS Trusted Advisor looks at your resource use and makes
recommendations to help you optimize cost by eliminating unused and idle resources,
or by making commitments to reserved capacity.
• Performance: Improve the performance of your service by checking your service
limits, ensuring you take advantage of provisioned throughput, and monitoring for
overutilized instances.
• Security: Improve the security of your application by closing gaps, enabling various
AWS security features, and examining your permissions.
• Fault Tolerance: Increase the availability and redundancy of your AWS application
by taking advantage of automatic scaling, health checks, Multi-AZ deployments, and
backup capabilities.
• Service Limits: AWS Trusted Advisor checks for service usage that is more than
80 percent of the service limit. Values are based on a snapshot, so your current usage
might differ. Limit and usage data can take up to 24 hours to reflect any changes
10. AUTO SCALING AND MONITORING
• Amazon CloudWatch helps you monitor your AWS resources and the applications
that you run on AWS in real time.
AMAZON EC2 AUTO SCALING:
• Scaling is the ability to increase or decrease the compute capacity of your application
• It helps you maintain application availability
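As a small example of the metric-driven monitoring that Auto Scaling reacts to, the boto3 sketch below creates a CloudWatch alarm on average EC2 CPU utilization for a hypothetical Auto Scaling group.

# Sketch: CloudWatch alarm on average CPU utilization.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="high-cpu-example",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "example-asg"}],  # placeholder group
    Statistic="Average",
    Period=300,                     # evaluate in 5-minute windows
    EvaluationPeriods=2,
    Threshold=70.0,                 # percent CPU
    ComparisonOperator="GreaterThanThreshold",
)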
1. WELCOME TO AWS DATA ENGINEERING
Data engineering is the process of designing and building systems that let people collect and analyze raw
data from multiple sources and formats. These systems empower people to find practical applications of
the data, which businesses can use to thrive.
Data engineering is a skill that is in increasing demand. Data engineers are the people who design the
system that unifies data and can help you navigate it. Data engineers perform many different tasks
including:
• Acquisition: Finding all the different data sets around the business
• Cleansing: Finding and cleaning any errors in the data
• Conversion: Giving all the data a common format
• Disambiguation: Interpreting data that could be interpreted in multiple ways
• Deduplication: Removing duplicate copies of data
Once this is done, data may be stored in a central repository such as a data lake or data lakehouse.
Data engineers may also copy and move subsets of data into a data warehouse.
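A few of these tasks (cleansing, conversion, and deduplication) can be sketched with pandas as below; the DataFrame and its columns are invented for illustration.

# Sketch: cleansing, conversion, and deduplication on a toy dataset.
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "signup_date": ["2024-01-05", "2024-01-06", "2024-01-06", None],
    "country":     ["US", "us", "us", "IN"],
})

clean = (
    raw
    .drop_duplicates()                                       # deduplication
    .assign(country=lambda df: df["country"].str.upper())    # cleansing
    .assign(signup_date=lambda df: pd.to_datetime(df["signup_date"]))  # conversion
    .dropna(subset=["signup_date"])                          # drop records missing a date
)
print(clean)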
Architecture of Data Engineering
2. Data Driven Organizations
3. The Elements of Data
Structured Data:
Structured data represents the foundation of traditional databases, characterized by its organized
format, predictable schema, and tabular structure. This form of data lends itself well to relational
databases, where information is stored in rows and columns, enabling efficient storage, retrieval, and
analysis. Examples of structured data include transaction records, customer profiles, and inventory
lists, each meticulously organized to facilitate easy access and manipulation.
Unstructured Data:
In contrast to structured data, unstructured data defies conventional categorization,
encompassing a wide array of formats, including text, images, videos, and audio recordings. This form
of data lacks a predefined schema, making it inherently more challenging to analyze and interpret.
However, within the realm of unstructured data lies a treasure trove of insights, waiting to be
unearthed through advanced analytics, natural language processing, and machine learning
algorithms.
Semi-Structured Data:
Semi-structured data occupies a unique space between structured and unstructured data,
combining elements of both. While it may possess some semblance of organization, such as tags
or metadata, it lacks the rigid structure of traditional databases. Examples of semi-structured data
include XML files, JSON Documents, and log files, each offering a flexible framework for
storing and exchanging information across disparate systems.
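The sketch below shows how such a semi-structured JSON document might be flattened into tabular form with pandas; the field names are invented.

# Sketch: flattening a nested JSON document into rows and columns.
import json
import pandas as pd

doc = """
{
  "order_id": "A-1001",
  "customer": {"name": "Asha", "country": "IN"},
  "items": [
    {"sku": "BOOK-1", "qty": 2},
    {"sku": "PEN-7",  "qty": 5}
  ]
}
"""

record = json.loads(doc)

# One row per item, carrying order_id and the nested customer country along
items = pd.json_normalize(record, record_path="items",
                          meta=["order_id", ["customer", "country"]])
print(items)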
The Lifecycle of Data:
Beyond its structural composition, data traverses a lifecycle – from its inception to its eventual
obsolescence. This lifecycle encompasses five distinct stages: capture, storage, processing, analysis,
and dissemination. At each stage, data undergoes transformation, refinement, and enrichment, evolving
from mere bytes to actionable insights that drive informed decision-making and strategic planning.
4. Data Principles and Patterns for Data Pipelines
This exploration delves into the intricacies of data pipeline design, unveiling the principles and
patterns that enable organizations to orchestrate data flows efficiently, reliably, and at scale.
1. Extract, Transform, Load (ETL): This classic pattern involves extracting data from various
sources, transforming it into a structured format, and loading it into a destination for analysis or
storage. ETL pipelines are well-suited for batch processing scenarios, where data is collected and
processed in discrete chunks (a minimal sketch of this pattern follows the list).
3. Lambda Architecture: Combining the strengths of both batch and stream processing, the
Lambda architecture provides a framework for building robust, fault-tolerant data pipelines
that can handle both historical and real-time data. By leveraging batch and speed layers in
parallel, organizations can achieve comprehensive insights with low latency.
4. Microservices Architecture: In the realm of distributed systems, microservices architecture
offers a modular approach to building data pipelines, where individual components or services
are decoupled and independently deployable. This pattern enables greater agility, scalability, and
fault isolation, albeit at the cost of increased complexity.
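The ETL pattern from item 1 above can be made concrete with a short, self-contained sketch. The file name orders.csv, its columns, and the SQLite target are hypothetical stand-ins; a production pipeline would typically load into a data warehouse such as Amazon Redshift instead.

# Minimal ETL sketch: CSV in, cleaned rows out, loaded into a local table.
import csv
import sqlite3

def extract(path):
    # Read rows from a source CSV file (assumed columns: order_id, country, amount)
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    # Normalize country codes and cast amounts to a common numeric format
    for row in rows:
        yield (row["order_id"], row["country"].upper(), float(row["amount"]))

def load(records, db_path="warehouse.db"):
    # Load the transformed records into a destination table
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, country TEXT, amount REAL)")
    con.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("orders.csv")))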
Data Pipeline
Securing and Scaling the Data Pipeline
Scaling the data pipeline involves expanding its capacity and capabilities to accommodate
growing volumes of data, users, and workloads. This requires a strategic approach that encompasses
both vertical and horizontal scaling techniques. Vertical scaling involves adding more resources, such
as CPU, memory, or storage, to existing infrastructure to handle increased demand. Horizontal
scaling, on the other hand, involves distributing workloads across multiple nodes or instances, enabling
parallel processing and improved performance.
Scalability Considerations:
While scalability unlocks new opportunities for growth and innovation, it also introduces a host
of challenges and considerations. Organizations must carefully evaluate factors such as data volume,
velocity, variety, and veracity, as well as the underlying infrastructure, architecture, and technology stack.
Moreover, as data pipelines scale in complexity and size, managing dependencies, optimizing
performance, and ensuring fault tolerance become increasingly critical tasks.
Emerging Technologies and Best Practices:
To address these challenges, organizations are turning to emerging technologies and best practices
that offer scalable, secure, and efficient solutions for data pipeline management. This includes the
adoption of cloud-native architectures, containerization technologies such as Docker and Kubernetes,
serverless computing platforms like AWS Lambda and Google Cloud Functions, and distributed
processing frameworks such as Apache Spark and Apache Flink. Additionally, organizations are
leveraging DevOps practices, automation tools, and infrastructure-as-code principles to streamline
deployment, monitoring, and management of data pipelines.
5. Ingesting and Preparing Data
2. Stream Processing: Stream processing enables the ingestion of data in real- time as it is
generated or produced. This approach is ideal for scenarios where immediate insights or actions
are required, such as monitoring, anomaly detection, or fraud detection.
3. Change Data Capture (CDC): CDC techniques capture and replicate incremental changes
to data sources, enabling organizations to ingest only the modified or updated data, rather than
the entire dataset. This minimizes processing overhead and reduces latency, making it well-
suited for scenarios where data freshness is critical.
Data Preparation:
Once data is ingested into the organizational ecosystem, it must undergo a process of
preparation to make it suitable for analysis, modeling, or visualization. Data preparation
encompasses a range of activities, including cleaning, transforming, enriching, and
aggregating data to ensure its quality, consistency, and relevance. This process is often iterative
and involves collaboration between data engineers, data scientists, and domain experts to
identify, understand, and address data quality issues and inconsistencies.
1. Data Integration Platforms: Data integration platforms provide a unified environment for
orchestrating data ingestion, transformation, and loading tasks across disparate sources and
destinations. These platforms offer features such as data profiling, data cleansing, and data
enrichment to ensure data quality and consistency.
2. Data Wrangling Tools: Data wrangling tools empower users to visually explore, clean, and
transform data without writing code. These tools offer intuitive interfaces and built-in
algorithms for tasks such as missing value imputation, outlier detection, and feature
engineering, enabling users to prepare data more efficiently and effectively.
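The same kinds of operations that wrangling tools automate can also be expressed directly in pandas, as in the sketch below; the sensor readings, the median imputation, and the IQR-based outlier rule are illustrative choices.

# Sketch: missing-value imputation and simple outlier detection.
import pandas as pd

df = pd.DataFrame({"sensor_id": [1, 2, 3, 4, 5],
                   "reading":   [21.5, None, 22.1, 250.0, 21.8]})

# Impute missing readings with the median value
df["reading"] = df["reading"].fillna(df["reading"].median())

# Flag readings outside 1.5 * IQR of the quartiles as outliers
q1, q3 = df["reading"].quantile([0.25, 0.75])
iqr = q3 - q1
df["is_outlier"] = ~df["reading"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(df)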
Ingestion by Batch or Stream
Understanding Batch Data Ingestion:
Batch data ingestion involves collecting and processing data in discrete chunks or batches at
scheduled intervals. This approach is well-suited for scenarios where data latency is acceptable, such
as historical analysis, batch reporting, or periodic updates. Batch data ingestion offers several
advantages, including simplicity, scalability, and fault tolerance. By processing data in predefined
batches, organizations can optimize resource utilization, minimize processing overhead, and ensure
consistent performance even in the face of failures or disruptions.
Strategies for Stream Data Ingestion:
To implement stream data ingestion effectively, organizations leverage a variety of strategies
and technologies that enable real-time data processing and analysis. This includes:
6. Storing and Organizing Data
2. NoSQL Databases: NoSQL databases offer a flexible and scalable approach to storing and
querying unstructured or semi-structured data, using document, key-value, column-family, or
graph-based data models. This approach is well- suited for scenarios where data volumes are
large, data schemas are dynamic, or horizontal scalability is required.
Organizing Data for Efficiency: In addition to storage considerations, organizing data effectively is
critical to ensuring its usability, discoverability, and maintainability. Data organization encompasses
the processes and methodologies involved in structuring, categorizing, and indexing data to facilitate
efficient retrieval and analysis. This includes:
1. Data Modeling: Data modeling involves defining the structure, relationships, and constraints
of data entities and attributes, typically using entity-relationship diagrams, schema definitions,
or object-oriented models. This approach helps ensure data consistency, integrity, and
interoperability across the organization.
2. Data Partitioning: Data partitioning involves dividing large datasets into smaller, more
manageable partitions based on certain criteria, such as time, geography, or key ranges. This
approach helps distribute data processing and storage resources more evenly, improving
performance, scalability, and availability.
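A simple illustration of partitioning by time is writing a Parquet dataset partitioned on a date column, as in the pandas sketch below (it assumes the pyarrow engine is installed); the paths and columns are hypothetical.

# Sketch: writing a date-partitioned Parquet dataset.
import pandas as pd

events = pd.DataFrame({
    "event_date": ["2024-06-01", "2024-06-01", "2024-06-02"],
    "user_id":    [101, 102, 103],
    "action":     ["click", "view", "click"],
})

# Each distinct event_date becomes its own directory, e.g.
#   events/event_date=2024-06-01/<part file>.parquet
# so a query for one day only has to read that partition.
events.to_parquet("events", partition_cols=["event_date"])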
Technologies for Data Storage and Organization:
To facilitate data storage and organization effectively, organizations leverage a variety of technologies
and tools that offer scalability, reliability, and flexibility. This includes:
1. Cloud Storage Services: Cloud storage services, such as Amazon S3, Google Cloud
Storage, or Microsoft Azure Blob Storage, provide scalable and durable storage solutions for
storing and managing data in the cloud. These services offer features such as encryption,
versioning, and lifecycle management, making them well-suited for a wide range of use cases.
2. Data Lakes: Data lakes provide a centralized repository for storing and managing large
volumes of structured, semi-structured, and unstructured data in its native format. This approach
enables organizations to ingest, store, and analyze diverse datasets without the need for predefined
schemas or data transformations.
3. Workflow Management Systems: Workflow management systems, such as Apache Airflow,
Apache NiFi, or Luigi, provide a centralized platform for defining, scheduling, and executing data
processing workflows. These systems offer features such as task dependencies, scheduling, retry
mechanisms, and monitoring capabilities, making them well-suited for orchestrating complex data
pipelines.
4. Infrastructure as Code: Infrastructure as code (IaC) frameworks, such as Terraform or
AWS CloudFormation, enable organizations to automate the provisioning and configuration of
computing resources, such as virtual machines, containers, or serverless functions, needed to
execute data processing tasks. By defining infrastructure as code, organizations can ensure
consistency, reproducibility, and scalability in their data pipeline deployments.
Processing Big Data
To overcome these challenges, organizations employ a variety of strategies and technologies for big
data processing, each tailored to the unique requirements and characteristics of their data ecosystem.
This includes:
1. Lambda Architecture: The Lambda architecture provides a framework for building robust
and fault-tolerant big data processing pipelines that can handle both batch and stream
processing of data. By combining batch and speed layers in parallel, organizations can achieve
comprehensive insights with low latency, enabling real-time and near-real-time analytics.
2. Kappa Architecture: The Kappa architecture offers a simplified alternative to the Lambda
architecture by eliminating the batch layer and relying solely on stream processing for data
ingestion and analysis. This approach simplifies the architecture, reduces complexity, and
enables faster time-to-insight, making it well-suited for real-time analytics and event-driven
applications.
Best Practices for Big Data Processing:
To optimize big data processing workflows, organizations should adhere to several best practices,
including:
1. Data Partitioning and Sharding: Partitioning large datasets into smaller, more manageable
chunks enables parallel processing and improves scalability and performance. By dividing data
based on certain criteria, such as time, geography, or key ranges, organizations can distribute
processing and storage resources more evenly, minimizing bottlenecks and contention.
2. Data Compression and Serialization: Compressing and serializing data before processing
reduces storage and bandwidth requirements, improves data transfer speeds, and accelerates
query execution. By using efficient compression algorithms and serialization formats, such as
Apache Avro or Protocol Buffers, organizations can minimize data footprint and optimize
resource utilization.
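The sketch below illustrates the point by writing the same small DataFrame once as an uncompressed CSV and once as a snappy-compressed, columnar Parquet file and comparing the file sizes; it assumes pyarrow is installed, and the exact savings depend on the data.

# Sketch: comparing uncompressed CSV with snappy-compressed Parquet.
import os
import pandas as pd

df = pd.DataFrame({"id": range(100_000),
                   "category": ["A", "B", "C", "D"] * 25_000})

df.to_csv("data.csv", index=False)
df.to_parquet("data.parquet", compression="snappy")   # columnar + compressed

print("csv bytes:    ", os.path.getsize("data.csv"))
print("parquet bytes:", os.path.getsize("data.parquet"))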
Analyzing and Visualizing Data
Best Practices for Data Analysis and Visualization:
To optimize data analysis and visualization workflows, organizations should adhere to several best
practices, including:
1. Audience-Centric Design: Designing visualizations with the end-user in mind ensures that
insights are communicated effectively and resonate with the intended audience. Organizations
should consider factors such as audience demographics, preferences, and prior knowledge when
designing visualizations.
2. Iterative Exploration: Data analysis and visualization are iterative processes that require
continuous exploration and refinement. Organizations should encourage a culture of
experimentation and iteration, where insights are refined based on feedback, new data, and
evolving analysis objectives.
CASE STUDY:
PROBLEM STATEMENT:
SOLUTION:
The solution was developed through a deep analysis of the problem statement and by drawing
some quick insights.
• Phoenix Corporation undertook a holistic transformation of its data engineering
infrastructure by harnessing the versatile capabilities of AWS services. The cornerstone
of this initiative, termed PhoenixData, integrated a suite of AWS solutions including
AWS Glue, Amazon S3, Amazon Redshift, AWS Lambda, and Amazon EMR. This
amalgamated approach facilitated a seamless orchestration of data workflows
encompassing ingestion, processing, and analytics, all while operating at
unprecedented scales.
• Subsequently, Amazon S3 emerged as the bedrock for scalable and durable storage,
accommodating the burgeoning volumes of data ingested through AWS Glue. Its
object storage architecture provided a reliable repository for housing raw and
processed data, facilitating efficient data management and accessibility across the
organization.
• Furthermore, Amazon EMR served as a cornerstone for big data processing and
analytics, providing a managed Hadoop framework that facilitated the seamless
processing of large datasets. By leveraging EMR's elastic scalability and support for a
wide array of data processing frameworks, Phoenix Corporation could efficiently
execute complex data processing tasks, ranging from batch processing to real-time
analysis.
DIAGRAM:
CONCLUSION: