AWS Notes-1
Cloud computing refers to the delivery of computing services, including servers, storage,
networking, databases, analytics, software, and intelligence, over the internet to offer faster
innovation, flexible resources, and economies of scale. In essence, cloud computing allows
individuals and businesses to access and use computing resources hosted on remote servers
instead of relying on their own local servers or personal computers.
On-Demand Self-Service: Users can provision and manage computing resources, such as
server time and network storage, as needed without requiring human intervention from the
service provider.
Broad Network Access: Cloud services are available over the internet and can be accessed
by users through various devices (e.g., laptops, smartphones, tablets).
Resource Pooling: Cloud providers use multi-tenant models, which means that resources are
pooled and used by multiple customers. The resources are dynamically allocated and
reassigned based on demand.
Rapid Elasticity: Cloud resources can be rapidly and elastically provisioned to quickly scale
up or down based on demand. This allows for flexibility and cost optimization.
Measured Service: Cloud systems automatically control and optimize resource use by
leveraging a metering capability at some level of abstraction. Resources are monitored,
controlled, and reported, providing transparency for both the provider and consumer.
Advantages Of Cloud
Cost-Efficiency: Cloud computing eliminates the need for organizations to invest in and
maintain physical hardware and infrastructure. Instead, they can use cloud services on a pay-
as-you-go or subscription basis, which can lead to significant cost savings. This includes
reduced expenses for hardware, software licenses, maintenance, and energy consumption.
Scalability: Cloud services offer the ability to easily scale resources up or down based on
demand. This allows organizations to quickly adapt to changing workloads, ensuring that
they have the right amount of computing power and storage at any given time. This flexibility
is particularly valuable for businesses with fluctuating resource needs.
Flexibility and Mobility: Cloud computing allows users to access applications and data from
anywhere with an internet connection and on a variety of devices. This provides greater
flexibility for remote work, collaboration among team members, and enables employees to be
productive even when they are not in the office.
Public Cloud: Resources are owned and operated by a third-party cloud service provider.
Available to the general public and multiple organizations.
Examples: Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP).
Private Cloud: Resources are used exclusively by a single organization.
Can be hosted on-premises or by a third-party provider.
Offers greater control, security, and customization.
Commonly used for sensitive data or compliance requirements.
Hybrid Cloud: Combination of public and private clouds, allowing data and applications to
be shared between them.
Provides flexibility to balance performance, security, and cost-efficiency.
Useful for organizations with varying workload demands.
Community Cloud: Shared infrastructure by several organizations with common interests or
requirements.
Built and managed by the organizations themselves or a third-party provider.
Can address specific regulatory, compliance, or security concerns of the community
members.
Types of Cloud: public, private, hybrid, and community (described above).
Maintenance and Upgrades: on-premises, the organization is responsible for all maintenance, updates, and upgrades; in the cloud, the provider handles maintenance and upgrades of the physical infrastructure.
IAM Features:
1. Shared access to your AWS account:
You can grant other people permission to administer and use resources in your AWS account
without having to share your access credentials.
2. Granular permissions:
You can grant different permissions to different people for different resources.
For instance, you can allow some users complete access to EC2, S3, DynamoDB, and Redshift, while for others you can allow read-only access to just some S3 buckets, permission to administer just some EC2 instances, or access to your billing information but nothing else.
3. Secure access to AWS resources for applications that run on Amazon EC2:
You can use IAM features to securely give applications that run on EC2 instances the credentials they need to access other AWS resources, for example S3 buckets and RDS or DynamoDB databases.
4. Identity federation:
You can allow users who already have passwords elsewhere, for example in your corporate network or with an internet identity provider, to get temporary access to your AWS account.
IAM Terms:
Following are the major terms which are used in an IAM account.
1. Principal
2. Request
3. Authentication
4. Authorization
5. Action/Operation
6. Resources
1. Principal:
A principal is a person or application that can make a request for an action or operation on an AWS resource.
Your administrative IAM user is your first principal.
You can allow users and services to assume a role.
IAM users, roles, federated users, and applications are all AWS principals.
2. Request:
When a principal tries to use the AWS Management Console, the AWS API, or the AWS CLI, that principal sends a request to AWS. The request includes the following information:
Actions: that the principal wants to perform.
Resources: upon which the actions are performed.
Principal information: including the environment from which the request was made.
Request context: before AWS can evaluate and authorize a request, AWS gathers the request information into a request context. This includes the principal (the requester), determined from the authentication data, together with the aggregate permissions associated with that principal.
Environment data: such as IP address, user agent, SSL enabled status, or the time of day.
Resource data: data related to the resource that is being requested.
3. Authentication:
A principal sending a request must be authenticated (signed in to AWS) to send a request to AWS.
Some AWS services, like Amazon S3, allow requests from anonymous users; they are the exception to the rule.
To authenticate from the console as a root user, you must sign in with your user name and password.
To authenticate from the API or CLI, you must provide your access key and secret key.
You might also be required to provide additional security information, like MFA (e.g., Google Authenticator).
4. Authorization:
To authorize a request, IAM uses values from the request context to check for matching policies and determine whether to allow or deny the request.
IAM policies are stored in IAM as JSON documents and specify the permissions that are allowed or denied.
5. Actions:
Actions are defined by a service and are the things that you can do to a resource, such as viewing, creating, editing, and deleting that resource.
IAM supports approximately 40 actions for a user resource, including create user, delete user, etc.
Any actions or resources that are not explicitly allowed are denied by default.
After your request has been authenticated and authorized, AWS approves the actions in your
request.
6. Resource:
After AWS approves the actions in your request, those actions are performed on the related resources, such as an EC2 instance, an IAM user, or an S3 bucket.
IAM Identities:
IAM identities are what you create under your AWS account to provide authentication for people, applications, and processes in your AWS account.
An identity represents a user and can be authenticated and then authorized to perform actions in AWS.
Each of these can be associated with one or more policies to determine what actions a user, role, or member of a group can do, with which resources, and under what conditions.
An IAM group is a collection of IAM users.
An IAM role is similar to an IAM user but has no long-term credentials of its own.
A. IAM Users:
An IAM user is an entity that you create in AWS. It represents the person or service that uses the IAM user to interact with AWS.
You can create up to 5 users at a time.
An IAM user can represent an actual person or an application that requires AWS access to perform actions on AWS resources.
A primary use of IAM users is to give people the ability to sign in to the AWS Management Console for interactive tasks and to make programmatic requests to AWS services using the API or CLI.
For any user you can assign:
A username and password to access the AWS console.
An access key ID and secret access key that can be used for programmatic access.
A newly created IAM user has no password and no access key; you need to create them for the user.
Each IAM user is associated with one and only one AWS account.
Users are defined within your account, so users do not pay separately; the bill is paid by the parent account.
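As a rough illustration of these two access paths, here is a minimal boto3 (Python) sketch that creates a user, a console password, and an access key; the user name and password are placeholders, not values from these notes.

    import boto3

    iam = boto3.client("iam")

    # Create a new IAM user (the user name is illustrative).
    iam.create_user(UserName="dev-alice")

    # Console access: set an initial password the user must change at first sign-in.
    iam.create_login_profile(
        UserName="dev-alice",
        Password="TempPassw0rd!",        # placeholder value
        PasswordResetRequired=True,
    )

    # Programmatic access: generate an access key ID and secret access key.
    keys = iam.create_access_key(UserName="dev-alice")
    print(keys["AccessKey"]["AccessKeyId"])
    # The secret access key is returned only once; store it securely.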
B. IAM Groups:
An IAM group is a collection of IAM users. Policies attached to a group apply to all users in the group, which makes it easier to manage permissions for many users at once.
C. IAM Roles:
An IAM role is very similar to a user, in that it is an identity with permission policies that
determine what the identity can and cannot do in AWS.
An IAM role does not have any credentials (password or access key) associated with it.
Instead of being associated with one person, a role is intended to be assumable by anyone who
needs it.
An IAM user can assume a role to temporarily take on different permissions for a specific task.
An IAM role can be assigned to a federated user who signs in by using an external identity provider instead of IAM.
D.IAM Policies:
IAM (Identity and Access Management) policies in the AWS Management Console are a set
of permissions that define what actions are allowed or denied for different AWS resources.
These policies help you control who can do what within your AWS account, whether it's
accessing services, creating or modifying resources, or performing other operations.
IAM policies are written in JSON (JavaScript Object Notation) format and they can be
attached to IAM users, groups, or roles. Policies can be as broad or as specific as needed,
allowing you to grant or restrict access to individual services, actions, or even specific
resources.
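To make the JSON shape concrete, below is a hedged boto3 sketch that creates a small read-only policy and attaches it to a user; the policy name, user name, and bucket ARN are illustrative assumptions.

    import json
    import boto3

    iam = boto3.client("iam")

    # Identity-based policy: read-only access to a single (hypothetical) bucket.
    policy_doc = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-reports-bucket",
                "arn:aws:s3:::example-reports-bucket/*",
            ],
        }],
    }

    # Create a managed policy and attach it to a user.
    resp = iam.create_policy(
        PolicyName="S3ReadOnlyExampleBucket",
        PolicyDocument=json.dumps(policy_doc),
    )
    iam.attach_user_policy(
        UserName="dev-alice",
        PolicyArn=resp["Policy"]["Arn"],
    )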
Difference Between Roles and Policies
Purpose: a role delegates permissions to entities that need to access AWS resources; a policy defines what actions are allowed or denied on which resources.
Ownership: a role is not associated with a specific user or group of users; a policy is associated with specific users, groups, or roles.
Credential Usage: entities assume a role to obtain temporary security credentials; policies are not related to the acquisition of temporary credentials.
Type: a role is a single entity, assumed by other entities for a defined duration; policies are JSON documents that define permissions and can be attached to multiple entities.
To delegate permission to access a resource you create an IAM role that has two policies
attached.
i. The Trust Policy
ii. The Permission Policy
The trusted entity is included in the policy as the principal element in the document.
When you create a trust policy, you cannot specify a wildcard (*) as a principal.
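The two policies can be sketched with boto3 as follows; the role name, policy name, and bucket ARN are made up for illustration, and the trust policy's Principal element names the EC2 service rather than a wildcard.

    import json
    import boto3

    iam = boto3.client("iam")

    # Trust policy: the Principal element names who may assume the role.
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }

    iam.create_role(
        RoleName="ec2-app-role",                      # illustrative name
        AssumeRolePolicyDocument=json.dumps(trust_policy),
    )

    # Permission policy: what the role is allowed to do once assumed.
    iam.put_role_policy(
        RoleName="ec2-app-role",
        PolicyName="allow-s3-read",
        PolicyDocument=json.dumps({
            "Version": "2012-10-17",
            "Statement": [{"Effect": "Allow",
                           "Action": "s3:GetObject",
                           "Resource": "arn:aws:s3:::example-bucket/*"}],
        }),
    )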
Cross Account Permissions:
You might need to allow users from another AWS account to access resources in your AWS account. If so, don't share security credentials, such as access keys, between accounts. Instead, use IAM roles.
You can define a role in the trusted account that specifies what permissions the IAM users in the other account are allowed.
You can also designate which AWS accounts have the IAM users that are allowed to assume the role. We do not define individual users here, rather the AWS account.
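A minimal sketch of the consuming-account side, assuming a role with an illustrative ARN has already been created in the other account: the caller obtains temporary credentials from STS and uses them to build a client.

    import boto3

    # The ARN below is illustrative; the role must trust the caller's AWS account.
    sts = boto3.client("sts")

    resp = sts.assume_role(
        RoleArn="arn:aws:iam::111122223333:role/cross-account-readonly",
        RoleSessionName="audit-session",
    )

    creds = resp["Credentials"]
    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    print(s3.list_buckets()["Buckets"])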
Access Key ID vs. Secret Access Key
Purpose: the access key ID identifies the source of API requests; the secret access key signs API requests for authentication.
Format: the access key ID is a 20-character string of uppercase letters and numbers; the secret access key is a 40-character string of uppercase and lowercase letters and numbers.
Security: the access key ID is sensitive but not as critical as the secret access key; the secret access key is extremely sensitive, must be kept confidential, and should never be shared or exposed.
Instance Type:
The instance type determines the hardware of the host computer used for the instance. Different instance types have varying combinations of CPU, memory, storage, and networking capacity.
Examples of instance types include t2.micro, m5.large, c5.xlarge, etc.
Key Pairs:
When you launch an EC2 instance, you can associate it with a key pair. This is a set of public and
private keys used for secure SSH (Linux) or RDP (Windows) access to the instance.The private key is
kept secure by the user, and the public key is placed on the instance. When you connect to the
instance, you use the private key to authenticate.
Security Groups:
Security groups act as a virtual firewall for the instance, controlling inbound and outbound traffic.
They can be configured to allow or deny specific types of traffic based on rules defined by the user.
Elastic IP Addresses: An Elastic IP address is a static, public IPv4 address that you can allocate to
your AWS account. It can be associated with an EC2 instance, providing a fixed public IP that can be
remapped to different instances.
Storage (EBS Volumes): EC2 instances can be associated with Elastic Block Store (EBS) volumes
for persistent storage. These volumes can be attached and detached from instances, providing a way to
store data independently of the instance's lifecycle.
Instance Metadata: EC2 instances have access to instance metadata, which provides information
about the instance's configuration and environment. This information can be accessed from within the
instance.
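A small sketch of reading instance metadata with IMDSv2 from inside a running instance; the token-then-read flow below uses only the Python standard library and the well-known 169.254.169.254 endpoint.

    import urllib.request

    # IMDSv2: fetch a session token first, then use it to read metadata.
    # This only works from inside a running EC2 instance.
    token_req = urllib.request.Request(
        "http://169.254.169.254/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
    )
    token = urllib.request.urlopen(token_req).read().decode()

    meta_req = urllib.request.Request(
        "http://169.254.169.254/latest/meta-data/instance-id",
        headers={"X-aws-ec2-metadata-token": token},
    )
    print(urllib.request.urlopen(meta_req).read().decode())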
Tags: Tags are key-value pairs that you can assign to EC2 instances (and other AWS resources). They
are useful for organizing and managing your resources, and they can be used for cost allocation and
billing purposes.
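Pulling the pieces above together (AMI, instance type, key pair, security group, tags), here is a hedged boto3 sketch of launching an instance; every ID and name in it is a placeholder.

    import boto3

    ec2 = boto3.client("ec2")

    # Launch a single t2.micro instance; the AMI ID, key pair, and security group
    # are placeholders for resources that would already exist in the account.
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",
        InstanceType="t2.micro",
        KeyName="my-keypair",
        SecurityGroupIds=["sg-0123456789abcdef0"],
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "Name", "Value": "demo-web-server"},
                     {"Key": "CostCenter", "Value": "training"}],
        }],
    )
    print(resp["Instances"][0]["InstanceId"])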
Types of Instance
1. General purpose
2. Compute optimized
3. Memory optimized
4. Storage optimized
5. Accelerated computing or GPU
6. High memory optimized
1. General purpose:
General purpose instances provide a balance of compute, memory and networking resources and
can be used for a variety of workloads.
There are 3 series available in the general purpose family:
a. A series: A1
b. M series: M4, M5, M5a, M5d, M5ad (large)
c. T series: T2 (free tier eligible), T3, T3a
A1 instances are ideally suited for scale-out workloads that are supported by the ARM ecosystem.
These instances are well suited for the following applications:
1. Web servers
2. Containerized microservices
3. Caching fleets
4. Distributed data stores
5. Applications that require the ARM instruction set
M4 instance:
The M4 instances feature a custom Intel Xeon E5-2676 v3 (Haswell) processor optimized specifically for EC2.
vCPU- 2 to 40 (max)
RAM- 8GB to 160GB(max)
Instance storage: EBS only (root volume storage)
M5, M5a, M5d and M5ad instances:
These instances provide an ideal cloud infrastructure, offering a balance of compute, memory, and networking resources for a broad range of applications.
Used in: gaming servers, web servers, small and medium databases
vCPU- 2 to 96(max)
RAM- 8 to 384(max)
Instance storage- EBS and NVMe SSD
c. T series: T2, T3, T3a instances:
These instances provide a baseline level of CPU performance with the ability to burst to a higher level when required by your workload. T Unlimited instances can sustain high CPU performance for any period of time whenever required.
vCPU- 2 to 8
RAM- 0.5 to 32 GB
Used for:
i. Website and web app
ii. Code repositories
iii. Development, build, test
iv. Micro services
2. Compute optimized:
Compute optimized instances are ideal for compute-bound applications that benefit from high-performance processors.
C Series: three types are available: C4, C5, C5n [C3 is the previous generation]
C4:
C4 instances are optimized for compute-intensive workloads and deliver very cost-effective high performance at a low price-per-compute ratio.
vCPU- 2 to 36
RAM- 3.75 to 60GB
Storage- EBS only
Network BW- 10 Gbps
Usecase: web server, batch processing, MMO gaming, Video encoding
3. Memory Optimized:
Memory optimized instances are designed to deliver fast performance for workloads that process large data sets in memory.
There are 3 series available:
R series, X series, Z series
A. R Series:
R4, R5, R5a, R5d, R5ad
High performance relational (MySQL) and NoSQL (MongoDB, Cassandra) databases.
Distributed web-scale cache stores that provide in-memory caching of key-value type data.
Used in financial services, Hadoop
vCPU- 2 to 96
RAM- 16 to 768GB
Instance storage- EBS and NVMe SSD
B. X Series:
X1, X1e instances:
Well suited for high performance databases, memory-intensive enterprise applications, relational database workloads, and SAP HANA.
Electronic design automation
vCPU- 4 to 128
RAM- 122 to 3904GB ,Instance storage- SSD
C. Z1d instance:
High frequency Z1d delivers a sustained all core frequency of up to 4.0 GHz, the fastest of any
cloud instances.
AWS Nitro System, Xeon processor, up to 1.8 TB of instance storage.
vCPU- 2 to 48
RAM- 16 to 384 GB
Storage- NVMe SSD
Use case: electronic design automation and certain database workloads with high per-core
licensing cost.
4. Storage optimized:
Storage optimized instances are designed for workloads that require high, sequential read and write access to very large data sets on local storage. They are optimized to deliver tens of thousands of low-latency, random I/O operations per second (IOPS) to applications.
It is of three types:
A. D series- D2 instance
B. H series- H1 instance
C. I series- I3 and I3en instance
A. D2 instance:
Massively parallel processing (MPP) data warehouses.
MapReduce and Hadoop distributed computing.
Log or data processing apps
vCPU- 4 to 36
RAM- 30.5 to 244GB
Storage- HDD
B. H series- H1 instance:
This family features up to 16TB of HDD-based local storage, high disk throughput, and a balance of compute and memory.
Well suited for app requiring sequential access to large amounts of data on direct attached
instance storage.
Application that requires high throughput access to large quantities of data.
vCPU- 8 to 64
RAM- 32 to 256GB
Storage- HDD
C. I series- I3 and I3en instances:
NoSQL databases
Distributed file system
Data warehousing application
vCPU- 2 to 96
RAM- 16 to 768GB
Local storage- NVMe SSD
Networking performance- 25 Gbps to 100 Gbps
Sequential throughput: Read- 16GBps Write- 6.4 GBps (I3)
5. Accelerated computing or GPU:
a. F1 instance:
F1 instances offer customizable hardware acceleration with field programmable gate arrays (FPGAs).
Each FPGA contains 2.5 million logic elements and 6,800 DSP (digital signal processing) engines.
Designed to accelerate computationally intensive algorithms such as data flow or highly
parallel operations.
Used in- genomics research, financial analytics, real time video processing and big data
search.
b. P2 and P3 Instance:
It uses NVIDIA Tesla GPUs.
Provide high bandwidth networking.
Up to 32GB of memory per GPU, which makes them ideal for deep learning and computational fluid dynamics.
Used in- machine learning, databases, seismic analysis, genomics, molecular modeling, AI,
deep learning
c. G2 and G3 instances:
Optimized for graphics intensive application.
Well suited for app like 3D visualization.
G3 instances use NVIDIA Tesla M60 GPU and provide a cost effective, high performance
platform for graphics applications.
Used in: video creation service, 3D visualization, streaming, graphic intensive application
6. High memory optimized:
High memory instances are purpose built to run large in-memory databases, including production deployments of SAP HANA in the cloud.
It has only one series, i.e., the U series.
Features:
Latest generation Intel Xeon Platinum 8176M processor.
6, 9, 12 TB of instance memory, the largest of any EC2 instance.
Powered by the AWS Nitro System, a combination of dedicated hardware and a lightweight hypervisor.
Bare metal performance with direct access to host hardware.
EBS-optimized by default at no additional cost.
Model numbers- u-6tb1.metal, u-9tb1.metal, u-12tb1.metal
Network performance- 25 Gbps
Dedicated EBS bandwidth- 14 Gbps
EC2 PURCHASING OPTIONS:
There are 6 purchasing options available for AWS EC2 instances, but there are 3 ways to pay for an Amazon EC2 instance, i.e., On-Demand, Reserved Instances, and Spot Instances.
You can also pay for Dedicated Hosts, which provide you with EC2 instance capacity on physical servers dedicated for your use.
1. On-Demand
2. Dedicated instance
3. Dedicated Host
4. Spot instance
5. Scheduled instance
6. Reserved instance
1. On-Demand Instance:
AWS On-Demand instances are virtual servers that run in AWS EC2 or AWS Relational Database Service (RDS) and are purchased at a fixed rate per hour.
AWS recommends using On-Demand instances for applications with short-term, irregular workloads that cannot be interrupted.
They are also suitable for use during testing and development of applications on EC2.
With On-Demand instances you only pay for the EC2 instances you use.
The use of On-Demand instances frees you from the cost and complexities of planning, purchasing, and maintaining hardware, and transforms what are commonly large fixed costs into much smaller variable costs.
Pricing is per instance hour consumed for each instance, from the time an instance is launched until it is terminated or stopped.
Each partial instance hour consumed is billed per second for Linux instances and as a full hour for all other instance types.
2. Dedicated Instance:
Dedicated instances run in a VPC on hardware that is dedicated to a single customer.
Your dedicated instances are physically isolated at the host hardware level from instances that belong to other AWS accounts.
Dedicated instances may share hardware with other instances from the same account that are not dedicated instances.
Pay for dedicated instances on demand, save up to 70% by purchasing reserved instances, or save up to 90% by purchasing spot instances.
3. Dedicated Host:
An Amazon EC2 dedicated host is a physical server with EC2 instance capacity fully dedicated to your use.
Dedicated hosts can help you address compliance requirements and reduce costs by allowing you to use your existing server-bound software licenses.
Pay for a physical host that is fully dedicated to running your instances and bring your existing per-socket, per-core, or per-VM software licenses to reduce costs.
Dedicated hosts give you additional visibility and control over how instances are placed on a physical server, and you can consistently deploy your instances to the same server over time.
As a result, dedicated hosts enable you to use your existing server-bound software licenses and address corporate compliance and regulatory requirements.
Instances that run on a dedicated host are the same virtualized instances that you get with traditional EC2 instances that use the Xen hypervisor.
Each dedicated host supports a single instance size and type (e.g., c3.xlarge).
Only BYOL, Amazon Linux, and AWS Marketplace AMIs can be launched onto dedicated hosts.
4. Spot Instances:
Amazon EC2 Spot instances let you take advantage of unused EC2 capacity in the AWS cloud.
Spot instances are available at up to a 90% discount compared to On-Demand prices.
You can use Spot instances for various test and development workloads.
You also have the option to hibernate, stop, or terminate your Spot instances when EC2 reclaims the capacity back with two minutes of notice.
Spot instances are spare EC2 capacity that can save you up to 90% off On-Demand prices, and that AWS can interrupt with a 2-minute notification. Spot uses the same underlying EC2 instances as On-Demand and Reserved instances, and is best suited for flexible workloads.
You can request Spot instances up to your Spot limit for each region.
You can determine the status of your Spot request via the Spot request status code and message.
You can access Spot request status information on the Spot Instances page of the EC2 console in the AWS Management Console.
In the case of hibernate, your instance gets hibernated and the RAM data is persisted. In the case of stop, your instance gets shut down and the RAM is cleared.
With hibernate, Spot instances pause and resume around any interruptions so your workloads can pick up exactly where they left off.
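One way to request Spot capacity is through run_instances with InstanceMarketOptions, sketched below with placeholder IDs; this is an illustrative example, not a prescribed workflow.

    import boto3

    ec2 = boto3.client("ec2")

    # Request a one-time Spot instance; AMI ID and instance type are placeholders.
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",
        InstanceType="c5.large",
        MinCount=1,
        MaxCount=1,
        InstanceMarketOptions={
            "MarketType": "spot",
            "SpotOptions": {
                "SpotInstanceType": "one-time",
                "InstanceInterruptionBehavior": "terminate",
            },
        },
    )
    print(resp["Instances"][0]["InstanceId"])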
5. Scheduled Instance:
Scheduled Reserved Instances enable you to purchase capacity reservations that recur on a daily, weekly, or monthly basis, with a specified start time and duration, for a one-year term.
You reserve the capacity in advance so that you know it is available when you need it.
You pay for the time that the instances are scheduled even if you do not use them.
Scheduled instances are a good choice for workloads that do not run continuously but do run on a regular schedule.
Purchase instances that are always available on the specified recurring schedule for a one-year term.
For example, you can use scheduled instances for an application that runs during business hours or for batch processing that runs at the end of the week.
6. Reserved Instances:
Amazon EC2 RIs provide a significant discount, up to 75% compared to On-Demand pricing, and provide a capacity reservation when used in a specific availability zone.
Reserved instances give you the option to reserve a DB instance for a one- or three-year term and in turn receive a significant discount compared to the On-Demand instance pricing for the DB instance.
There are 3 types of RI available:
a. Standard RI: these provide the most significant discount, up to 75% off On-Demand, and are best suited for steady-state usage.
b. Convertible RI: these provide a discount of up to 54% and the capability to change the attributes of the RI, as long as the exchange results in the creation of reserved instances of greater or equal value.
c. Scheduled RI: these are available to launch within the time window you reserve.
There are two types of block storage devices available for EC2.
1. Elastic Block Store (persistent, network-attached virtual drive)
2. Instance Store-Backed EC2:
Basically the virtual hard drive on the host allocated to this EC2 instance.
Limited to 10GB per device.
Ephemeral storage (non-persistent storage).
The EC2 instance can't be stopped; it can only be rebooted or terminated. Terminating will delete the data.
EBS volumes behave like raw, unformatted, external block storage devices that you can attach to your EC2 instances.
EBS volumes are block storage devices suitable for database-style data that requires frequent reads and writes.
EBS volumes are attached to your EC2 instances through the AWS network, like virtual hard drives.
An EBS volume can be attached to only a single EC2 instance at a time.
Both the EBS volume and the EC2 instance must be in the same AZ.
EBS volume data is replicated by AWS across multiple servers in the same AZ to prevent data loss resulting from any single AWS component failure.
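A short boto3 sketch of creating a gp3 volume and attaching it to an instance in the same AZ; the AZ, size, and instance ID are assumptions for illustration.

    import boto3

    ec2 = boto3.client("ec2")

    # Create a 20 GiB gp3 volume in the same AZ as the target instance.
    vol = ec2.create_volume(
        AvailabilityZone="us-east-1a",
        Size=20,
        VolumeType="gp3",
    )
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])

    # Attach it to an instance as an additional block device.
    ec2.attach_volume(
        VolumeId=vol["VolumeId"],
        InstanceId="i-0123456789abcdef0",
        Device="/dev/sdf",
    )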
gp3 volumes provide single-digit millisecond latency and 99.8 percent to 99.9 percent
volume durability with an annual failure rate (AFR) no higher than 0.2 percent, which
translates to a maximum of two volume failures per 1,000 running volumes over a one-year
period. AWS designs gp3 volumes to deliver their provisioned performance 99 percent of the
time.
3-Magnetic Standard:
Lowest cost per GB of all EBS volume types that is bootable.
Magnetic (standard) volumes are previous generation volumes that are backed by magnetic
drives. They are suited for workloads with small datasets where data is accessed infrequently
and performance is not of primary importance. These volumes deliver approximately 100
IOPS on average, with burst capability of up to hundreds of IOPS, and they can range in size
from 1 GiB to 1 TiB.
Magnetic volumes are ideal for workloads where data is accessed infrequently and
applications where the lowest storage cost is important.
Price: $0.05/GB/month
Volume size: 1GB to 1TB
Max IOPS/volume: 40-200
Snapshots
In the context of AWS (Amazon Web Services) and EC2 (Elastic Compute Cloud), a snapshot
is a point-in-time copy of an Amazon Machine Image (AMI) or the data on an Amazon
Elastic Block Store (EBS) volume.
Snapshots are incremental, which means that only the blocks on the device that have changed
after the last snapshot are saved. This helps to reduce storage costs and minimize the time it
takes to create the snapshot.
Snapshots are typically stored in Amazon S3 (Simple Storage Service) and can be used to
create new volumes or restore existing ones. They are an important part of creating reliable
and durable backups in AWS.
Keep in mind that while snapshots are a crucial component of backup and disaster recovery
strategies, they do not replace the need for regular data backup and retention policies. It's
important to have a comprehensive backup strategy that includes both snapshots and regular
backups.
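A minimal boto3 sketch of taking a snapshot and restoring it into a new volume; the volume ID, AZ, and tags are placeholders.

    import boto3

    ec2 = boto3.client("ec2")

    # Take a point-in-time, incremental snapshot of an EBS volume.
    snap = ec2.create_snapshot(
        VolumeId="vol-0123456789abcdef0",
        Description="nightly backup of data volume",
        TagSpecifications=[{
            "ResourceType": "snapshot",
            "Tags": [{"Key": "Backup", "Value": "nightly"}],
        }],
    )
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

    # Restore by creating a new volume from the snapshot.
    ec2.create_volume(
        AvailabilityZone="us-east-1a",
        SnapshotId=snap["SnapshotId"],
        VolumeType="gp3",
    )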
Advantages of Snapshot:
Data Backup and Recovery: Snapshots provide a reliable way to back up and restore
your data.
Incremental Backups: They save only the changed data, reducing storage costs.
Cost-Effective: You pay only for the changed data blocks.
Efficient Disk Management: Easily manage and restore to specific states.
Disaster Recovery: Quickly recover from failures by creating new volumes or
instances from snapshots.
Cloning and Replication: Duplicate instances or volumes easily.
Migration and Data Transfer: Move data between AWS regions or Availability Zones.
Customized Environments (AMI Snapshots): Create custom machine images with
specific software and configurations.
Version Control: Keep track of changes and roll back if needed.
Security and Compliance: Snapshots can be encrypted for added security.
Tagging and Organization: Add tags to easily manage and track your backups.
Flexibility: Allows for various operations like creating, copying, sharing, and deleting
snapshots based on your needs.
2. AMI Snapshots:
An Amazon Machine Image (AMI) is a pre-configured virtual machine image, which
includes an operating system and any additional software or configurations you've installed.
It serves as the basis for launching EC2 instances.
Why Use AMI Snapshots:
Instance Replication: AMI snapshots allow you to replicate your EC2 instances, ensuring that
you can launch new instances with the same configurations as the original.
Customized Environments: You can create custom AMIs with specific software,
configurations, and data pre-installed. This allows you to easily deploy instances with a
known and consistent environment.
How AMI Snapshots Work:
Creating an AMI from an Instance:You can create an AMI from a running or stopped EC2
instance. This process involves specifying the source instance and then AWS takes a snapshot
of the root volume to create the image.
Launching an Instance from an AMI:When you launch an instance from an AMI, AWS
creates a new EBS volume from the snapshot and attaches it to the new instance. This means
that any data or changes made to the instance after the AMI snapshot was taken will not be
included in the new instance.
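The create-then-launch flow can be sketched with boto3 as below; the instance ID, image name, and instance type are illustrative.

    import boto3

    ec2 = boto3.client("ec2")

    # Create an AMI from an existing instance (the instance ID is a placeholder).
    image = ec2.create_image(
        InstanceId="i-0123456789abcdef0",
        Name="web-server-golden-image-v1",
        Description="Base image with web stack pre-installed",
        NoReboot=True,   # snapshot without stopping the instance (may be less consistent)
    )
    ec2.get_waiter("image_available").wait(ImageIds=[image["ImageId"]])

    # Launch a new, identical instance from the AMI.
    ec2.run_instances(
        ImageId=image["ImageId"],
        InstanceType="t2.micro",
        MinCount=1,
        MaxCount=1,
    )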
(Network & Security) Security Group :
In Amazon Web Services (AWS), a security group is a fundamental component of the
networking and security model for Amazon Elastic Compute Cloud (EC2) instances. It acts
as a virtual firewall that controls inbound and outbound traffic for one or more EC2
instances.
Security groups play a crucial role in controlling network access to your EC2 instances and
are an important part of your overall security posture in AWS. It's recommended to follow the
principle of least privilege when configuring security group rules, meaning that you should
only allow the minimum necessary access to accomplish a specific task.
Here are some key points about security groups:
Stateful Filtering: Security groups are stateful, meaning if you allow inbound traffic from a
specific IP address, the corresponding outbound traffic is automatically allowed, regardless of
outbound rules. This simplifies the process of managing network access.
Rule-Based: Security groups operate based on rules. Each rule specifies a type of traffic (e.g.,
HTTP, SSH), a protocol (TCP, UDP, ICMP), and a range of allowable IP addresses (CIDR
blocks) or specific IP addresses.
Allow Rules: Allow rules permit specific types of inbound or outbound traffic. For example,
you might have an allow rule that allows inbound traffic on port 80 (HTTP) to your web
server.
Deny Rules: Deny rules are not used in security groups. Instead, if a particular type of traffic
is not explicitly allowed, it is implicitly denied.
Bound to Instances: Security groups are associated with EC2 instances. When you launch an
instance, you can specify one or more security groups to be associated with it. You can also
modify the security groups associated with an existing instance.
Multiple Security Groups per Instance: An instance can be associated with multiple security
groups. The rules from all associated security groups are effectively combined.
Priority of Rules: If a traffic type is allowed by one security group but denied by another, the
"allow" rule takes precedence.
Basic Rules for Defining Security Group:
Rules for a security group in AWS EC2 define the type of inbound and outbound traffic that
is allowed or denied. Each rule specifies the following:
Type: This defines the type of traffic, such as SSH (for remote access via Secure Shell), HTTP,
HTTPS, etc.
Protocol: This specifies the network protocol used for the rule, which can be TCP, UDP, or
ICMP.
Port Range: For TCP and UDP, this indicates the range of ports that the rule applies to. For
example, for HTTP, you would use port 80.
Source/Destination: This is the source of inbound traffic or the destination of outbound
traffic. It can be specified as an IP address, an IP range (in CIDR notation), or another
security group.
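Here is a hedged boto3 sketch that creates a security group and adds two allow rules of this shape; the VPC ID and CIDR ranges are placeholders.

    import boto3

    ec2 = boto3.client("ec2")

    # Create a security group and allow inbound SSH (22) and HTTP (80).
    sg = ec2.create_security_group(
        GroupName="web-server-sg",
        Description="Allow SSH from the office and HTTP from anywhere",
        VpcId="vpc-0123456789abcdef0",
    )
    ec2.authorize_security_group_ingress(
        GroupId=sg["GroupId"],
        IpPermissions=[
            {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
             "IpRanges": [{"CidrIp": "203.0.113.0/24", "Description": "office"}]},
            {"IpProtocol": "tcp", "FromPort": 80, "ToPort": 80,
             "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "public HTTP"}]},
        ],
    )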
(Network & Security) Placement Groups
A placement group in AWS is like a special area where you can put your computer servers
(EC2 instances). Depending on the type of placement group, the servers will be arranged in a
way that helps them work together better.
Placement groups are used to influence the placement of instances to meet the needs of your
workload.
Cluster Placement Group: If you want your servers to talk to each other very quickly, this is
like putting them on the same team in a sports game. They'll be placed really close together to
reduce the time it takes for them to communicate.
Instance Types: Instances in a cluster placement group must be of the same instance type.
Availability Zone: All instances in a cluster placement group must be in the same Availability
Zone.
Limitations: There is a limit on the number of instances you can launch in a cluster placement
group, and you cannot move an existing instance into a cluster placement group.
Partition Placement Group: Servers are split into separate groups (partitions) that do not share the same underlying hardware, so a hardware failure affects only one partition.
Instance Types: Instances in a partition placement group can be of different instance types.
Availability Zone: Instances in a partition placement group can span multiple Availability
Zones in a region.
Limitations: There are limits on the number of partitions and instances you can have in a
partition placement group.
Spread Placement Group: Imagine each server is given its own special spot in a big room. This helps make sure that if something goes wrong with one server, it won't affect the others.
Instance Types: Instances in a spread placement group can be of different instance types.
Availability Zone: Instances in a spread placement group are placed on distinct hardware in a
single Availability Zone.
Limitations: There are limits on the number of instances you can have in a spread placement
group.
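A small boto3 sketch of creating a cluster placement group and launching an instance into it; the group name, AMI ID, and instance type are assumptions.

    import boto3

    ec2 = boto3.client("ec2")

    # Create a cluster placement group for low-latency, high-throughput networking.
    ec2.create_placement_group(GroupName="low-latency-group", Strategy="cluster")

    # Launch an instance into the placement group (the AMI ID is a placeholder).
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",
        InstanceType="c5.xlarge",
        MinCount=1,
        MaxCount=1,
        Placement={"GroupName": "low-latency-group"},
    )

    # Other strategies: Strategy="partition" or Strategy="spread" for placement
    # on distinct hardware.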
(Network & Security) Network Interface
A network interface is a virtual network interface that you can attach to an EC2 instance. It acts as a networking component for an EC2 instance, allowing it to communicate with other resources within a Virtual Private Cloud (VPC) or over the internet.
A network interface, in simple words, is like a virtual plug that allows a computer or server to
connect to a network. It's a way for your device to communicate with other devices, like other
computers, servers, or the internet.
Imagine your computer as a house with many rooms. Each room has a different door. The
network interface is like a special door that connects your house to the outside world. It lets
you send and receive information over a network, which could be a local network in your
home or a global network like the internet.
Security: Network interfaces can be associated with security groups and Network Access
Control Lists (NACLs), which allow you to control inbound and outbound traffic to and from
the instance.
Elastic Load Balancing: You can attach multiple network interfaces to an instance and
associate them with different subnets and security groups to distribute traffic using an Elastic
Load Balancer. Virtual Private Cloud (VPC): Network interfaces play a crucial role in
enabling communication between instances in a VPC, as well as allowing instances to
connect to the internet or other AWS services.
Elastic IP Addresses: You can associate Elastic IP addresses with a specific network interface,
allowing you to have a consistent public IP address even if you stop and start the associated
instance.
Multiple IP Addresses: You can assign multiple IP addresses to a single network interface,
which is useful in scenarios where an instance needs to have multiple IP addresses.
It's important to note that when you terminate an EC2 instance, all the associated network
interfaces are also terminated by default. However, you have the option to detach a network
interface from an instance, which keeps it alive even after the instance termination.
Overall, network interfaces in AWS EC2 provide flexibility and control over the networking
capabilities of your instances, allowing you to design and configure your network to suit your
specific requirements.
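A minimal boto3 sketch of creating a secondary network interface and attaching it to an instance; the subnet, security group, and instance IDs are placeholders.

    import boto3

    ec2 = boto3.client("ec2")

    # Create a secondary network interface in a subnet.
    eni = ec2.create_network_interface(
        SubnetId="subnet-0123456789abcdef0",
        Groups=["sg-0123456789abcdef0"],
        Description="secondary interface for admin traffic",
    )

    # Attach it to an existing instance.
    ec2.attach_network_interface(
        NetworkInterfaceId=eni["NetworkInterface"]["NetworkInterfaceId"],
        InstanceId="i-0123456789abcdef0",
        DeviceIndex=1,   # eth1; index 0 is the primary interface
    )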
(Load Balancing)Load Balancer
A load balancer in AWS (Amazon Web Services) EC2 is a service that helps distribute
incoming network traffic across multiple EC2 instances. This helps improve the
availability and fault tolerance of your application by ensuring that no single instance
becomes overwhelmed with too much traffic.
Classic Load Balancer (CLB): This is the original AWS load balancer that provides basic
load balancing across multiple EC2 instances.
The Classic Load Balancer is designed to work across multiple Availability Zones (AZs) to
increase fault tolerance. You can distribute your EC2 instances across different AZs, and the
load balancer will route traffic to healthy instances in each zone.
The load balancer regularly checks the health of registered instances. If an instance is
determined to be unhealthy, the load balancer stops sending traffic to it.
Application Load Balancer (ALB):
ALB operates at the application layer (Layer 7) and is ideal for routing HTTP/HTTPS traffic.
It can route requests based on content of the request like URL path or hostname, making it
suitable for modern web applications.
Application Load Balancers provide advanced routing and visibility features targeted at
application architectures, including microservices and containers.
Network Load Balancer (NLB):
Operating at the connection level (Layer 4), Network Load Balancers are capable of handling millions of requests per second securely while maintaining ultra-low latencies.
Gateway Load Balancer (GWLB) is a service provided by AWS that allows you to deploy,
scale, and manage third-party virtual appliances like firewalls, intrusion detection systems,
and other network security and monitoring tools in the cloud.
Gateway Load Balancer endpoints are also used by services such as AWS Network Firewall to help protect your VPC resources.
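As a concrete example of the Application Load Balancer pieces (load balancer, target group, listener), here is a hedged boto3 sketch; all names, subnets, and IDs are illustrative.

    import boto3

    elbv2 = boto3.client("elbv2")

    # Create an internet-facing Application Load Balancer across two subnets.
    lb = elbv2.create_load_balancer(
        Name="demo-alb",
        Subnets=["subnet-aaaa1111", "subnet-bbbb2222"],
        SecurityGroups=["sg-0123456789abcdef0"],
        Scheme="internet-facing",
        Type="application",
    )

    # Target group: where healthy EC2 instances will receive traffic.
    tg = elbv2.create_target_group(
        Name="demo-web-targets",
        Protocol="HTTP",
        Port=80,
        VpcId="vpc-0123456789abcdef0",
        TargetType="instance",
        HealthCheckPath="/health",
    )

    # Listener: forward HTTP on port 80 to the target group.
    elbv2.create_listener(
        LoadBalancerArn=lb["LoadBalancers"][0]["LoadBalancerArn"],
        Protocol="HTTP",
        Port=80,
        DefaultActions=[{"Type": "forward",
                         "TargetGroupArn": tg["TargetGroups"][0]["TargetGroupArn"]}],
    )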
Auto Scaling
Creating group of EC2 instances that scale up or down depending on the conditions you set.
Amazon EC2 Auto Scaling helps you ensure that you have the correct number of
Amazon EC2 instances available to handle the load for your application. You create
collections of EC2 instances, called Auto Scaling groups. You can specify the minimum
number of instances in each Auto Scaling group, and Amazon EC2 Auto Scaling ensures
that your group never goes below this size. You can specify the maximum number of
instances in each Auto Scaling group, and Amazon EC2 Auto Scaling ensures that your
group never goes above this size. If you specify the desired capacity, either when you
create the group or at any time thereafter, Amazon EC2 Auto Scaling ensures that your
group has this many instances. If you specify scaling policies, then Amazon EC2 Auto
Scaling can launch or terminate instances as demand on your application increases or
decreases.
Example:
For example, the following Auto Scaling group has a minimum size of one instance, a
desired capacity of two instances, and a maximum size of four instances. The scaling
policies that you define adjust the number of instances, within your minimum and
maximum number of instances, based on the criteria that you specify.
Enable elasticity by scaling horizontally through adding or terminating EC2 instances.
Auto Scaling ensures that you have the right number of AWS EC2 instances for your needs at all times.
Auto Scaling helps you save cost by cutting down the number of EC2 instances when they are not needed and scaling out to add more instances only when required.
Auto Scaling Components:
1-Auto Scaling Groups:
An Auto Scaling group contains a collection of EC2 instances that are treated as a
logical grouping for the purposes of automatic scaling and management
The Auto Scaling group continues to maintain a fixed number of instances even if an
instance becomes unhealthy. If an instance becomes unhealthy, the group terminates
the unhealthy instance and launches another instance to replace it.
When creating an Auto Scaling group, you can choose whether to launch On-Demand Instances, Spot Instances, or both. You can specify multiple purchase options for your Auto Scaling group only when you use a launch template.
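A hedged boto3 sketch of the minimum/desired/maximum idea from the example above (1/2/4), assuming a launch template named web-template already exists; the target-tracking policy at the end previews the scaling policies described next.

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Create an Auto Scaling group from an existing launch template
    # (the template name and subnet IDs are placeholders).
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="web-asg",
        LaunchTemplate={"LaunchTemplateName": "web-template", "Version": "$Latest"},
        MinSize=1,
        DesiredCapacity=2,
        MaxSize=4,
        VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",
    )

    # Target-tracking scaling policy: keep average CPU around 50%.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="web-asg",
        PolicyName="cpu-target-50",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"},
            "TargetValue": 50.0,
        },
    )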
2-Scaling Policies:
Scale manually
Manual scaling is the most basic way to scale your resources, where you specify only the change
in the maximum, minimum, or desired capacity of your Auto Scaling group. Amazon EC2 Auto
Scaling manages the process of creating or terminating instances to maintain the updated
capacity.
Predictive scaling:
Predictive Scaling is a feature within Amazon EC2 Auto Scaling that uses machine learning
to predict your application's future traffic and adjust the capacity of your Auto Scaling groups
accordingly. This helps ensure that you have the right amount of capacity to handle expected
changes in demand.
Cyclical traffic, such as high use of resources during regular business hours and low
use of resources during evenings and weekends
Recurring on-and-off workload patterns, such as batch processing, testing, or periodic
data analysis
Horizontal Scaling (multiple instances) vs. Vertical Scaling (single larger instance):
Load Balancing: required with multiple instances; not required with a single instance.
Management Complexity: may require more management due to multiple instances; simpler management with fewer instances.
Block Storage:
Block storage is suitable for transactional databases, random read/write loads, and structured database storage.
Block storage divides the data to be stored into evenly sized blocks (data chunks); for instance, a file can be split into evenly sized blocks before it is stored.
Data blocks stored in block storage do not contain metadata (date created, date modified, content type, etc.).
Block storage only keeps the address (index number) where the data blocks are stored; it does not care what is in a block, just how to retrieve it when required.
Object Storage:
Object storage stores files as a whole and does not divide them.
In object storage an object is: the file/data itself, its metadata, and an object global unique ID.
The object global unique ID is a unique identifier for the object (it can be the object name itself) and it must be unique so that the object can be retrieved regardless of where its physical storage location is.
Object storage cannot be mounted as a drive.
Examples of object storage solutions are Dropbox, AWS S3, and Facebook.
Availability & Durability: both block storage and object storage offer high availability and durability.
Simple Storage Service(S3)
Amazon S3, or Simple Storage Service, is a widely used object storage service provided by
Amazon Web Services (AWS). It allows you to store and retrieve data over the internet.
S3 is a storage for the internet. It has a simple web service interface for simple storing and
retrieving of any amount of data, anytime from anywhere on the internet.
S3 is object based storage.
S3 has a distributed data store architecture where objects are redundantly stored in multiple
locations. (minimum 3 locations in same region)
A bucket is a flat container of objects.
The maximum size of a single object is 5TB; there is no limit on total bucket capacity.
You cannot create nested buckets.
Bucket ownership is non-transferable.
An S3 bucket is region specific.
You can have up to 100 buckets per account (this limit can be raised on request).
S3 bucket names are globally unique across all AWS regions.
Bucket names cannot be changed after they are created.
If a bucket is deleted, its name becomes available again to you or another account to use.
Bucket names must be at least 3 and no more than 63 characters long.
Bucket names are part of the URL used to access a bucket.
A bucket name must be a series of one or more labels.
Bucket names can contain lowercase letters, numbers, and hyphens, but cannot use uppercase letters.
A bucket name must not be formatted as an IP address.
Each label must start and end with a lowercase letter or a number.
By default, buckets and their objects are private, and only the owner can access the bucket.
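A short boto3 sketch of creating a bucket and storing/retrieving an object; the bucket name and region are illustrative, and the name may already be taken since bucket names are globally unique.

    import boto3

    s3 = boto3.client("s3", region_name="eu-west-1")

    # Create a bucket in a specific region (outside us-east-1 the
    # CreateBucketConfiguration is required).
    s3.create_bucket(
        Bucket="example-notes-bucket-2024",
        CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
    )

    # Upload and download an object.
    s3.put_object(Bucket="example-notes-bucket-2024", Key="notes/aws.txt",
                  Body=b"hello from S3")
    obj = s3.get_object(Bucket="example-notes-bucket-2024", Key="notes/aws.txt")
    print(obj["Body"].read())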
S3 Storage Classes:
1. Amazon S3 Standard:
S3 Standard offers high durability, availability, and performance object storage for frequently accessed data.
Durability is 99.999999999%.
Designed for 99.99% availability over a given year.
Supports SSL for data in transit and encryption of data at rest.
The storage cost for the object is fairly high, but there is very little charge for accessing the objects.
The largest object that can be uploaded in a single PUT is 5GB.
2. Amazon S3 Standard-IA:
S3 Standard-IA is for data that is accessed less frequently but requires rapid access when needed.
The storage cost is much cheaper than S3 Standard, almost half the price, but you are charged more heavily for accessing your objects.
Durability is 99.999999999%.
Resilient against events that impact an entire AZ.
Availability is 99.9% in a year.
Supports SSL for data in transit and encryption of data at rest.
Data that is deleted from S3-IA within 30 days will be charged for a full 30 days.
Backed with the Amazon S3 service level agreement for availability.
3. Amazon S3 One Zone-IA:
S3 One Zone-IA is for data that is accessed less frequently but requires rapid access when needed.
Data is stored in a single AZ.
Ideal for those who want a lower cost option for IA data.
It is a good choice for storing secondary backup copies of on-premises data or easily re-creatable data.
You can use S3 lifecycle policies.
Durability is 99.999999999%.
Availability is 99.5%.
Because S3 One Zone-IA stores data in a single AZ, data stored in this storage class will be lost in the event of AZ destruction.
4. Amazon S3 Glacier:
S3 Glacier is a secure, durable, low-cost storage class for data archiving.
To keep costs low yet suitable for varying needs, S3 Glacier provides three retrieval options that range from a few minutes to hours.
You can upload objects directly to Glacier or use lifecycle policies.
Durability is 99.999999999%.
Data is resilient in the event of one entire AZ destruction.
Supports SSL for data in transit and encryption of data at rest.
You can retrieve 10GB of your Amazon S3 Glacier data per month for free with the free tier account.
S3 bucket versioning is a feature provided by Amazon S3 that allows you to keep multiple
versions of an object in the same S3 bucket. When versioning is enabled for a bucket, any
new version of an object that is uploaded will not overwrite the existing version. Instead, it
will create a new version of the object.
Bucket versioning is an S3 bucket sub-resource used to protect against accidental object/data deletion or overwrites.
Versioning can also be used for data retention and archiving.
Once you enable versioning on a bucket it cannot be disabled; however, it can be suspended.
When enabled, bucket versioning will protect existing and new objects and maintain their versions as they are updated.
Updating objects refers to PUT, POST, COPY, and DELETE actions on objects.
When versioning is enabled and you try to delete an object, a delete marker is placed on the object.
You can still view the object and the delete marker.
If you reconsider deleting the object, you can delete the delete marker and the object will be available again.
You will be charged the S3 storage cost for all object versions stored.
You can use versioning with S3 lifecycle policies to delete older versions, or you can move them to a cheaper S3 storage class (Glacier).
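A minimal boto3 sketch of enabling versioning and listing object versions; the bucket name and key are placeholders.

    import boto3

    s3 = boto3.client("s3")

    # Enable versioning on an existing bucket (use "Suspended" to pause it).
    s3.put_bucket_versioning(
        Bucket="example-notes-bucket-2024",
        VersioningConfiguration={"Status": "Enabled"},
    )

    # Overwriting the same key now creates a new version instead of replacing it.
    s3.put_object(Bucket="example-notes-bucket-2024", Key="notes/aws.txt", Body=b"v2")

    # List all versions (and any delete markers) of objects under a prefix.
    versions = s3.list_object_versions(Bucket="example-notes-bucket-2024",
                                       Prefix="notes/")
    for v in versions.get("Versions", []):
        print(v["Key"], v["VersionId"], v["IsLatest"])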
An S3 bucket lifecycle rule in AWS defines a set of actions that should be automatically
applied to objects stored in the bucket over time. These actions typically include transitioning
objects to different storage classes or deleting them altogether based on specified criteria.
This helps you manage your storage costs and optimize your data storage strategy.
A typical S3 bucket lifecycle rule consists of the following components:
Transition Actions:
Transition to S3 Standard-IA (Infrequent Access): You can specify a certain number of days
after which objects are automatically moved to the Standard-IA storage class. This class
offers lower storage costs compared to S3 Standard, but with a retrieval fee for accessing the
data.
Transition to S3 One Zone-IA (Infrequent Access): Similar to Standard-IA, but data is stored
in a single Availability Zone, providing a lower-cost option with a slight trade-off in
durability compared to Standard-IA.
Transition to S3 Glacier and Glacier Deep Archive: You can specify a certain number of days
after which objects are automatically moved to these archival storage classes. Glacier
provides even lower storage costs but with a longer retrieval time compared to S3 Standard-
IA.
Expiration Actions:
Expiration: You can set a number of days after which objects are automatically deleted from
the bucket. This is useful for managing data retention policies.
Noncurrent Version Expiration: If versioning is enabled on the bucket, you can set a
specific number of days after which non-current versions of objects are automatically deleted.
Object Size: You can specify minimum and maximum object sizes. For example, you could set up a rule to transition objects larger than a certain size to a different storage class after a specified number of days.
Object Tags: You can optionally specify object tags as a condition for applying the lifecycle
rule. This allows you to target specific objects based on their tags.
Status: You can enable or disable a lifecycle rule to control its current applicability.
Keep in mind that once a lifecycle rule is set up, AWS S3 will automatically manage the
transitions and deletions based on the specified criteria, which can help you optimize your
storage costs and data management processes.
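The rule components above can be expressed in one put_bucket_lifecycle_configuration call, sketched below with illustrative day counts, prefix, and bucket name.

    import boto3

    s3 = boto3.client("s3")

    # Lifecycle rule: move objects under "logs/" to Standard-IA after 30 days,
    # to Glacier after 90 days, and expire them after 365 days.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-notes-bucket-2024",
        LifecycleConfiguration={
            "Rules": [{
                "ID": "archive-then-expire-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
                "NoncurrentVersionExpiration": {"NoncurrentDays": 180},
            }],
        },
    )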
Replication in S3
Use Cases:
Data Redundancy: Replication provides redundancy by storing copies of your data in
multiple locations. This helps protect against data loss due to hardware failures or other
issues.
Disaster Recovery: Cross-Region Replication can be a critical part of a disaster recovery
strategy, ensuring that your data is stored in geographically separate locations.
Compliance: Same-Region Replication can be used to meet compliance requirements that
mandate data be kept within a specific region.
Global Access: Replication can improve access times for users located in different regions,
as they can retrieve objects from a bucket located closer to them.
Remember, while replication is a powerful tool, it is not a substitute for regular backups. It's
recommended to implement a comprehensive backup strategy alongside replication for
critical data.
Access Point:
Amazon S3 access points simplify data access for any AWS service or customer application
that stores data in S3. Access points are named network endpoints that are attached to buckets
that you can use to perform S3 object operations, such as GetObject and PutObject.
An access point in an S3 bucket is a way to manage access to your S3 storage resources. It
allows you to define specific access policies for individual applications or users.
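A brief boto3 sketch of creating a VPC-restricted access point (its advantages are listed next); the account ID, access point name, bucket, and VPC ID are placeholders.

    import boto3

    s3control = boto3.client("s3control")

    # Create an access point for a bucket, restricted to a specific VPC.
    s3control.create_access_point(
        AccountId="111122223333",
        Name="analytics-app-ap",
        Bucket="example-notes-bucket-2024",
        VpcConfiguration={"VpcId": "vpc-0123456789abcdef0"},
    )

    # Applications then address objects through the access point (and its policy)
    # instead of the bucket name directly.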
Here are some key advantages of access points in AWS S3:
Granular Access Control: Access Points allow you to define specific access policies for
individual applications, users, or resources. This enables fine-grained control over who can
access your S3 resources.
Network Isolation: You can associate an Access Point with a specific Virtual Private Cloud
(VPC). This ensures that data transfers stay within the VPC, providing an additional layer of
security.
Enhanced Security: By controlling access through Access Points, you can limit the exposure
of your S3 bucket to only the resources that need it, reducing the risk of unauthorized access.
Private Endpoint: Each Access Point provides a unique DNS name that can be used to
access your S3 bucket. This allows you to keep the access restricted and controlled.
Easier Multi-Tenancy: In scenarios where multiple teams or applications need access to the
same S3 bucket, Access Points make it easier to manage permissions separately for each
entity.
Cost Control with Requester Pays: You can configure an Access Point to enforce "requester
pays", which means the requester (the one making the API request) is responsible for data
transfer and request costs. This can be useful in scenarios where you want to share data but
not the cost.
Simplified Management: Access Points provide a clear and structured way to manage access
to your S3 resources. This can make it easier to understand and control who has access to
your data.
Resource-Level Permissions: Access Points can be associated with specific IAM roles,
allowing you to grant access to specific resources within your S3 bucket.
Scalability: As your applications and teams grow, Access Points can help you scale access
control without having to manage complex bucket policies.
Policy Flexibility: You can use both bucket policies and access point policies to control
access to your S3 resources, providing flexibility in how you define access rules.
Overall, Access Points provide a powerful tool for managing access to your S3 buckets with
increased granularity and security options. They are particularly valuable in scenarios where
you need fine-grained control over access permissions or when dealing with complex, multi-
tenant environments.
Multi-Region Access Points
Multi-Region Access Points in Amazon S3 provide a powerful tool for managing and optimizing access to your S3 data across multiple regions, improving both performance and availability for your applications.
When you create a Multi-Region Access Point, you specify a set of AWS Regions where you
want to store data to be served through that Multi-Region Access Point. You can use S3
Cross-Region Replication (CRR) to synchronize data among buckets in those Regions. You
can then request or write data through the Multi-Region Access Point global endpoint.
Advantages
Faster Access: Automatically routes requests to the nearest AWS region for improved
performance.
Simplified Setup: Provides a single access point, eliminating the need to manage separate
buckets in multiple regions.
Fine-Tuned Routing: Lets you define policies for which regions to use based on your needs.
Better Availability: Redirects requests to a backup region if the primary region is down.
Disaster Recovery: Works with Cross-Region Replication for data backup in separate
regions.
Consistency: Maintains strong read-after-write consistency, even across different regions.
Global Data Distribution: Enables easy distribution of data to serve a global audience.
Centralized Access Control: Inherits access policies from underlying buckets for fine-
grained control.
Cost Efficiency: Reduces manual data management efforts and optimizes costs.
Athena
Amazon Athena is an interactive query service provided by Amazon Web Services (AWS)
that allows you to analyze data stored in Amazon S3 using SQL. It's part of the AWS Big
Data and Analytics suite of services.
Here are some key details about AWS Athena:
Serverless and Managed Service: Athena is a serverless service, which means you don't
need to manage any underlying infrastructure. You don't have to provision or scale clusters.
You simply submit SQL queries, and Athena handles the rest.
Querying Data in Amazon S3: Athena is designed to work with data stored in Amazon S3,
which is AWS's object storage service. You can use it to analyze various types of data like
CSV, JSON, Parquet, ORC, and more.
Schema on Read: Athena uses a schema-on-read approach, meaning it doesn't require you to
define a schema before querying your data. Instead, it infers the schema from the data when
you run a query.
Supports Standard SQL: You can use standard SQL (Structured Query Language) to query
your data in Athena. This makes it accessible to a wide range of users who are already
familiar with SQL.
Integration with AWS Glue: Athena can leverage AWS Glue Data Catalog, which is a
managed metadata catalog that integrates with various AWS services. This makes it easier to
manage metadata and discover and query your data.
Partitioning and Compression: You can optimize your queries by partitioning your data in
S3 and using appropriate compression techniques. This can significantly improve query
performance.
Cost Model: With Athena, you pay only for the queries you run. You are billed based on the
amount of data scanned by your queries. This can be cost-effective for sporadic, ad-hoc
querying, but it's important to manage your data and queries efficiently to control costs.
Result Output and Export: Athena allows you to save query results in a variety of formats
including CSV, JSON, or Parquet. You can also integrate Athena with other AWS services
like Amazon QuickSight for data visualization and analysis.
Workgroup Management: Athena allows you to manage multiple workgroups, each with its
own set of users, queries, and settings. This can help you separate and manage workloads
effectively.
Geospatial Functions: Athena has support for geospatial functions, which allows you to
perform spatial queries on geospatial data.
Integration with Other AWS Services: Athena can be integrated with various AWS services
like AWS Glue, Amazon QuickSight, AWS Lambda, and more, allowing you to build
comprehensive data pipelines and analytical solutions (a minimal query-submission sketch follows this list).
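To make the query flow concrete, here is a minimal boto3 sketch that submits a query, waits for it to finish, and reads the results. The database name, SQL statement, and results bucket are assumptions for illustration only.

    import time
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Placeholder database, table, and output location.
    query_id = athena.start_query_execution(
        QueryString="SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status",
        QueryExecutionContext={"Database": "analytics_db"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
        print(rows[:5])   # first few rows, including the header row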
Types of Tables in Cluster Ecosystem:
In a cluster computing ecosystem, such as Apache Hadoop, there are mainly two types of tables:
Internal (Managed) Tables: Hive owns both the table metadata and the underlying data; dropping the table also removes the data.
External Tables: Hive manages only the metadata, while the data lives in an external location (for example, an HDFS path or Amazon S3); dropping the table leaves the data in place.
Use Cases:
Internal Tables are typically used for intermediate or temporary data, or when you want Hive
to have full control over the data's lifecycle and storage.
External Tables are useful when you want to maintain data independently from Hive, share
data across different systems or clusters, or work with data that's already stored in an external
location.
Both types have their own advantages and use cases, and the choice between them depends
on your specific requirements and workflows.
Metadata Storage: Internal tables: the Hive Metastore stores both the schema and the location information. External tables: the Hive Metastore stores the schema, but the location is external.
Data Location: Internal tables: tightly integrated with Hadoop storage. External tables: the location is external to the cluster.
Data Sharing: Internal tables: typically used for isolated, Hive-centric data. External tables: used for data accessible by multiple systems/clusters.
Data Updates: Internal tables: Hive manages the data and can perform updates. External tables: data must be maintained and updated externally.
DPU(Data Processing Unit)
In Amazon Athena, DPU stands for Data Processing Unit. It is a unit of measure for
the amount of resources consumed by a query when it runs in Athena. DPUs are used
to quantify the processing capacity needed to execute a query on your data.
Each DPU provides a certain amount of CPU, memory, and networking capacity. The
amount of processing power required for a query depends on factors like the
complexity of the query, the volume of data being scanned, and the types of
operations being performed.
When you run a query in Athena, the number of DPUs consumed is determined by the
amount of data scanned by the query. You are billed based on the total amount of data
scanned in your query, rounded up to the nearest megabyte, and the number of DPUs
used.
It's worth noting that DPUs in Athena are specific to the execution of a query and do
not directly correlate to any specific hardware or instance type. They are a logical unit
used for billing purposes. Different types of queries will use different amounts of
DPUs depending on their resource requirements.
Keep in mind that pricing details and DPU values may change over time, so it's a
good idea to check the official Amazon Athena pricing documentation for the most
up-to-date information.
S3 Query Editor Vs Athena Query Editor
Query Language: The S3 Query Editor offers SQL-like querying limited to basic capabilities; Athena supports standard SQL with advanced capabilities (joins, aggregations, etc.).
Data Partitioning: The S3 Query Editor has limited support (may require manual handling); Athena supports partitioning for improved performance and cost efficiency.
Data Compression: The S3 Query Editor has limited support (may require manual handling); Athena supports compression for improved performance and cost efficiency.
Data Catalog Integration: The S3 Query Editor does not have its own data catalog; Athena integrates with the AWS Glue Data Catalog.
Performance Optimization: The S3 Query Editor is less optimized for complex queries and large datasets; Athena is optimized for performance and can handle large datasets and complex queries.
Scalability: The S3 Query Editor has limited scalability for large datasets and complex queries; Athena is designed to handle large volumes of data and complex queries.
Data Security: The S3 Query Editor controls access through AWS Identity and Access Management (IAM) policies; Athena controls access through IAM policies and integrates with AWS Lake Formation for fine-grained access control.
Integration with Other AWS Services: The S3 Query Editor directly interacts with data stored in S3 buckets; Athena integrates with various AWS services for data ingestion, transformation, and visualization.
ETL (Extract, Transform, Load)
ETL stands for Extract, Transform, Load. It is a process used in data warehousing and
data integration to move data from various sources, transform it into a usable format,
and then load it into a target database or data warehouse for analysis and reporting.
ETL processes are commonly used in business intelligence, data warehousing, and
data integration projects to ensure that data is available and in the right format for
analysis and reporting.
Components of ETL
A. Extract
Data Extraction: Retrieving data from source systems, which can include databases, flat files, APIs, or other repositories (for example, Salesforce, data lakes, or a data warehouse).
Change Data Capture (CDC): Identifying and capturing only the data that has changed since the last extraction to optimize efficiency.
B. Transform
Data Cleaning: Removing or correcting errors, inconsistencies, or inaccuracies in the source data.
Data Transformation: Restructuring and converting data into a format suitable for the target system.
Data Enrichment: Enhancing data by adding additional information or attributes.
C. Load
Data Staging: Temporary storage of transformed data before loading it into the target system.
Data Loading: Inserting or updating data in the destination database or data warehouse (an UPSERT operation).
Error Handling: Managing and logging errors that may occur during the loading process.
In more detail, the Transformation and Loading phases typically involve:
B. Transformation Phase
Data Mapping: Creating a mapping between source and target data structures.
Data Cleansing: Identifying and correcting data quality issues.
Data Validation: Ensuring transformed data meets specified quality standards.
C. Loading Phase
Data Staging: Moving transformed data to a staging area for further processing.
Bulk Loading: Efficiently inserting large volumes of data into the target system.
Indexing: Creating indexes to optimize data retrieval in the target database.
Post-Load Verification: Confirming that the data has been loaded successfully.
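As a minimal sketch of the extract-transform-load flow described above, the following PySpark example reads raw CSV files, cleans and enriches them, and writes partitioned Parquet to a curated location. The bucket paths and column names are assumptions, and the same logic could equally be expressed with any of the ETL tools listed next.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("simple-etl").getOrCreate()

    # Extract: read raw CSV files from a placeholder S3 location.
    orders = spark.read.option("header", True).csv("s3://example-raw-bucket/orders/")

    # Transform: clean, convert types, and enrich.
    cleaned = (
        orders
        .dropDuplicates(["order_id"])                          # data cleaning
        .withColumn("order_ts", F.to_timestamp("order_ts"))    # data transformation
        .withColumn("order_year", F.year("order_ts"))          # data enrichment
        .filter(F.col("amount").isNotNull())                   # data validation
    )

    # Load: write the transformed data as partitioned Parquet to the target area.
    cleaned.write.mode("overwrite").partitionBy("order_year").parquet(
        "s3://example-curated-bucket/orders/")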
Cloud Based ETL Tools:
AWS Glue: AWS Glue is a fully managed ETL service provided by Amazon Web Services. It
automatically discovers, catalogs, and transforms your data, making it easier to prepare and
load it for analytics.
AWS EMR:
Amazon EMR (Elastic MapReduce) is a cloud-based big data processing service provided by
Amazon Web Services (AWS). It allows for the processing of large amounts of data using
popular open-source frameworks like Apache Spark, Hadoop, Hive, and Presto, among
others.
Microsoft Azure Data Factory: Azure Data Factory is a cloud-based ETL and data
integration service provided by Microsoft Azure. It allows you to create, schedule, and
manage data pipelines.
Google Cloud Dataflow: Dataflow is a fully managed stream and batch data processing
service provided by Google Cloud. It allows for building data pipelines using both batch and
streaming data.
Talend Cloud: Talend also offers a cloud-based version of their ETL platform, which
provides data integration and transformation capabilities in a cloud environment.
Matillion: Matillion is a cloud-native ETL platform that is purpose-built for popular cloud
data platforms like AWS, Google Cloud, and Snowflake. It offers native connectors to
various cloud data sources.
7) Maintenance/Ease of Use: ETL processes that involve an on-premises server require frequent maintenance by IT, given their fixed tables, fixed timelines, and the requirement to repeatedly select data to load and transform; newer automated, cloud-based ETL solutions require little maintenance. The ELT process typically requires low maintenance, given that all data is always available and the transformation process is usually automated and cloud-based.
8) Cost: ETL can be cost-prohibitive for many small and medium businesses. ELT benefits from a robust ecosystem of cloud-based platforms that offer much lower costs and a variety of plan options to store and process data.
9) Hardware: The traditional, on-premises ETL process requires expensive hardware; newer, cloud-based ETL solutions do not require hardware. Given that the ELT process is inherently cloud-based, no additional hardware is required.
10) Compliance: ETL is better suited for compliance with GDPR, HIPAA, and CCPA standards, given that users can omit any sensitive data prior to loading it into the target system. ELT carries more risk of exposing private data and not complying with those standards, given that all data is loaded into the target system.
AWS Glue
AWS Glue is a serverless data integration service that makes it easy for analytics users to
discover, prepare, move, and integrate data from multiple sources.
With AWS Glue, you can discover and connect to more than 70 diverse data sources and
manage your data in a centralized data catalog. You can visually create, run, and monitor
extract, transform, and load (ETL) pipelines to load data into your data lakes. Also, you can
immediately search and query cataloged data using Amazon Athena, Amazon EMR, and
Amazon Redshift Spectrum.
AWS Glue consolidates major data integration capabilities into a single service. These
include data discovery, modern ETL, cleansing, transforming, and centralized cataloging. It's
also serverless, which means there's no infrastructure to manage. With flexible support for all
workloads like ETL, ELT, and streaming in one service, AWS Glue supports users across
various workloads and types of users.
Also, AWS Glue makes it easy to integrate data across your architecture. It integrates with
AWS analytics services and Amazon S3 data lakes
Unify and search across multiple data stores – Store, index, and search across
multiple data sources and sinks by cataloging all your data in AWS.
Automatically discover data – Use AWS Glue crawlers to automatically infer
schema information and integrate it into your AWS Glue Data Catalog.
Manage schemas and permissions – Validate and control access to your databases
and tables.
Connect to a wide variety of data sources – Tap into multiple data sources, both on
premises and on AWS, using AWS Glue connections to build your data lake.
Transform, prepare, and clean data for analysis
You define jobs in AWS Glue to accomplish the work that's required to extract, transform,
and load (ETL) data from a data source to a data target. You typically perform the following
actions:
For data store sources, you define a crawler to populate your AWS Glue Data Catalog
with metadata table definitions. You point your crawler at a data store, and the
crawler creates table definitions in the Data Catalog. For streaming sources, you
manually define Data Catalog tables and specify data stream properties.
In addition to table definitions, the AWS Glue Data Catalog contains other metadata
that is required to define ETL jobs. You use this metadata when you define a job to
transform your data.
AWS Glue can generate a script to transform your data. Or, you can provide the script
in the AWS Glue console or API.
You can run your job on demand, or you can set it up to start when a
specified trigger occurs. The trigger can be a time-based schedule or an event.
When your job runs, a script extracts data from your data source, transforms the data, and
loads it to your data target. The script runs in an Apache Spark environment in AWS Glue.
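The following is a minimal sketch of what such a generated or hand-written Glue Spark script can look like: it reads a catalog table populated by a crawler, applies a simple mapping, and writes Parquet to S3. The database, table, column, and bucket names are placeholders; transformation_ctx and job.commit() are included because they enable job bookmarks.

    import sys
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read from a Data Catalog table populated by a crawler (placeholder names).
    source = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="raw_orders",
        transformation_ctx="source")            # enables job bookmarks

    # Rename and cast columns (placeholder mappings).
    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[("order id", "string", "order_id", "string"),
                  ("amount", "string", "amount", "double")],
        transformation_ctx="mapped")

    # Write the result to S3 as Parquet.
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://example-curated-bucket/orders/"},
        format="parquet",
        transformation_ctx="sink")

    job.commit()   # persists the bookmark state for incremental runs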
Crawlers and Data Catalog
Crawlers:
Crawlers are used to automatically discover and catalog metadata about your data
sources, which can be stored in various formats and locations, such as Amazon S3,
Amazon RDS, Amazon Redshift, and more.
Crawlers automatically scan and discover the schema and metadata of your data
sources. This includes information like column names, data types, and relationships
between tables.
Catalog Creation: Once the crawler has scanned the data, it creates or updates a
metadata catalog in the AWS Glue Data Catalog. The Data Catalog is a central
repository where information about your data sources is stored. This catalog can be
used by other AWS services and applications for tasks like data querying and
transformation.
Schema Evolution: Crawlers are capable of detecting changes in the underlying data,
such as new columns or modified data types. They can update the catalog to reflect
these changes.
Support for Multiple Data Formats: AWS Glue Crawlers support a wide range of
data formats including CSV, JSON, Parquet, Avro, and more.
Integration with Other AWS Services: The metadata catalog created by AWS Glue
Crawlers can be used by other AWS services like Amazon Athena (for querying data),
Amazon Redshift Spectrum, and Amazon EMR.
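For illustration, a crawler can also be created and started programmatically with boto3, as in the sketch below. The IAM role, database name, and S3 path are placeholders, and the SchemaChangePolicy shown is just one way to handle schema evolution.

    import boto3

    glue = boto3.client("glue")

    glue.create_crawler(
        Name="raw-orders-crawler",
        Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",   # placeholder role ARN
        DatabaseName="sales_db",                                 # target Data Catalog database
        Targets={"S3Targets": [{"Path": "s3://example-raw-bucket/orders/"}]},
        SchemaChangePolicy={
            "UpdateBehavior": "UPDATE_IN_DATABASE",   # pick up new or changed columns
            "DeleteBehavior": "LOG",                  # log removals instead of deleting tables
        },
    )
    glue.start_crawler(Name="raw-orders-crawler")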
Data Catalog
The AWS Glue Data Catalog is a centralized metadata repository provided by
Amazon Web Services (AWS) as part of the AWS Glue service.
It stores and organizes metadata about your data sources, transformations, and targets.
The AWS Glue Data Catalog is an index to the location, schema, and runtime metrics
of your data.
You use the information in the Data Catalog to create and monitor your ETL jobs.
Information in the Data Catalog is stored as metadata tables, where each table
specifies a single data store.
Typically, you run a crawler to take inventory of the data in your data stores, but there
are other ways to add metadata tables into your Data Catalog
The following workflow diagram shows how AWS Glue crawlers interact with data
stores and other elements to populate the Data Catalog.
The following is the general workflow for how a crawler populates the AWS Glue Data
Catalog:
A crawler runs any classifiers that you choose to infer the format and schema of your
data. You provide the code for custom classifiers, and they run in the order that you
specify.
The first custom classifier to successfully recognize the structure of your data is used
to create a schema. Custom classifiers lower in the list are skipped.
If no custom classifier matches your data's schema, built-in classifiers try to recognize
your data's schema. An example of a built-in classifier is one that recognizes JSON.
The crawler connects to the data store. Some data stores require connection properties
for crawler access.
The inferred schema is created for your data.
The crawler writes metadata to the Data Catalog. A table definition contains metadata
about the data in your data store. The table is written to a database, which is a
container of tables in the Data Catalog. Attributes of a table include classification,
which is a label created by the classifier that inferred the table schema.
Classifiers
In AWS Glue, classifiers are components that help identify the schema and structure
of your data when it's not immediately apparent. They're particularly useful for
handling data in formats that may not have explicit schema information or for custom
data formats.
A classifier reads the data in a data store. If it recognizes the format of the data, it
generates a schema. The classifier also returns a certainty number to indicate how
certain the format recognition was.
AWS Glue provides a set of built-in classifiers, but you can also create custom
classifiers. AWS Glue invokes custom classifiers first, in the order that you specify in
your crawler definition. Depending on the results that are returned from custom
classifiers, AWS Glue might also invoke built-in classifiers. If a classifier
returns certainty=1.0 during processing, it indicates that it's 100 percent certain that it
can create the correct schema. AWS Glue then uses the output of that classifier.
If no classifier returns certainty=1.0, AWS Glue uses the output of the classifier that
has the highest certainty. If no classifier returns a certainty greater than 0.0, AWS
Glue returns the default classification string of UNKNOWN.
Types of Classifiers:
AWS Glue provides several types of classifiers to handle different data formats:
Built-in Classifiers:
If AWS Glue doesn't find a custom classifier that fits the input data format with 100 percent
certainty, it invokes the built-in classifiers in a fixed order. The built-in classifiers return a
result to indicate whether the format matches (certainty=1.0) or does not match
(certainty=0.0). The first classifier that has certainty=1.0 provides the classification string and
schema for a metadata table in your Data Catalog.
These are pre-defined classifiers provided by AWS Glue for common data formats like CSV,
JSON, Avro, and others. AWS Glue uses these classifiers out-of-the-box to recognize these
formats.
XML Classifier: This classifier is used for processing XML files. It extracts
information about the structure of XML files, including elements, attributes, and
namespaces.
JSON Classifier: This classifier is used for JSON files. It helps identify the structure
of JSON data, including nested objects and arrays.
Grok Classifier: This classifier is used for log files that follow a specific pattern
known as "Grok patterns." Grok is a pattern-matching syntax used for parsing log
data.
Avro Classifier: This classifier is used for Apache Avro data serialization format. It
helps AWS Glue understand the schema of Avro data.
OpenCSVSerDe Classifier: This classifier is used for comma-separated values (CSV)
data. It helps identify the columns and data types in CSV files.
Custom Classifiers:
Custom Classifiers in AWS Glue are user-defined rules or patterns that you can create to help
AWS Glue understand the schema and structure of your data. They are particularly useful
when working with data formats that may not be covered by the built-in classifiers provided
by AWS Glue or when you have specific patterns in your data that require custom handling.
Here's how Custom Classifiers work in AWS Glue:
User-Defined Patterns: With Custom Classifiers, you can define your own patterns using
regular expressions or custom code. These patterns are used to identify the structure of your
data.
Handling Non-Standard Formats: If your data is in a format that doesn't conform to
commonly recognized standards (like CSV, JSON, Avro, etc.), a Custom Classifier can be
invaluable in helping AWS Glue understand the data.
Specialized Data Formats: Custom Classifiers are especially useful for dealing with
proprietary or specialized data formats that may not have pre-defined classifiers available.
Usage in Crawlers: When setting up a Crawler in AWS Glue to scan and catalog your data,
you can specify which classifiers to use. If none of the built-in classifiers are suitable for your
data, you can select a Custom Classifier that you've defined.
Schema Inference: Once a Custom Classifier identifies the structure of your data, AWS Glue
can use this information to infer the schema, including details like column names and data
types.
Data Cataloging: The information inferred from the data by the Custom Classifier is then
used to create or update entries in the AWS Glue Data Catalog, which is a central repository
for metadata about your data sources.
In summary, Custom Classifiers in AWS Glue give you the flexibility to define your own
rules for recognizing the structure of your data. This is particularly valuable when working
with non-standard or specialized data formats, allowing you to effectively catalog and process
your data in AWS Glue workflows.
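As a sketch, the boto3 calls below register a custom Grok classifier and attach it to a crawler so that it is tried before the built-in classifiers. The Grok pattern, names, role, and S3 path are hypothetical and only illustrate the shape of the API.

    import boto3

    glue = boto3.client("glue")

    # Hypothetical Grok pattern for a simple application log line.
    glue.create_classifier(
        GrokClassifier={
            "Name": "app-log-classifier",
            "Classification": "app_logs",
            "GrokPattern": "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:message}",
        }
    )

    # Crawlers invoke the listed custom classifiers before the built-in ones.
    glue.create_crawler(
        Name="app-log-crawler",
        Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",   # placeholder role ARN
        DatabaseName="logs_db",
        Targets={"S3Targets": [{"Path": "s3://example-log-bucket/app/"}]},
        Classifiers=["app-log-classifier"],
    )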
Options to create a job from scratch
Visual ETL – author in a visual interface focused on data flow
Code with a script editor – For those familiar with programming and writing ETL
scripts, choose this option to create a new Spark ETL job. Choose the engine (Python
shell, Ray, Spark (Python), or Spark (Scala)), then choose Start fresh or Upload script
to upload an existing script from a local file. If you choose to use the script editor, you
can't use the visual job editor to design or edit your job.
A Spark job is run in an Apache Spark environment managed by AWS Glue. By
default, new scripts are coded in Python.
Trigger
When fired, a trigger can start specified jobs and crawlers. A trigger fires on demand,
based on a schedule, or based on a combination of events.
Triggers allow you to schedule and automate the execution of your ETL (Extract,
Transform, Load) jobs or development endpoints based on specified criteria.
Triggers can be set up to run jobs at specific times, in response to events, or based on
a predefined schedule. For example, you can create a trigger to run a Glue job every
day at a certain time, or you can set up a trigger to launch a job when a new file is
added to an Amazon S3 bucket.
Only two crawlers can be activated by a single trigger. If you want to crawl multiple
data stores, use multiple sources for each crawler instead of running multiple crawlers
simultaneously.
A trigger can exist in one of several states. A trigger is
either CREATED, ACTIVATED, or DEACTIVATED. There are also transitional
states, such as ACTIVATING. To temporarily stop a trigger from firing, you can
deactivate it. You can then reactivate it later.
You can create a trigger for a set of jobs or crawlers based on a schedule. You can
specify constraints, such as the frequency that the jobs or crawlers run, which days
of the week they run, and at what time. These constraints are based on cron. When
you're setting up a schedule for a trigger, consider the features and limitations of
cron. For example, if you choose to run your crawler on day 31 each month, keep
in mind that some months don't have 31 days
Conditional
A trigger that fires when a previous job or crawler or multiple jobs or crawlers
satisfy a list of conditions.
When you create a conditional trigger, you specify a list of jobs and a list of
crawlers to watch. For each watched job or crawler, you specify a status to watch
for, such as succeeded, failed, timed out, and so on. The trigger fires if the watched
jobs or crawlers end with the specified statuses. You can configure the trigger to
fire when any or all of the watched events occur.
On-demand
A trigger that fires when you activate it. On-demand triggers never enter
the ACTIVATED or DEACTIVATED state. They always remain in
the CREATED state.
So that they are ready to fire as soon as they exist, you can set a flag to activate scheduled
and conditional triggers when you create them.
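The sketch below shows one scheduled and one conditional trigger created with boto3; the job names are placeholders. StartOnCreation=True is the flag mentioned above that activates the trigger as soon as it exists.

    import boto3

    glue = boto3.client("glue")

    # Scheduled trigger: run a job every day at 02:00 UTC.
    glue.create_trigger(
        Name="nightly-orders-trigger",
        Type="SCHEDULED",
        Schedule="cron(0 2 * * ? *)",
        Actions=[{"JobName": "orders-etl-job"}],     # placeholder job name
        StartOnCreation=True,
    )

    # Conditional trigger: start a second job only after the first succeeds.
    glue.create_trigger(
        Name="orders-aggregate-trigger",
        Type="CONDITIONAL",
        Predicate={
            "Logical": "AND",
            "Conditions": [{
                "LogicalOperator": "EQUALS",
                "JobName": "orders-etl-job",
                "State": "SUCCEEDED",
            }],
        },
        Actions=[{"JobName": "orders-aggregate-job"}],   # placeholder job name
        StartOnCreation=True,
    )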
Important
Jobs or crawlers that run as a result of other jobs or crawlers completing are referred to
as dependent. Dependent jobs or crawlers are only started if the job or crawler that completes
was started by a trigger. All jobs or crawlers in a dependency chain must be descendants of a
single scheduled or on-demand trigger.
Workflow
AWS Glue workflow is a collection of jobs, triggers, and crawlers that are
orchestrated to perform a data ETL process. Triggers initiate the workflow based on a
schedule or event, jobs perform the actual data processing, and crawlers discover and
catalog the data sources. The Data Catalog provides metadata management, and
connections enable Glue jobs to interact with external data stores.
A workflow in AWS Glue is a directed acyclic graph (DAG) of Glue entities (such as
jobs, triggers, and crawlers) that you can execute on a schedule.
A workflow consists of one or more jobs that are orchestrated to execute in a specific order.
Triggers:Triggers in AWS Glue are events that initiate the execution of a workflow. They can
be scheduled to run at specific times or event based(like the completion of a previous job).
Triggers can be set to run once or on a recurring schedule.
Crawlers:
Crawlers are used to automatically discover the structure of your source data and create
metadata tables in the AWS Glue Data Catalog. This is especially useful when working with
semi-structured or unstructured data.
Crawlers can be used to scan data stored in various data stores, like Amazon S3, Amazon
RDS, etc.
Connections:
A connection in AWS Glue defines the connection information to an external data store. This
includes details like the endpoint, port, username, password, etc.
Connections are used by Glue jobs to connect to data sources and targets.
Data Catalog:
The AWS Glue Data Catalog is a centralized metadata repository that stores information
about the data sources, targets, transformations, and schema definitions used by Glue jobs.
It provides a unified view of your data, making it easier to manage and query.
Workflows:
A workflow in AWS Glue is a logical grouping of jobs, triggers, and crawlers that define the
ETL process. It's represented as a directed acyclic graph (DAG) where nodes represent Glue
entities and edges represent dependencies between them.
Dependencies:
In a Glue workflow, jobs and triggers can have dependencies on other jobs, triggers, or
crawlers. This means that a job or trigger will only execute once its dependencies have
completed successfully.
Schedulers:
AWS Glue provides a scheduling mechanism through triggers. Triggers can be set up to run
jobs and workflows at specific times or based on events.
You can use cron expressions or specific event conditions to trigger jobs.
Error Handling and Monitoring:
AWS Glue provides logging and monitoring capabilities to track the progress and status of
your jobs and workflows. You can view logs in Amazon CloudWatch and set up notifications
for job completion or failure events.
Glue Optimisation Technique
Data Partitioning:
Partitioning involves dividing large datasets into smaller, more manageable pieces
based on certain criteria (e.g., date, region).
This can significantly reduce the amount of data that needs to be processed during
each job run, resulting in faster execution times and lower costs.
AWS Glue supports data partitioning, and it's important to set up partitions
appropriately for your use case.
Dynamic Frame Optimizations:
AWS Glue uses DynamicFrames, which are similar to DataFrames in Apache Spark.
When working with large datasets, you can use the repartition method to control the
number of partitions in a DynamicFrame.
It's important to choose an appropriate number of partitions to balance parallelism and
memory consumption.
Choosing the Right Worker Type and Number:
AWS Glue allows you to choose between different worker types (Standard, G.1X,
G.2X) with varying CPU and memory configurations.
Depending on the nature of your ETL workload, you should select the appropriate
worker type and number to ensure optimal performance.
Tuning DataFrames and DynamicFrames:
When working with DataFrames or DynamicFrames, consider using operations like
select, filter, and join selectively to limit the amount of data that needs to be
processed.
Avoid unnecessary transformations and apply filters early in the pipeline to reduce the
volume of data being processed.
Using Custom Classifiers:
AWS Glue allows you to define custom classifiers for your data sources. These
classifiers can help improve the accuracy and efficiency of schema detection, which is
important for understanding the structure of your data.
Optimizing Data Storage:
Consider using columnar storage formats like Parquet or ORC, which can
significantly reduce storage costs and improve query performance.
Compressing data can also save on storage costs and improve read/write performance.
Cost Management:
AWS Glue can be cost-effective, but it's important to monitor and manage costs.
Consider factors like worker type, number of workers, and data storage options to
optimize costs.
Indexing and Optimization for Data Stores:
If you're loading data into a data store like Amazon Redshift or Amazon RDS, make
sure to apply best practices for indexing and optimizing queries to improve
performance.
Optimize Data Access and Storage:
Partitioning: Divide large datasets into smaller partitions based on relevant columns to
reduce unnecessary data processing.
Columnar File Formats: Use columnar file formats like Parquet or ORC, which enable
efficient data filtering and querying.
Compression: Compress data to reduce storage costs and network bandwidth usage.
S3 Optimized Committers: Utilize EMRFS S3-optimized committer for faster S3
writes and reduced metadata overhead.
Job Bookmarks: Leverage job bookmarks to process only new or updated data,
avoiding repeated processing of unchanged data.
Push-down Predicates: Push filter conditions down to the data source to reduce the
amount of data transferred to Spark.
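As a short sketch of the partitioning and push-down ideas above, the Glue snippet below reads only the partitions selected by a push-down predicate and writes the result back as partitioned Parquet. The database, table, partition columns, and bucket path are assumptions.

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Prune partitions at the source instead of filtering after a full scan.
    filtered = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db",
        table_name="raw_orders",
        push_down_predicate="year == '2024' and month == '06'",
    )

    # Write columnar, partitioned output to reduce storage and future scan costs.
    glue_context.write_dynamic_frame.from_options(
        frame=filtered,
        connection_type="s3",
        connection_options={
            "path": "s3://example-curated-bucket/orders/",
            "partitionKeys": ["year", "month"],
        },
        format="parquet",
    )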
Optimize Memory Usage:
Tune Spark Memory Allocation: Adjust Spark memory allocation based on job
requirements to avoid Out-of-Memory (OOM) errors.
Utilize Off-Heap Memory: Leverage off-heap memory for large datasets and UDFs to
reduce memory pressure on the driver node.
Optimize PySpark UDFs: Avoid buffering large records in off-heap memory with
PySpark UDFs by moving select and filter operations upstream.
Cache Frequently Accessed Data: Cache frequently accessed data in memory to
improve performance.
Optimize Execution Plan:
Shuffle Optimization: Minimize shuffle operations by partitioning data effectively and
using broadcast joins for small join datasets.
Repartitioning: Optimize data distribution across partitions to balance workload and
improve parallelism.
Coalesce: Combine smaller partitions into larger ones to reduce overhead and
improve performance for tasks that operate on larger blocks of data.
Utilize REPARTITION or COALESCE for Spark SQL: Use these query hints to
control data partitioning and improve query execution.
Optimize Workload Management:
Workload Partitioning: Divide complex ETL pipelines into smaller, independent jobs
to improve parallelism and reduce job execution time.
Job Scheduling: Schedule jobs to avoid overloading the cluster and optimize resource
utilization.
Monitoring and Alerting: Set up monitoring and alerting mechanisms to identify
performance bottlenecks and resource usage issues.
Autoscaling: Utilize autoscaling to automatically adjust cluster capacity based on
workload demands.
Leverage AWS Glue Features:
Custom Workflows: Design custom workflows to orchestrate complex data processing
pipelines efficiently.
AWS Glue Spark Libraries: Utilize AWS Glue Spark libraries for common data
processing tasks, such as data quality checks and data deduplication.
AWS Glue Studio: Use AWS Glue Studio for visual job development, debugging, and
optimization.
Glue DataBrew
AWS Glue DataBrew is a visual data preparation tool that enables users to clean and
normalize data without writing any code. Using DataBrew helps reduce the time it takes to
prepare data for analytics and machine learning (ML) by up to 80 percent, compared to
custom developed data preparation.
You can choose from over 250 built-in transformations to combine, pivot, and transpose the
data without writing code. AWS Glue DataBrew also automatically recommends
transformations such as filtering anomalies, correcting invalid, incorrectly classified, or
duplicate data, normalizing data to standard date and time values, or generating aggregates
for analyses. For complex transformations, such as converting words to a common base or
root word, DataBrew provides transformations that use advanced machine learning
techniques such as Natural Language Processing (NLP).
You can group multiple transformations together, save them as recipes, and apply the recipes
directly to newly incoming data. For input data, AWS Glue DataBrew supports commonly
used file formats, such as comma-separated values (.csv), JSON and nested JSON, Apache
Parquet and nested Apache Parquet, and Excel sheets. For output data,
AWS Glue DataBrew supports comma-separated values (.csv), JSON, Apache Parquet,
Apache Avro, Apache ORC and XML
What is AWS Glue?
Describe the AWS Glue Architecture
What are the primary benefits of using AWS Data Brew?
Describe the four ways to create AWS Glue jobs
How does AWS Glue support the creation of no-code ETL jobs?
What is a connection in AWS Glue?
What are the Features of AWS Glue?
When to use a Glue Classifier?
What are the main components of AWS Glue?
What Data Sources are supported by AWS Glue?
What is AWS Glue Data Catalog?
The AWS Glue Schema Registry assists you by allowing you to validate and regulate the lifecycle
of streaming data using registered Apache Avro schemas at no cost. Apache Kafka, Amazon
Managed Streaming for Apache Kafka (MSK), Amazon Kinesis Data Streams, Apache Flink,
Amazon Kinesis Data Analytics for Apache Flink, and AWS Lambda benefit from Schema
Registry.
Validate schemas: Schemas used for data production are checked against schemas in a
central registry when data streaming apps are linked with AWS Glue Schema
Registry, allowing you to regulate data quality centrally.
Safeguard schema evolution: One of eight compatibility modes can be used to specify
criteria for how schemas can and cannot grow.
Improve data quality: Serializers compare data producers' schemas to those in the
registry, enhancing data quality at the source and avoiding downstream difficulties
caused by random schema drift.
Save costs: Serializers transform data into a binary format that can be compressed
before transferring, lowering data transfer and storage costs.
AWS Batch enables you to conduct any batch computing job on AWS with ease and
efficiency, regardless of the work type. AWS Batch maintains and produces computing
resources in your AWS account, giving you complete control over and insight into the
resources in use. AWS Glue is a fully-managed ETL solution that runs your ETL tasks in a
serverless Apache Spark environment. We recommend using AWS Glue for your ETL use
cases. AWS Batch might be a better fit for some batch-oriented use cases, such as ETL use
cases.
A tag is a label you apply to an Amazon Web Services resource. Each tag has a key and an
optional value, both of which are defined by you.
In AWS Glue, you may use tags to organize and identify your resources. Tags can be used to
generate cost accounting reports and limit resource access. You can restrict which users in
your AWS account have authority to create, update, or delete tags if you use AWS Identity
and Access Management.
How do I query metadata in Athena? What is the general workflow for how a
Crawler populates the AWS Glue Data Catalog?
AWS Glue Elastic Views makes it simple to create materialized views that integrate and
replicate data across various data stores without writing proprietary code. AWS Glue Elastic
Views can quickly generate a virtual materialized view table from multiple source data stores
using familiar Structured Query Language (SQL). AWS Glue Elastic Views moves data from
each source data store to a destination datastore and generates a duplicate of it. AWS Glue
Elastic Views continuously monitors data in your source data stores, and automatically
updates materialized views in your target data stores, ensuring that data accessed through the
materialized view is always up-to-date.
Use AWS Glue Elastic Views to aggregate and continuously replicate data across several
data stores in near-real-time. This is frequently the case when implementing new application
functionality requiring data access from one or more existing data stores. For example, a
company might use a customer relationship management (CRM) application to keep track of
customer information and an e-commerce website to handle online transactions. The data
would be stored in these apps or more data stores. The firm is now developing a new custom
application that produces and displays special offers for active website visitors.
AWS Glue DataBrew is a visual data preparation solution that allows data analysts and data
scientists to prepare data without writing code, using an interactive, point-and-click graphical
interface. You can simply visualize, clean, and normalize terabytes, even petabytes, of data
directly from your data lake, data warehouses, and databases, including Amazon S3, Amazon
Redshift, Amazon Aurora, and Amazon RDS, with Glue DataBrew.
Who can use AWS Glue DataBrew?
AWS Glue DataBrew is designed for users that need to clean and standardize data before
using it for analytics or machine learning. The most common users are data analysts and data
scientists. Business intelligence analysts, operations analysts, market intelligence analysts,
legal analysts, financial analysts, economists, quants, and accountants are examples of
employment functions for data analysts. Materials scientists, bioanalytical scientists, and
scientific researchers are all examples of employment functions for data scientists.
You can combine, pivot, and transpose data using over 250 built-in transformations without
writing code. AWS Glue DataBrew also suggests transformations such as filtering anomalies,
rectifying erroneous, wrongly classified, duplicate data, normalizing data to standard date and
time values, or generating aggregates for analysis automatically. Glue DataBrew enables
transformations that leverage powerful machine learning techniques such as Natural
Language Processing for complex transformations like translating words to a common base
or root word (NLP). Multiple transformations can be grouped, saved as recipes, and applied
straight to incoming data.
AWS Glue DataBrew accepts comma-separated values (.csv), JSON and nested JSON,
Apache Parquet and nested Apache Parquet, and Excel sheets as input data types. Comma-
separated values (.csv), JSON, Apache Parquet, Apache Avro, Apache ORC, and XML are
all supported as output data formats in AWS Glue DataBrew.
Do we need to use AWS Glue Data Catalog or AWS Lake Formation to use AWS
Glue DataBrew?
No. Without using the AWS Glue Data Catalog or AWS Lake Formation, you can use AWS
Glue DataBrew. DataBrew users can pick data sets from their centralized data catalog using
the AWS Glue Data Catalog or AWS Lake Formation.
What is the best practice for managing the credentials required by a Glue
connection?
The best practice is for the credentials to be stored & accessed securely by
leveraging AWS Systems Manager (SSM), AWS Secrets Manager or Amazon Key
Management Service (KMS)
Which streaming data sources does AWS Glue support?
AWS Glue supports Amazon Kinesis Data Streams, Apache Kafka, and Amazon Managed
Streaming for Apache Kafka (Amazon MSK).
What is an interactive session in AWS Glue and what are its benefits?
Interactive sessions in AWS Glue are essentially on-demand serverless Spark runtime
environments that allow rapid build and test of data preparation and analytics
applications. Interactive sessions can be used via the visual interface, AWS command
line or the API.
Using interactive sessions, you can author and test your scripts as Jupyter notebooks.
Glue supports a comprehensive set of Jupyter magics allowing developers to develop
rich data preparation or transformation scripts.
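For example, a Glue interactive session notebook often starts with a few configuration magics followed by ordinary PySpark code, roughly as sketched below. The magic names reflect the Glue documentation at the time of writing and may change; the database and table names are placeholders.

    # Cell 1: configure the session before any code runs (values are examples).
    %idle_timeout 30
    %glue_version 4.0
    %worker_type G.1X
    %number_of_workers 2

    # Cell 2: regular PySpark/Glue code; running it starts the serverless session.
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glue_context = GlueContext(SparkContext.getOrCreate())
    df = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="raw_orders").toDF()
    df.show(5)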
Explain why and when you would use AWS Glue compared to other options to
set up data pipelines
AWS Glue makes it easy to move data between data stores and as such, can be used in a
variety of data integration scenarios, including:
Data lake build & consolidation: Glue can extract data from multiple sources and
load the data into a central data lake powered by something like Amazon S3. (Related
Reading: Building Data Lakes on AWS using S3 and Glue)
Data migration: For large migration and modernization initiatives, Glue can help
move data from a legacy data store to a modern data lake or data warehouse.
Data transformation: Glue provides a visual workflow to transform data using a
comprehensive built-in transformation library or custom transformation using
PySpark
Data cataloging: Glue can assist data governance initiatives since it supports
automatic metadata cataloging across your data sources and targets, making it easy to
discover and understand data relationships.
When compared to other options for setting up data pipelines, such as Apache NiFi or
Apache Airflow, AWS Glue is typically a good choice if:
You want a fully managed solution: With Glue, you don’t have to worry about
setting up, patching, or maintaining any infrastructure.
Your data sources are primarily in AWS: Glue integrates natively with many AWS
services, such as S3, Redshift, and RDS.
You are constrained by programming skills availability: Glue’s visual workflow
makes it easy to create data pipelines in a no-code or low-code code way.
You need flexibility and scalability: Glue can scale automatically to meet demand
and can handle petabyte-scale data.
Can you highlight the role of AWS Glue in big data environments?
AWS Glue plays a pivotal role in big data environments as it provides the ability to
handle, process and transform large data sets in distributed and parallel environments.
AWS Glue is engineered for large-scale data processing. It can scale horizontally,
providing the capability to process petabytes of data efficiently and quickly.
AWS Glue is highly beneficial in a big data environment due to its serverless
architecture and integration capabilities with other AWS services.
AWS Glue is a fully managed ETL (extract, transform, and load) service that
makes it easy for customers to prepare and load their data for analytics. AWS
EMR, on the other hand, is a service that makes it easy to process large
amounts of data quickly and efficiently.
AWS Glue and EMR are both used for data processing, but they differ in how
they process data and in their typical use cases.
AWS Glue can be easily used to process both structured as well as
unstructured data while AWS EMR is typically suited for processing
structured or semi-structured data.
AWS Glue can automatically discover and categorize the data. AWS EMR
does not have that capability.
AWS Glue can be used to process streaming data or data in near-real-time,
while AWS EMR is typically used for scheduled batch processing.
Usage of AWS Glue is charged per DPU hour while EMR is charged per
underlying EC2 instance hour.
AWS Glue is easier to get started than EMR as Glue does not require
developers to have prior knowledge of MapReduce or Hadoop.
What are some ways to orchestrate glue jobs as part of a larger ETL flow?
Glue Workflows and AWS Step Functions are two ways to orchestrate glue jobs as part of
large ETL flows.
Can AWS Glue be used to convert log files into structured data?
Yes, AWS Glue is suitable for converting log files into structured data. Using the
AWS Glue Visual Canvas or by defining a custom Glue job, we can define custom
data transformations to structure log file data.
Glue makes it possible to aggregate logs from various sources into a common data
lake that makes it easy to access and maintain these logs.
Our company’s spend on AWS Glue is increasing rapidly. How can we optimize
our AWS Glue spend?
Cost optimization is a critical aspect of running workloads in the cloud and leveraging cloud
services, including AWS Glue. Ongoing cost optimization ensures we are making the most of
our cloud investments while reducing waste. When optimizing AWS Glue spend, the
following factors should be considered:
1. Use Glue Development Endpoints sparingly as these can get costly quickly.
2. Choose the right DPU allocation based on job complexity and requirements.
3. Optimize job concurrency
4. Use Glue job bookmarks to track processed data, allowing Glue to skip
previously processed records during incremental runs, thus reducing cost for
recurring jobs.
5. Some additional factors such as leveraging Glue Data Catalog, minimizing
costly transformations, etc.
EMR (Elastic MapReduce)
Amazon EMR (Elastic MapReduce) is a cloud-based big data platform provided by Amazon
Web Services (AWS). It allows for the processing of vast amounts of data quickly and cost-
effectively using popular frameworks such as Apache Hadoop, Apache Spark, Presto, and
Hive.
Scalability: EMR allows you to easily scale your cluster up or down based on your
processing needs. You can start with a small cluster and scale it to thousands of nodes if
necessary.
Managed Service: AWS EMR is a fully managed service, which means that AWS handles
the underlying infrastructure for you. This includes provisioning, monitoring, and managing
the compute resources.
Supported Frameworks:
Hadoop: EMR supports Apache Hadoop, which is an open-source framework for
processing and storing large datasets.
Spark: Apache Spark is another popular framework for big data processing that is
supported on EMR. It provides a more flexible and faster alternative to Hadoop's
MapReduce.
Presto: EMR supports Presto, an open-source distributed SQL query engine designed
for interactive analytics.
Hive and Pig: EMR also supports Hive and Pig, which are high-level languages for
querying and analyzing data in Hadoop.
Cost Management:
EMR provides features for cost optimization, such as the ability to use spot instances (which
are spare AWS capacity at a lower price) for cost savings.
Application Ecosystem:
EMR supports a wide range of applications and libraries that can be used for data processing,
analysis, and visualization.
EMR Notebooks:
EMR provides a notebook interface that allows you to create and manage Jupyter notebooks
for interactive data analysis and exploration.
EMR Studio:
EMR Studio is an integrated development environment for EMR that makes it easy to
develop, visualize, and debug big data applications.
Overview of Amazon EMR
The central component of Amazon EMR is the cluster. A cluster is a collection of Amazon
Elastic Compute Cloud (Amazon EC2) instances. Each instance in the cluster is called
a node. Each node has a role within the cluster, referred to as the node type. Amazon EMR
also installs different software components on each node type, giving each node a role in a
distributed application like Apache Hadoop.
The following diagram represents a cluster with one primary node and four core nodes.
Master Node:
The master node is responsible for coordinating the overall workflow of the cluster.
It manages the distribution of tasks to the core and task nodes and monitors their
health and status.
The master node hosts the Hadoop Distributed File System (HDFS) NameNode,
which manages the metadata for the Hadoop cluster's data.
Core Nodes:
Core nodes store the Hadoop Distributed File System (HDFS) data blocks and
participate in data processing.
They run both task and storage services, contributing to both computation and data
storage.
The number of core nodes determines the amount of storage capacity in the cluster, as
well as the parallel processing capability.
Task Nodes:
Task nodes are responsible only for processing data and running tasks assigned by the
master node.
They do not store persistent data, and their primary purpose is to contribute
computational power to the cluster.
Task nodes are useful for handling transient workloads or for scaling the cluster's
processing capacity without increasing storage.
When you run a cluster on Amazon EMR, you have several options as to how you specify the
work that needs to be done.
Provide the entire definition of the work to be done in functions that you specify as
steps when you create a cluster. This is typically done for clusters that process a set
amount of data and then terminate when processing is complete.
Create a long-running cluster and use the Amazon EMR console, the Amazon EMR
API, or the AWS CLI to submit steps, which may contain one or more jobs
Create a cluster, connect to the primary node and other nodes as required using SSH,
and use the interfaces that the installed applications provide to perform tasks and
submit queries, either scripted or interactively.
Processing data
When you launch your cluster, you choose the frameworks and applications to install for your
data processing needs. To process data in your Amazon EMR cluster, you can submit jobs or
queries directly to installed applications, or you can run steps in the cluster.
Submitting jobs directly to applications
You can submit jobs and interact directly with the software that is installed in your Amazon
EMR cluster. To do this, you typically connect to the primary node over a secure connection
and access the interfaces and tools that are available for the software that runs directly on
your cluster.
You can submit one or more ordered steps to an Amazon EMR cluster. Each step is a unit of
work that contains instructions to manipulate data for processing by software installed on the
cluster.
Generally, when you process data in Amazon EMR, the input is data stored as files in your
chosen underlying file system, such as Amazon S3 or HDFS. This data passes from one step
to the next in the processing sequence. The final step writes the output data to a specified
location, such as an Amazon S3 bucket.
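For instance, a Spark step can be submitted to a running cluster with boto3 as sketched below; the cluster ID and script location are placeholders. ActionOnFailure controls what happens to the remaining steps if this one fails.

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.add_job_flow_steps(
        JobFlowId="j-EXAMPLECLUSTERID",              # placeholder cluster ID
        Steps=[{
            "Name": "daily-orders-aggregation",
            "ActionOnFailure": "CANCEL_AND_WAIT",    # cancel remaining steps on failure
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://example-scripts-bucket/aggregate_orders.py",  # placeholder script
                ],
            },
        }],
    )
    print(response["StepIds"])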
The following diagram represents the step sequence and change of state for the steps as they
are processed.
If a step fails during processing, its state changes to FAILED. You can determine what
happens next for each step. By default, any remaining steps in the sequence are set
to CANCELLED and do not run if a preceding step fails. You can also choose to ignore the
failure and allow remaining steps to proceed, or to terminate the cluster immediately.
The following diagram represents the step sequence and default change of state when a step
fails during processing.
The following diagram represents the lifecycle of a cluster, and how each stage of the
lifecycle maps to a particular cluster state.
In Amazon EMR (Elastic MapReduce), instance groups and instance fleets are concepts
related to managing and configuring the EC2 instances that make up your EMR cluster.
These features allow you to define and manage the composition and behavior of the instances
in your cluster.
1. Instance Groups:
An instance group is a collection of Amazon EC2 instances within an EMR
cluster that share the same instance type and the same configuration.
There are three types of instance groups in EMR:
Core Instance Group: Instances in this group store Hadoop Distributed File
System (HDFS) data and run data processing tasks.
Master Instance Group: This group contains the master node, which
manages the distribution of tasks across the core and task nodes.
Task Instance Group: These instances are used to perform tasks and
are dynamically added or removed based on the load.
Each instance group is associated with an Amazon EC2 instance type, which
defines the computing resources (CPU, memory) available to instances in that
group.
You can specify the number of instances, instance type, and other
configurations for each instance group when you create an EMR cluster.
2. Instance Fleets:
Instance fleets provide a more flexible and efficient way to provision and
manage the instances in your EMR cluster compared to instance groups.
An instance fleet is a set of EC2 instance types and weighted capacity that
define the desired composition of instances in your cluster. It allows you to
specify multiple instance types and their weights, and EMR automatically
provisions and manages the instances based on your specifications.
This helps in optimizing costs and improving fault tolerance by diversifying
across multiple instance types.
Unlike instance groups, instance fleets allow EMR to automatically provision
instances based on the target capacity and instance type weights you define.
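The boto3 sketch below launches a small cluster with instance fleets, mixing On-Demand and Spot capacity across two instance types for the core fleet. The release label, roles, and instance types are assumptions chosen only to illustrate the fleet configuration.

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    cluster = emr.run_job_flow(
        Name="fleet-demo-cluster",
        ReleaseLabel="emr-6.15.0",                 # assumed EMR release
        Applications=[{"Name": "Spark"}],
        ServiceRole="EMR_DefaultRole",             # assumed service role
        JobFlowRole="EMR_EC2_DefaultRole",         # assumed EC2 instance profile
        Instances={
            "KeepJobFlowAliveWhenNoSteps": True,
            "InstanceFleets": [
                {
                    "InstanceFleetType": "MASTER",
                    "TargetOnDemandCapacity": 1,
                    "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}],
                },
                {
                    "InstanceFleetType": "CORE",
                    "TargetOnDemandCapacity": 2,
                    "TargetSpotCapacity": 4,        # diversify with Spot capacity
                    "InstanceTypeConfigs": [
                        {"InstanceType": "m5.xlarge", "WeightedCapacity": 1},
                        {"InstanceType": "r5.xlarge", "WeightedCapacity": 1},
                    ],
                },
            ],
        },
    )
    print(cluster["JobFlowId"])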
In summary, instance groups and instance fleets in AWS EMR provide mechanisms to
manage the composition and behavior of EC2 instances within your cluster. Instance groups
offer a more traditional and fixed approach, while instance fleets provide a more dynamic and
flexible way to manage instances, allowing for better cost optimization and fault tolerance.
The choice between them depends on your specific requirements and preferences.
Features of EMR
Amazon Elastic MapReduce (EMR) is a cloud-based big data platform provided by Amazon
Web Services (AWS). It simplifies the processing of large amounts of data by using popular
frameworks such as Apache Hadoop, Apache Spark, Apache Hive, Apache HBase, and
others. Here are some of the key features of AWS EMR:
1. Ease of Use:
AWS EMR provides a web-based console that makes it easy to launch and
manage clusters.
It integrates with other AWS services, allowing seamless interaction with
storage, databases, and other resources.
2. Flexibility:
Supports a variety of popular big data frameworks, including Apache Hadoop,
Apache Spark, Apache Hive, Apache HBase, and more.
Allows you to run custom applications and frameworks on EMR clusters.
3. Cluster Configuration:
Provides the ability to customize the size and configuration of the cluster
based on your specific workload requirements.
Allows both on-demand and spot instances to optimize costs.
4. Security:
Offers integration with AWS Identity and Access Management (IAM) for
access control.
Supports encryption of data at rest and in transit using AWS Key Management
Service (KMS).
Enables fine-grained access controls for data stored on Amazon S3.
5. Managed Scaling:
Automatically adjusts the size of the cluster based on the workload, helping to
optimize performance and costs.
Supports manual resizing of clusters for specific use cases.
6. Integration with AWS Services:
Seamless integration with other AWS services, such as Amazon S3, Amazon
DynamoDB, Amazon RDS, and more.
EMR can read and write data directly to and from Amazon S3, making it easy
to work with data stored in S3.
7. Logging and Monitoring:
Provides detailed logging and monitoring capabilities through AWS
CloudWatch and other tools.
Allows you to configure and monitor application-specific metrics.
8. Data Lake and Data Catalog Integration:
Supports integration with AWS Glue Data Catalog, making it easier to
discover and manage metadata.
Allows you to build data lakes on Amazon S3 and query data using
EMR.
9. Application Ecosystem:
Supports a wide range of applications and libraries within the Hadoop
and Spark ecosystems.
Provides pre-configured Amazon Machine Images (AMIs) for popular
big data frameworks.
10. Cost Optimization:
Supports spot instances for cost-effective cluster provisioning.
Allows you to use Reserved Instances to reduce costs for long-running
clusters.
11. Multi-Step Workflows:
Supports the creation of multi-step workflows using Apache Oozie,
making it easy to manage complex data processing workflows.
Optimization techniques for AWS EMR:
1. Instance Types and Sizes:
Choose appropriate EC2 instance types and sizes based on the nature of your workload.
Instances with more CPU or memory might be more suitable for certain tasks.
Utilize spot instances for cost savings, but be aware of the possibility of interruption.
2. Cluster Sizing:
Adjust the number of instances in your EMR cluster based on the size of your data and the
complexity of your processing tasks.
Use AWS Auto Scaling to automatically adjust the size of your cluster based on demand.
3. Instance Groups:
Leverage instance groups to allocate and manage resources efficiently.
Use core and task instance groups appropriately. Core nodes store HDFS data, while task
nodes are for processing only.
4. Instance Fleets:
Consider using instance fleets for better control over instance types and pricing models.
5. Bootstrap Actions:
Use bootstrap actions to install additional software, libraries, or configurations needed for
your specific use case.
6. Spot Instances:
Take advantage of spot instances to reduce costs, but be prepared for potential interruptions.
Consider using a mix of on-demand and spot instances for a balance between cost and
reliability.
7. EMR Release Version:
Keep your EMR cluster up to date with the latest release versions to benefit from
performance improvements, bug fixes, and new features.
8. Storage Optimization:
Optimize storage configurations, including the choice of EBS volumes and S3 storage
options.
Use instance store volumes for temporary data to avoid unnecessary EBS costs.
9. Data Compression:
Compress your data to reduce storage costs and improve data transfer efficiency. Choose
appropriate compression codecs based on your data and processing requirements.
10. Task Instance Groups:
Use task instance groups for transient and ephemeral workloads to further reduce costs.
11. Custom AMIs (Amazon Machine Images):
Create custom AMIs with pre-installed software and configurations to reduce cluster startup
times.
12. Monitoring and Logging:
Use AWS CloudWatch to monitor cluster performance, set up alarms, and identify
bottlenecks.
Enable logging to Amazon S3 for EMR cluster logs to facilitate debugging and optimization.
13. Tune Spark and Hadoop Configurations:
Adjust Spark and Hadoop configurations based on your specific workloads to optimize
performance.
14. Use EMR Notebooks:
Consider using EMR notebooks for interactive data exploration and analysis.
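As a hedged illustration of two of the techniques above (spot task nodes and Spark tuning via configurations), the create-cluster fragment below adds a spot task instance group and a spark-defaults classification; the bid price, memory settings, and instance types are placeholders you would tune for your own workload.
# Sketch: spot task group plus Spark executor tuning at cluster creation.
aws emr create-cluster \
  --name "tuned-cluster" \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --instance-groups \
      InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge \
      InstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge \
      InstanceGroupType=TASK,InstanceCount=4,InstanceType=m5.xlarge,BidPrice=0.10 \
  --configurations '[{"Classification":"spark-defaults","Properties":{"spark.executor.memory":"4g","spark.executor.cores":"2"}}]' \
  --use-default-roles
# BidPrice marks the task group as Spot; omit it to keep the group On-Demand.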
EMR Serverless
Amazon EMR Serverless is a new deployment option for Amazon EMR. EMR Serverless
provides a serverless runtime environment that simplifies the operation of analytics
applications that use the latest open source frameworks, such as Apache Spark and Apache
Hive. With EMR Serverless, you don’t have to configure, optimize, secure, or operate
clusters to run applications with these frameworks.
EMR Serverless helps you avoid over- or under-provisioning resources for your data
processing jobs. EMR Serverless automatically determines the resources that the application
needs, gets these resources to process your jobs, and releases the resources when the jobs
finish. For use cases where applications need a response within seconds, such as interactive
data analysis, you can pre-initialize the resources that the application needs when you create
the application.
With EMR Serverless, you'll continue to get the benefits of Amazon EMR, such as open
source compatibility, concurrency, and optimized runtime performance for popular
frameworks.
EMR Serverless is suitable for customers who want ease in operating applications using open
source frameworks. It offers quick job startup, automatic capacity management, and
straightforward cost controls.
Concepts
In this section, we cover EMR Serverless terms and concepts that appear throughout our
EMR Serverless User Guide.
Release version
An Amazon EMR release is a set of open-source applications from the big data ecosystem.
Each release includes different big data applications, components, and features that you select
for EMR Serverless to deploy and configure so that they can run your applications. When you
create an application, you must specify its release version. Choose the Amazon EMR release
version and the open source framework version that you want to use in your application.
Application
With EMR Serverless, you can create one or more EMR Serverless applications that use open
source analytics frameworks. To create an application, you must specify the following
attributes:
The Amazon EMR release version for the open source framework version that you want to
use. To determine your release version, see Amazon EMR Serverless release versions.
The specific runtime that you want your application to use, such as Apache Spark or Apache
Hive.
After you create an application, you can submit data-processing jobs or interactive requests to
your application.
Each EMR Serverless application runs on a secure Amazon Virtual Private Cloud (VPC)
strictly apart from other applications. Additionally, you can use AWS Identity and Access
Management (IAM) policies to define which users and roles can access the application. You
can also specify limits to control and track usage costs incurred by the application.
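A hedged sketch of creating a Spark application from the AWS CLI follows; the application name and release label are placeholders and should match a supported EMR Serverless release.
# Sketch: create an EMR Serverless application for Spark workloads.
aws emr-serverless create-application \
  --name my-spark-app \
  --type SPARK \
  --release-label emr-6.15.0
# The response includes an applicationId that later job runs reference.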
Job run
A job run is a request submitted to an EMR Serverless application that the application
asynchronously executes and tracks through completion. Examples of jobs include a HiveQL
query that you submit to an Apache Hive application, or a PySpark data processing script that
you submit to an Apache Spark application. When you submit a job, you must specify a
runtime role, authored in IAM, that the job uses to access AWS resources, such as Amazon
S3 objects. You can submit multiple job run requests to an application, and each job run can
use a different runtime role to access AWS resources. An EMR Serverless application starts
executing jobs as soon as it receives them and runs multiple job requests concurrently. To
learn more about how EMR Serverless runs jobs, see Running jobs.
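As a rough sketch of submitting a job run from the AWS CLI (the application ID, runtime role ARN, and script location below are placeholders):
# Sketch: submit a PySpark job run with its IAM runtime role.
aws emr-serverless start-job-run \
  --application-id 00example123456 \
  --execution-role-arn arn:aws:iam::123456789012:role/EMRServerlessJobRole \
  --name nightly-etl \
  --job-driver '{"sparkSubmit": {"entryPoint": "s3://my-bucket/scripts/etl.py","sparkSubmitParameters": "--conf spark.executor.memory=4g"}}'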
Workers
An EMR Serverless application internally uses workers to execute your workloads. The
default sizes of these workers are based on your application type and Amazon EMR release
version. When you schedule a job run, you can override these sizes.
When you submit a job, EMR Serverless computes the resources that the application needs
for the job and schedules workers. EMR Serverless breaks down your workloads into tasks,
downloads images, provisions and sets up workers, and decommissions them when the job
finishes. EMR Serverless automatically scales workers up or down based on the workload
and parallelism required at every stage of the job. This automatic scaling removes the need
for you to estimate the number of workers that the application needs to run your workloads.
Pre-initialized capacity
EMR Serverless provides a pre-initialized capacity feature that keeps workers initialized and
ready to respond in seconds. This capacity effectively creates a warm pool of workers for an
application. To configure this feature for each application, set the initial-capacity parameter
of an application. When you configure pre-initialized capacity, jobs can start immediately so
that you can implement iterative applications and time-sensitive jobs. To learn more about
pre-initialized workers, see Configuring an application.
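A hedged sketch of setting pre-initialized capacity on an existing application; the worker counts and sizes are placeholders, and the exact JSON shape of the initial-capacity parameter should be checked against the EMR Serverless documentation for your CLI version.
# Sketch: keep one warm Spark driver and four warm executors ready.
aws emr-serverless update-application \
  --application-id 00example123456 \
  --initial-capacity '{"DRIVER": {"workerCount": 1,"workerConfiguration": {"cpu": "2vCPU","memory": "4GB"}},"EXECUTOR": {"workerCount": 4,"workerConfiguration": {"cpu": "4vCPU","memory": "8GB"}}}'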
EMR Studio
EMR Studio is the user console that you can use to manage your EMR Serverless
applications. If an EMR Studio doesn't exist in your account when you create your first EMR
Serverless application, we automatically create one for you. You can access EMR Studio
either from the Amazon EMR console, or you can turn on federated access from your identity
provider (IdP) through IAM or IAM Identity Center. When you do this, users can access
Studio and manage EMR Serverless applications without direct access to the Amazon EMR
console.
1. Explain the architecture of Amazon Elastic MapReduce (EMR) and how it enables
effective data processing and analysis?
Amazon EMR architecture consists of a cluster with one master node, core nodes, and task
nodes. The master node manages the cluster, while core nodes store data in Hadoop
Distributed File System (HDFS) and run tasks. Task nodes only execute tasks without storing
data.
EMR uses Apache Hadoop, an open-source framework that processes large datasets across
distributed clusters. It leverages MapReduce programming model for parallel processing,
enabling efficient data analysis. Additionally, EMR supports other frameworks like Spark,
Hive, and Presto for diverse analytical needs.
EMR integrates with AWS services such as S3, DynamoDB, and Redshift, facilitating
seamless data storage and retrieval. Autoscaling adjusts the number of nodes based on
workload, optimizing resource usage and cost. Spot Instances further reduce costs by utilizing
spare EC2 capacity.
Security is ensured through encryption, IAM roles, VPCs, and private subnets. Monitoring
and logging are available via Amazon CloudWatch and EMR Console, allowing performance
tracking and issue resolution.
2. How does Amazon EMR differ from traditional Hadoop and Spark clusters? What are
the specific advantages and limitations of using Amazon EMR?
Amazon EMR differs from traditional Hadoop and Spark clusters by providing a managed,
scalable, and cost-effective service for big data processing. It simplifies cluster setup,
management, and scaling while integrating with other AWS services.
Advantages of Amazon EMR include:
1. Easy setup: Quick cluster creation with pre-configured applications.
2. Scalability: Automatic resizing based on workload demands.
3. Cost-effectiveness: Pay-as-you-go pricing model and Spot Instances support.
4. Integration: Seamless integration with AWS ecosystem (S3, DynamoDB, etc.).
5. Security: Built-in security features like encryption and IAM roles.
Limitations of Amazon EMR are:
1. Vendor lock-in: Limited to the AWS environment.
2. Customization constraints: Less flexibility compared to self-managed clusters.
3. Latency: Potential latency when accessing data stored outside the cluster (for example, in
S3 via EMRFS) rather than in local HDFS.
3. How do you optimize the performance of an EMR job? What factors should be
considered?
To optimize EMR job performance, consider these factors:
1. Cluster Configuration: Choose appropriate instance types and sizes based on workload
requirements. Utilize Spot Instances for cost savings.
2. Data Storage: Use HDFS or Amazon S3 with consistent view enabled to store data.
Optimize S3 read/write operations using partitioning and compression techniques.
3. Task Distribution: Balance the number of mappers and reducers according to input data
size and processing complexity. Configure speculative execution to handle slow tasks.
4. Tuning Parameters: Adjust Hadoop/Spark configurations such as memory allocation,
garbage collection settings, and parallelism levels to improve resource utilization.
5. Monitoring & Logging: Enable CloudWatch metrics and logs to identify bottlenecks and
track performance improvements over time.
6. Code Optimization: Profile application code to find inefficiencies and use efficient
algorithms and data structures.
4. Describe the process of resizing an Amazon EMR cluster. What are some best practices
to maintain high availability and optimal performance while resizing a cluster?
Resizing an Amazon EMR cluster involves modifying the number of instances in the cluster
to meet changing workload demands. There are two types of resizing: scale-out (adding
instances) and scale-in (removing instances).
To resize a cluster, use the AWS Management Console, CLI, or SDKs. First, identify the
instance groups you want to modify, then change their target capacities accordingly.
Best practices for maintaining high availability and optimal performance while resizing
include:
1. Use Auto Scaling policies to automatically adjust cluster size based on
predefined CloudWatch metrics.
2. Resize during periods of low demand to minimize impact on running jobs.
3. Monitor key performance indicators (KPIs) like CPU utilization, memory
usage, and HDFS capacity to determine when resizing is necessary.
4. Opt for uniform instance groups with similar configurations to simplify
management and ensure consistent performance.
5. Test different cluster sizes and configurations to find the best balance
between cost and performance.
6. Consider using Spot Instances for cost savings but be prepared for potential
interruptions.
7. Implement data backup strategies to prevent data loss during resizing
operations.
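For reference, a manual resize from the AWS CLI looks roughly like this (the cluster and instance-group IDs are placeholders):
# Sketch: look up the instance group, then change its target instance count.
aws emr list-instance-groups --cluster-id j-2EXAMPLE1234
aws emr modify-instance-groups \
  --cluster-id j-2EXAMPLE1234 \
  --instance-groups InstanceGroupId=ig-EXAMPLE5678,InstanceCount=6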
5. Discuss the use of EMR File System (EMRFS) in Amazon EMR. What benefits does it
provide when compared to HDFS?
EMR File System (EMRFS) is an implementation of the Hadoop file system interface that
allows Amazon EMR clusters to utilize data stored in Amazon S3. It provides several
benefits compared to traditional HDFS:
1. Scalability: EMRFS can scale horizontally, allowing for increased storage and throughput
without impacting cluster performance.
2. Durability: Data stored in S3 has 11 nines durability, reducing the risk of data loss.
3. Cost-effectiveness: Users pay only for the storage they use in S3, avoiding over-
provisioning costs associated with HDFS.
4. Flexibility: EMRFS enables sharing of data across multiple EMR clusters or other AWS
services, simplifying data management.
5. Consistency: EMRFS offers consistent view feature, ensuring read-after-write consistency
for objects written by EMRFS or other S3 clients.
6. Security: Integration with AWS Identity and Access Management (IAM) allows granular
access control to data stored in S3.
6. Explain the role of spot instances in Amazon EMR, and how it can be used to achieve
cost-effective resource allocation.
Spot instances in Amazon EMR play a crucial role in cost-effective resource allocation by
allowing users to bid on unused EC2 capacity at a lower price than On-Demand instances.
When the bid price exceeds the current Spot price, the instances are provisioned and added to
the EMR cluster.
To achieve cost-effective resource allocation, users can specify a percentage of their core and
task nodes as spot instances during cluster creation or modify an existing cluster’s instance
groups configuration. By doing so, they leverage the cost savings from spot instances while
maintaining the stability of the cluster with On-Demand instances for critical components like
master nodes.
Additionally, users can set up instance fleets to define multiple instance types and bidding
strategies, enabling EMR to automatically provision the most cost-effective combination of
instances based on available capacity and user-defined constraints.
However, it is essential to consider that spot instances may be terminated when the Spot price
rises above the bid price or due to capacity constraints. To mitigate this risk, users should
implement checkpointing and data replication strategies to ensure minimal impact on job
progress and data integrity.
7. What are the different security configurations available in Amazon EMR, and how can
the security of an Amazon EMR cluster be improved?
Amazon EMR security configurations include:
1. Identity and Access Management (IAM): Define roles for EMR clusters, EC2 instances,
and service access control.
2. Encryption: Use AWS Key Management Service (KMS) to encrypt data at rest (HDFS, S3)
and in transit (Spark, MapReduce).
3. Network Isolation: Utilize Amazon VPCs, subnets, and security groups to isolate resources
and restrict traffic.
4. Logging and Monitoring: Enable CloudTrail, CloudWatch, and EMRFS audit logs for
tracking user activities and cluster performance.
5. Authentication: Integrate with Kerberos or LDAP for secure authentication of users and
services.
6. Authorization: Implement Apache Ranger or similar tools for fine-grained access control
over Hadoop components.
To improve the security of an Amazon EMR cluster:
– Regularly review and update IAM policies, ensuring least privilege access.
– Enforce encryption for sensitive data and communication channels.
– Limit network exposure by using private subnets and strict security group rules.
– Monitor logs for suspicious activity and set up alerts for potential threats.
– Keep software versions updated and apply security patches promptly.
8. Describe the different types of EMR clusters (transient and long-running) and their
appropriate use cases.
Transient and long-running EMR clusters serve distinct purposes in data processing.
Transient clusters are temporary, created for specific tasks like batch processing or ETL jobs.
They’re cost-effective as they auto-terminate upon job completion, minimizing resource
usage. Use cases include log analysis, recommendation engines, and data transformations.
Long-running clusters persist even after job completion, suitable for interactive analytics or
streaming applications. They enable continuous data ingestion and real-time processing. Use
cases encompass real-time fraud detection, IoT data processing, and ad-hoc querying using
tools like Apache Zeppelin or Jupyter notebooks.
9. Can you explain how Amazon EMR supports the use of custom machine learning (ML)
algorithms? What is the process for integrating custom ML libraries into an EMR cluster?
Amazon EMR supports custom ML algorithms by allowing users to install and configure
additional libraries, frameworks, or applications on the cluster. This flexibility enables
integration of custom ML libraries into an EMR cluster.
To integrate custom ML libraries, follow these steps:
1. Create a bootstrap action script that installs and configures the required dependencies for
your custom library.
2. Upload the script to Amazon S3.
3. Launch an EMR cluster with the specified bootstrap action using AWS Management
Console, CLI, or SDKs.
4. Develop your ML application using the custom library and upload it to S3.
5. Add a step in the EMR cluster to execute your ML application, referencing the uploaded
code in S3.
6. Monitor the progress of your application through the EMR console or logs stored in S3.
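Step 5 above can be done from the AWS CLI roughly as follows (the cluster ID and S3 path are placeholders):
# Sketch: add a Spark step that runs a custom ML training script from S3.
aws emr add-steps \
  --cluster-id j-2EXAMPLE1234 \
  --steps Type=Spark,Name="CustomMLTraining",ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,s3://my-bucket/ml/train_model.py]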
10. How does Amazon EMR differ from AWS Data Pipeline, and when would you choose one
over the other?
Amazon EMR is a managed Hadoop framework that simplifies big data processing, while
Amazon Data Pipeline is a web service for data movement and transformation. Key
differences include:
1. Purpose: EMR focuses on distributed data processing using Hadoop ecosystem tools,
whereas Data Pipeline orchestrates data workflows across various AWS services.
2. Scalability: EMR automatically scales underlying infrastructure, while Data Pipeline
requires manual scaling of resources.
3. Flexibility: EMR supports multiple programming languages and frameworks, but Data
Pipeline is limited to AWS-specific components.
Choose EMR when dealing with large-scale data processing tasks requiring complex
analytics or machine learning capabilities. Opt for Data Pipeline when orchestrating data
workflows between AWS services, focusing on data movement and simple transformations.
11. Discuss how Amazon EMR integrates with AWS Glue, AWS Lake Formation, and
Amazon Athena. How can these services complement each other?
Amazon EMR integrates with AWS Glue, AWS Lake Formation, and Amazon Athena to
create a comprehensive data processing ecosystem.
AWS Glue is a serverless ETL service that simplifies data extraction, transformation, and
loading tasks. It provides an interface for defining jobs and crawlers, which can discover and
catalog metadata in the AWS Glue Data Catalog. EMR can use this catalog as a central
repository for schema information, enabling seamless integration between various big data
applications.
AWS Lake Formation streamlines the process of setting up, securing, and managing data
lakes. It automates tasks like data ingestion, cleaning, and cataloging. EMR clusters can
access data stored in a lake created by Lake Formation, leveraging its security policies and
permissions for fine-grained access control.
Amazon Athena is an interactive query service that allows users to analyze data in Amazon
S3 using standard SQL. By integrating with the AWS Glue Data Catalog, Athena can utilize
the same metadata as EMR, ensuring consistency across queries. This enables users to run ad-
hoc analyses on data processed by EMR without needing to set up additional infrastructure.
12. Explain how to monitor the performance of an Amazon EMR cluster in real-time.
What metrics and tools are used for this purpose?
To monitor the performance of an Amazon EMR cluster in real-time, use Amazon
CloudWatch and Ganglia. CloudWatch provides metrics such as CPU utilization, memory
usage, and disk I/O operations. Enable detailed monitoring for more frequent data points.
Create custom alarms to notify you when specific thresholds are breached.
Ganglia is an open-source tool that offers a web-based interface for visualizing cluster
performance. Install it on your EMR cluster by adding the “Ganglia” application during
cluster creation or modifying an existing cluster. Access the Ganglia dashboard through the
EMR Console or directly via its URL.
Combine both tools for comprehensive monitoring: CloudWatch for metric collection and
alerting, and Ganglia for visualization and historical analysis.
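As one concrete example of the alerting mentioned above, a hedged sketch of a CloudWatch alarm on the EMR IsIdle metric (the cluster ID and SNS topic ARN are placeholders):
# Sketch: alert when the cluster has been idle for three consecutive 5-minute periods.
aws cloudwatch put-metric-alarm \
  --alarm-name emr-cluster-idle \
  --namespace AWS/ElasticMapReduce \
  --metric-name IsIdle \
  --dimensions Name=JobFlowId,Value=j-2EXAMPLE1234 \
  --statistic Average --period 300 --evaluation-periods 3 \
  --threshold 1 --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts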
13. How does Amazon EMR handle data durability and data loss prevention? Discuss the
various data backup options available.
Amazon EMR ensures data durability and loss prevention through replication, backup, and
recovery mechanisms. It leverages Hadoop’s HDFS for distributed storage, which replicates
data blocks across multiple nodes to prevent data loss due to node failures.
For additional data protection, Amazon EMR offers various backup options:
1. S3 DistCp: Use the S3 DistCp tool to copy data from HDFS to Amazon S3 periodically,
providing a durable backup of your data in case of cluster failure or termination.
2. EMR File System (EMRFS): EMRFS allows direct access to data stored in Amazon S3,
enabling you to use it as a persistent storage layer. This eliminates the need to store data on
local HDFS, reducing the risk of data loss.
3. AWS Glue Data Catalog: Integrate with AWS Glue Data Catalog to store metadata about
your data, making it easier to discover, search, and manage datasets.
4. Snapshots: Create snapshots of EBS volumes attached to EMR instances for point-in-time
backups. These can be used to restore data if needed.
5. Cross-Region Replication: Enable cross-region replication in Amazon S3 to automatically
replicate data across different regions, ensuring high availability and disaster recovery.
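Option 1 above (S3DistCp) is usually run as a step on the cluster itself; a hedged sketch with placeholder paths:
# Sketch: copy an HDFS directory to S3 using s3-dist-cp via command-runner.jar.
aws emr add-steps \
  --cluster-id j-2EXAMPLE1234 \
  --steps Type=CUSTOM_JAR,Name="BackupHdfsToS3",Jar=command-runner.jar,Args=["s3-dist-cp","--src","hdfs:///user/hadoop/data","--dest","s3://my-backup-bucket/data/"]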
14. Describe the bootstrap actions in Amazon EMR. What are their use cases, and how can
you create custom bootstrap actions?
Bootstrap actions in Amazon EMR are scripts executed on cluster nodes during the setup
phase before Hadoop starts. They enable customization of clusters, such as installing
additional software or configuring system settings.
Use cases include:
1. Installing custom applications
2. Modifying configuration files
3. Adjusting system parameters
To create a custom bootstrap action, follow these steps:
1. Write a script (e.g., Bash) to perform desired tasks.
2. Upload the script to an S3 bucket.
3. Specify the S3 location when creating the EMR cluster using AWS Management Console,
CLI, or SDKs.
Example: Install Python 3 and pip:
#!/bin/bash
# Bootstrap action script: runs on each node before applications start.
sudo yum install -y python3                     # install Python 3
curl -O https://bootstrap.pypa.io/get-pip.py            # download the pip installer
python3 get-pip.py --user                       # install pip for the current user
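Once the script is uploaded (to, say, the placeholder location s3://my-bucket/bootstrap/install-python3.sh), step 3 above looks roughly like this:
# Sketch: reference the bootstrap action when creating the cluster.
aws emr create-cluster \
  --name "cluster-with-bootstrap" \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge \
  --bootstrap-actions Path=s3://my-bucket/bootstrap/install-python3.sh,Name="InstallPython3" \
  --use-default-roles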
15. How does Amazon EMR handle instance failures? Explain the process of automatic
failover and recovery.
Amazon EMR handles instance failures through automatic failover and recovery
mechanisms. When a failure is detected, the system takes several steps to recover:
1. Identifies failed instances: EMR monitors cluster health and detects when an instance
becomes unresponsive or fails.
2. Reassigns tasks: Tasks running on the failed instance are reassigned to other available
instances in the cluster.
3. Launches replacement instances: EMR automatically provisions new instances to replace
the failed ones, maintaining the desired capacity.
4. Recovers data: If the failed instance was a core node with HDFS data, EMR recovers the
data by replicating it from other nodes.
5. Updates metadata: The system updates metadata about the cluster’s state, including
information about active instances and task assignments.
This process ensures minimal disruption to ongoing processing jobs and maintains data
integrity.
16. Can you discuss the concept of data locality in Amazon EMR and its impact on job
performance?
Data locality in Amazon EMR refers to the placement of data and computation tasks on
nodes within a cluster. It aims to minimize data movement across the network, thus
improving job performance. Hadoop Distributed File System (HDFS) stores data in blocks,
which are distributed across multiple nodes. When processing data, EMR attempts to
schedule tasks on nodes where the required data is already present.
There are three levels of data locality: node-local, rack-local, and off-switch. Node-local
means data and task reside on the same node; rack-local indicates they’re on different nodes
but within the same rack; off-switch implies they’re on separate racks. EMR prioritizes node-
local assignments, followed by rack-local, then off-switch.
Job performance benefits from data locality as it reduces network congestion and latency.
However, strict adherence to data locality may lead to resource underutilization or
imbalanced workloads. To address this, EMR uses delay scheduling, waiting for a short
period before assigning non-local tasks, allowing time for local resources to become
available.
17. What role does auto-scaling play in Amazon EMR, and how does it improve resource
usage and cost-efficiency?
Auto-scaling in Amazon EMR plays a crucial role in optimizing resource usage and cost-
efficiency. It dynamically adjusts the number of instances in an EMR cluster based on
workload demands, ensuring optimal performance while minimizing costs.
Auto-scaling helps improve efficiency by:
1. Automatically adding instances when demand increases, preventing bottlenecks and
maintaining high throughput.
2. Removing excess instances during low-demand periods, reducing unnecessary expenses.
3. Balancing workloads across instances to maximize utilization.
4. Allowing users to define scaling policies based on custom CloudWatch metrics or
predefined YARN metrics.
5. Supporting both core nodes (HDFS storage) and task nodes (processing only), enabling
fine-grained control over cluster resources.
6. Integrating with Spot Instances for additional cost savings without compromising
availability.
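Besides custom auto-scaling policies, EMR managed scaling can be attached to a running cluster; a hedged sketch with placeholder IDs and limits (the exact shorthand shape should be checked against the current CLI documentation):
# Sketch: let EMR managed scaling keep the cluster between 2 and 10 instances.
aws emr put-managed-scaling-policy \
  --cluster-id j-2EXAMPLE1234 \
  --managed-scaling-policy ComputeLimits='{MinimumCapacityUnits=2,MaximumCapacityUnits=10,UnitType=Instances}'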
18. Describe how Amazon EMR can be integrated with Amazon S3, and discuss the
benefits of storing input and output data in S3.
Amazon EMR integrates with Amazon S3 through the use of EMR File System (EMRFS),
which allows EMR clusters to access and process data stored in S3. This integration enables
seamless data transfer between EMR and S3, as well as efficient querying using tools like
Hive and Spark.
Storing input and output data in S3 offers several benefits:
1. Durability: S3 provides 99.999999999% durability, ensuring data reliability.
2. Scalability: S3 can store unlimited amounts of data, allowing for growth
without capacity constraints.
3. Cost-effectiveness: Pay-as-you-go pricing model reduces storage costs
compared to traditional Hadoop Distributed File System (HDFS).
4. Data accessibility: S3 data is accessible from multiple EMR clusters or
other AWS services, enabling parallel processing and reducing data silos.
5. Decoupling storage and compute: Separating storage (S3) from compute
resources (EMR) allows independent scaling and optimization of each
component.
6. Simplified data management: Lifecycle policies and versioning features in
S3 help manage data efficiently over time.
19. Explain the process of migrating an on-premises Hadoop cluster to Amazon EMR.
What are the key steps and considerations when performing this migration?
Migrating an on-premises Hadoop cluster to Amazon EMR involves several key steps and
considerations:
1. Assess the current Hadoop environment, including data size, processing requirements, and
dependencies.
2. Choose appropriate Amazon EMR instance types based on resource needs and cost
optimization.
3. Set up necessary AWS services such as S3 for storage, IAM for access control, and VPC
for networking.
4. Modify existing Hadoop applications to work with EMR, considering differences in file
systems (HDFS vs. S3) and APIs.
5. Migrate data from on-premises storage to S3 using tools like AWS Snowball or DataSync
for efficient transfer.
6. Test migrated applications on EMR, ensuring correctness and performance meet
expectations.
7. Plan a cutover strategy, minimizing downtime during migration.
Key considerations include:
– Ensuring data security and compliance during migration
– Optimizing costs by selecting suitable instances and leveraging spot pricing
– Monitoring and managing EMR clusters effectively
20. How can Amazon EMR be used for data warehousing and data analytics workloads?
Discuss some use cases and architectural patterns.
Amazon EMR is a managed Hadoop framework that simplifies running big data workloads,
enabling data warehousing and analytics. It supports various distributed processing engines
like Apache Spark, Hive, and Presto.
Use cases:
1. Log analysis: Analyze web server logs to gain insights into user behavior and improve
website performance.
2. ETL processing: Extract, transform, and load large datasets from multiple sources for
further analysis or storage in Amazon S3 or Redshift.
3. Machine learning: Train ML models on vast amounts of data using libraries like
TensorFlow or PyTorch.
4. Real-time analytics: Process streaming data with low-latency requirements using Apache
Flink or Kafka.
Architectural patterns:
1. Decoupling storage and compute: Store raw data in Amazon S3 and use EMR clusters for
processing, allowing independent scaling of storage and compute resources.
2. Data lake architecture: Ingest structured and unstructured data into an S3-based data lake,
then process and analyze it using EMR.
3. Lambda architecture: Combine batch and real-time processing by using EMR for batch
layer and Kinesis/Flink for the speed layer, merging results for final output.
4. Federated querying: Use EMR with Amazon Athena or Redshift Spectrum to query data
across different storage systems without moving it.
21. Discuss the best practices for cost optimization in Amazon EMR. What are the different
pricing models and billing options available?
To optimize costs in Amazon EMR, consider the following best practices:
1. Choose appropriate instance types: Select instances that provide optimal performance for
your workload at the lowest cost.
2. Use Spot Instances: Leverage Spot Instances to save up to 90% compared to On-Demand
pricing.
3. Utilize Reserved Instances: Purchase RIs for long-term workloads to achieve significant
savings over On-Demand rates.
4. Optimize cluster size: Scale clusters based on demand and use auto-scaling policies to
minimize costs.
5. Compress data: Reduce storage and processing costs by compressing input/output data.
6. Monitor usage: Track resource utilization with CloudWatch metrics and adjust
configurations accordingly.
Amazon EMR offers three pricing models:
1. On-Demand Instances: Pay-as-you-go model without upfront commitment.
2. Reserved Instances (RIs): Commitment-based model offering discounts for 1 or 3-year
terms.
3. Spot Instances: Bid-based model allowing you to purchase unused capacity at a lower
price.
Billing options include:
1. Per-second billing: Charges are calculated per second of usage.
2. Savings Plans: Flexible plans providing discounts for consistent usage across AWS
services.
22. Explain the role of AWS Identity and Access Management (IAM) policies in controlling
access to Amazon EMR resources.
AWS Identity and Access Management (IAM) policies play a crucial role in controlling
access to Amazon EMR resources by defining permissions for users, groups, and roles. These
policies determine the actions that can be performed on specific resources within an AWS
account.
IAM policies are JSON documents containing statements with elements like Effect, Action,
Resource, and Condition. The Effect specifies whether to allow or deny access, while Action
lists the operations allowed or denied. Resource identifies the target resource(s), and
Condition defines any constraints applied to the policy.
In the context of Amazon EMR, IAM policies help manage access to clusters, instances, and
other related services such as S3 buckets and EC2 instances. For example, you can create a
policy allowing certain users to launch EMR clusters but restrict them from terminating
existing ones.
Additionally, IAM policies can be used to control access to EMRFS data stored in S3,
ensuring only authorized users have read/write access to specific paths. This is achieved
through the use of EMRFS authorization rules, which map IAM policies to Hadoop user
accounts.
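A hedged sketch of the "can launch but cannot terminate" idea mentioned above, attached to a user with the AWS CLI; the user name, policy name, and action list are illustrative only.
# Sketch: allow launching and inspecting EMR clusters, explicitly deny termination.
cat > emr-launch-only.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow",
     "Action": ["elasticmapreduce:RunJobFlow",
                "elasticmapreduce:DescribeCluster",
                "elasticmapreduce:ListClusters",
                "elasticmapreduce:AddJobFlowSteps"],
     "Resource": "*"},
    {"Effect": "Deny",
     "Action": "elasticmapreduce:TerminateJobFlows",
     "Resource": "*"}
  ]
}
EOF
# In practice, launching clusters also needs iam:PassRole on the EMR service and instance roles.
aws iam put-user-policy --user-name data-engineer \
  --policy-name EMRLaunchOnly --policy-document file://emr-launch-only.json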
23. How can you use Amazon EMR and Lambda functions together for real-time data
processing? Provide an example use case.
Amazon EMR and Lambda functions can be used together for real-time data processing by
leveraging the strengths of both services. EMR is ideal for large-scale, distributed data
processing tasks, while Lambda excels at handling event-driven architectures with low
latency requirements.
In a typical use case, you could set up an Amazon Kinesis Data Stream to ingest real-time
data from various sources like IoT devices or social media feeds. Then, create a Lambda
function that processes incoming records in the stream and transforms them as needed. This
transformed data can then be written back to another Kinesis Data Stream or directly into an
S3 bucket.
Next, configure an EMR cluster to consume the processed data from the second Kinesis Data
Stream or S3 bucket. The EMR cluster can run Spark Streaming jobs to perform further
analysis, aggregation, or machine learning on the data. Finally, store the results in a persistent
storage system such as Amazon Redshift or DynamoDB for querying and visualization
purposes.
This architecture enables real-time data processing using Lambda functions for initial
transformations and EMR for more complex analytics, providing scalability and flexibility in
handling diverse workloads.
24. How to configure encryption options for data at rest and data in transit in Amazon
EMR?
To configure encryption options for data at rest and in transit in Amazon EMR, follow these
steps:
1. Enable server-side encryption (SSE) for S3 using AWS Key Management Service (KMS)
or SSE-S3 to protect data stored in input/output buckets.
2. Use HDFS Transparent Data Encryption (TDE) with KMS to encrypt intermediate data on
cluster nodes.
3. Configure local disk encryption by enabling LUKS (Linux Unified Key Setup) on the
instance store volumes of your EMR cluster instances.
4. For data in transit, enable TLS/SSL encryption for applications like Spark, Hive, and
Presto by setting up security configurations in EMR.
5. Create a security configuration JSON file specifying encryption settings for each
component (e.g., S3, HDFS, Local Disk, and TLS).
6. When creating an EMR cluster, specify the created security configuration using the
--security-configuration flag.
Example:
aws emr create-security-configuration --name MySecurityConfig --security-configuration '{
"EncryptionConfiguration": {
"AtRestEncryptionConfiguration": {...},
"InTransitEncryptionConfiguration": {...}
}
}'
aws emr create-cluster ... --security-configuration MySecurityConfig
25. Explain the process of creating and deploying Docker containers in an Amazon EMR
cluster. What benefits does containerization bring to the EMR environment?
To create and deploy Docker containers in an Amazon EMR cluster, follow these steps:
1. Set up the EMR cluster with a custom bootstrap action to install Docker.
2. Create a Dockerfile defining your container’s environment, dependencies, and application
code.
3. Build the Docker image using the docker build command and push it to a container
registry like Amazon ECR or Docker Hub.
4. Configure EMR step(s) to pull the Docker image and run the containerized application
using the docker run command.
Containerization benefits in EMR environment include:
– Isolation: Containers encapsulate applications and their dependencies, ensuring consistent
execution across environments.
– Versioning: Container images can be versioned, allowing easy rollback to previous versions
if needed.
– Portability: Containers can run on any platform supporting Docker, simplifying migration
between environments.
– Resource Efficiency: Containers share underlying OS resources, reducing resource
overhead compared to running separate VMs for each application.
– Scalability: Containers can be easily scaled horizontally by adding more instances to handle
increased workloads.
RELATIONAL DATABASE SERVICES (RDS)
What is DATABASE?
A database is a systematic collection of data. Databases support the storage and manipulation of data.
e.g.: Facebook, telecom companies, amazon.com
What is DBMS?
A DBMS is a collection of programs that enables its users to access a database, manipulate data,
and report/represent data.
Types of DBMS
1. Hierarchical
2. Network
3. Relational
4. Object oriented
Relational Database:
A relational database is a data store that organizes data into tables and allows you to link
information across different tables and different types of data.
Tables are related to each other.
All fields must be filled.
Best suited for OLTP (online transaction processing).
Relational DBs: MySQL, Oracle, Microsoft SQL Server, IBM DB2.
A row of a table is also called a record. It contains the specific information of each individual
entry in the table.
Each table has its own primary key.
A schema (the design of the database) is used to strictly define tables, columns, indexes, and the
relations between tables.
Relational DBs are usually used in enterprise applications/scenarios. An exception is MySQL,
which is also widely used for web applications.
Common applications for MySQL include PHP- and Java-based web applications that require a
database storage backend, e.g. Joomla.
Relational databases cannot scale out horizontally; they scale vertically.
Virtually all relational DBs use SQL.
1. Columnar Database:
A columnar database is a DBMS that stores data in columns instead of rows.
In a columnar DB, all the column-1 values are physically stored together, followed by all the
column-2 values.
In a row-oriented DBMS the data would be stored like this:
(1, bob, 30, 8000; 2, arun, 26, 4000; 3, vian, 39, 2000)
In a column-based DBMS the data would be stored like this:
(1, 2, 3; bob, arun, vian; 30, 26, 39; 8000, 4000, 2000)
The benefit is that because a column-based DBMS is self-indexing, it uses less disk space than a
row-oriented RDBMS containing the same data, and it easily performs operations like min, max,
and average.
2. Document Database:
Document DBs make it easier for developers to store and query data in a DB by using the
same document model format they use in their application code.
Document DBs are efficient for storing catalogues.
They store semi-structured data as documents, typically in JSON or XML format.
A document database is a great choice for content management applications such as blogs and
video platforms.
3. Key-Value Database:
A key-value DB is a simple DB that uses an associative array (like a dictionary) as its
fundamental model, where each key is associated with one and only one value in a collection.
It allows horizontal scaling.
Use cases: shopping carts, and session stores in apps like Facebook and Twitter.
They improve application performance by storing critical pieces of data in memory for low-
latency access.
Amazon ElastiCache is an in-memory key-value store.
RDS Limits:
Up to 40 DB instances per account.
10 of these 40 can be Oracle or MS SQL Server under the license-included model.
Or
Under the BYOL (bring your own license) model, all 40 can be any DB engine you need.
DB Instance Size:
a. Standard class
b. Memory-Optimized class
c. Burstable class
What is Multi-AZ in RDS:
You can select the Multi-AZ option during RDS DB instance launch.
The RDS service creates a standby instance in a different AZ in the same region and configures
“synchronous replication” between the primary and the standby.
You cannot read/write to the standby RDS DB instance.
You cannot select which AZ in the region will be chosen to create the standby DB instance.
You can, however, view which AZ was selected after the standby is created.
Depending on the instance class, it may take one to a few minutes to fail over to the standby
instance.
AWS recommends the use of Provisioned IOPS instances for Multi-AZ RDS instances.
A DB instance reboot is required for changes to take effect when you change the DB
parameter group or when you change a static DB parameter.
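A hedged sketch of launching a Multi-AZ MySQL instance from the AWS CLI; the identifier, instance class, storage size, and password are placeholders.
# Sketch: Multi-AZ deployment; RDS creates the synchronous standby in another AZ automatically.
aws rds create-db-instance \
  --db-instance-identifier mydb \
  --engine mysql \
  --db-instance-class db.m6g.large \
  --allocated-storage 50 \
  --master-username admin \
  --master-user-password 'ChangeMe123!' \
  --multi-az \
  --backup-retention-period 7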
There are two methods to back up and restore your RDS DB instances:
1. AWS RDS automated backups
2. User-initiated manual backups
You can take a backup of the entire DB instance or just the DB.
You can create and restore volume snapshots of your entire DB instance.
Automated backups by AWS back up your DB data to multiple AZs to provide data
durability.
Select automated backup in the AWS console.
Backups are stored in Amazon S3.
Multi-AZ automated backups will be taken from the standby instance.
The DB instance must be in the “ACTIVE” state for automated backup.
RDS automatically backs up the DB instance daily by creating storage volume snapshots
of your DB instance (full daily snapshots), including the DB transaction logs.
You can decide when you would like the backup to be taken (backup window).
There is no additional charge for RDS backing up your DB instance.
For Multi-AZ deployments, backups are taken from the standby DB instance (true for
MariaDB, MySQL, Oracle, PostgreSQL).
Automated backups are deleted when you delete your RDS DB instance.
An outage occurs if you change the backup retention period from zero to a non-zero value or
the other way around.
The retention period of automated backups is 7 days by default via the AWS console.
AWS Aurora is an exception; its default is 1 day.
Via CLI or API, the default is 1 day.
You can increase it up to 35 days.
If you don’t want backups, put zero “0” in the retention period.
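For example, the retention period can be changed from the CLI roughly as follows; remember from above that moving between zero and a non-zero value causes an outage.
# Sketch: extend automated backup retention to 14 days and set a backup window.
aws rds modify-db-instance \
  --db-instance-identifier mydb \
  --backup-retention-period 14 \
  --preferred-backup-window 02:00-03:00 \
  --apply-immediately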
In the case of a manual snapshot, point-in-time recovery is not possible.
Manual snapshots are also stored in S3.
They are not deleted automatically if you delete the RDS instance.
Take a final snapshot before deleting your RDS DB instance.
You can share a manual snapshot directly with another AWS account.
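A hedged sketch of taking a manual snapshot and sharing it with another account (the identifiers and target account ID are placeholders):
# Sketch: create a manual snapshot, then allow another account to restore from it.
aws rds create-db-snapshot \
  --db-instance-identifier mydb \
  --db-snapshot-identifier mydb-final-snapshot
aws rds modify-db-snapshot-attribute \
  --db-snapshot-identifier mydb-final-snapshot \
  --attribute-name restore \
  --values-to-add 210987654321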
When you restore a DB instance, only the default DB parameters and security groups are
associated with the restored instance.
You cannot restore a DB snapshot into an existing DB instance; instead, RDS creates a new
DB instance with a new endpoint.
Restoring from a backup or a DB snapshot therefore changes the RDS instance endpoint.
At the time of restoring, you can change the storage type (General Purpose or Provisioned IOPS).
You cannot encrypt an existing unencrypted DB instance.
To do that you need to create a new encrypted instance and migrate your data to it (from
unencrypted to encrypted), or you can restore from a backup/snapshot into a new encrypted RDS
instance.
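A hedged sketch of that snapshot route for encrypting an existing unencrypted instance; the snapshot names and KMS key alias are placeholders.
# Sketch: snapshot -> encrypted copy -> restore as a new, encrypted instance.
aws rds create-db-snapshot \
  --db-instance-identifier mydb \
  --db-snapshot-identifier mydb-unencrypted-snap
aws rds wait db-snapshot-available --db-snapshot-identifier mydb-unencrypted-snap
aws rds copy-db-snapshot \
  --source-db-snapshot-identifier mydb-unencrypted-snap \
  --target-db-snapshot-identifier mydb-encrypted-snap \
  --kms-key-id alias/my-rds-key
aws rds wait db-snapshot-available --db-snapshot-identifier mydb-encrypted-snap
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier mydb-encrypted \
  --db-snapshot-identifier mydb-encrypted-snap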
RDS supports encryption-at-rest for all DB engines using KMS.
What actually gets encrypted with encryption-at-rest:
a. All its snapshots.
b. Backups of the DB (S3 storage).
c. Data on the EBS volumes.
d. Read replicas created from the snapshots.
AMAZON REDSHIFT
Redshift Architecture Component 1: Leader Node
The Leader Node in an Amazon Redshift Cluster manages all external and internal
communication. It is responsible for preparing query execution plans whenever a
query is submitted to the Cluster. Once the query execution plan is ready, the Leader
Node distributes the query execution code to the Compute Nodes and assigns Slices
of data to each Compute Node for computation of results.
The Leader Node distributes query load to the Compute Nodes only when the query involves
accessing data stored on the Compute Nodes. Otherwise, the query is executed on the
Leader Node itself. There are several functions in the Redshift Architecture that are
always executed on the Leader Node; see SQL Functions Supported on the Leader Node
in the Redshift documentation for more information on these functions.
Redshift Architecture Component 2: Compute Nodes
Compute Nodes are responsible for the actual execution of queries and have data stored with
them. They execute queries and return intermediate results to the Leader Node which further
aggregates the results.
Dense Storage (DS): Dense Storage Nodes allow you to create large Data
Warehouses using Hard Disk Drives (HDDs) for a low price point.
Dense Compute (DC): Dense Compute nodes allow you to create high-performance
Data Warehouses using Solid-State Drives (SSDs).
Redshift Architecture Component 3: Node Slices
A Compute Node consists of Slices. Each Slice has a portion of Compute Node’s memory
and disk assigned to it where it performs query operations. The Leader Node is responsible
for assigning a query code and data to a slice for execution. Slices once assigned query load
work in parallel to generate query results.
Data is distributed among the Slices on the basis of the Distribution Style and Distribution
Key of a particular table. An even distribution of data enables Redshift to assign workload
evenly to Slices and maximizes the benefit of parallel processing.
The number of Slices per Compute Node is decided on the basis of the node type; see the
Redshift documentation on Clusters and Nodes for more information.
Amazon Redshift Architecture allows it to use Massively Parallel Processing (MPP) for fast
processing even for the most complex queries and a huge amount of data set. Multiple
compute nodes execute the same query code on portions of data to maximize Parallel
Processing.
Data in the Amazon Redshift Data Warehouse is stored in a Columnar fashion which
drastically reduces the I/O on disks. Columnar storage reduces the number of disk I/O
requests and minimizes the amount of data loaded into the memory to execute a query.
Reduction in I/O speeds up query execution and loading less data means Redshift can
perform more in-memory processing.
Redshift uses Sort Keys to sort columns and filter out chunks of data while
executing queries. Sort Keys are covered in more detail later in these notes.
Redshift Architecture Component 6: Data Compression
Data compression is one of the important factors in ensuring query performance. It reduces
the storage footprint and enables the loading of large amounts of data in the memory fast.
Owing to columnar storage, Redshift can apply compression encodings adapted to each
column's data type.
Redshift’s Query Optimizer generates query plans that are MPP-aware and take advantage
of columnar data storage. The Query Optimizer uses analyzed information about tables to
generate efficient query plans for execution; running ANALYZE regularly (covered later in
these notes) keeps that information up to date.
Amazon Redshift provides private and high-speed network communication between leader
node and compute nodes by leveraging high-bandwidth network connections and custom
communication protocols. The compute nodes run on an isolated network that can never be
accessed directly by Client Applications.
Performance Features:
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the
cloud. It is designed to analyze large datasets using SQL queries and provides high-
performance analysis of structured data. Here are some key features of Amazon Redshift:
1. Columnar Storage:
Amazon Redshift stores data in a columnar format, which is more efficient for analytical
queries. This allows for high compression rates and faster query performance.
2. Massively Parallel Processing (MPP):
Redshift uses a massively parallel processing architecture, distributing the workload across
multiple nodes to enable parallel query execution. This helps to handle large datasets and
complex queries efficiently.
3. Scalability:
Redshift is scalable, allowing you to easily add or remove nodes to accommodate changing
workloads. You can scale the cluster up or down based on your performance and storage
requirements.
4. Automated Backup and Restore:
Redshift automatically takes incremental snapshots of your data to enable point-in-time
recovery. You can also create manual snapshots for data backup and archiving purposes.
5. Data Compression:
Redshift uses various compression techniques to minimize storage requirements and improve
query performance. This includes columnar storage, run-length encoding, and dictionary
encoding.
6. Advanced Query Optimization:
Redshift provides features such as zone maps, query rewrites, and automatic table
optimization to enhance query performance. The query optimizer analyzes and chooses the
most efficient query plan for execution.
7. Security:
Amazon Redshift offers several security features, including encryption of data in transit and
at rest, Virtual Private Cloud (VPC) integration, AWS Identity and Access Management
(IAM) for access control, and support for Virtual Private Network (VPN) and Direct
Connect.
8. Integration with AWS Ecosystem:
Redshift seamlessly integrates with other AWS services, allowing you to easily transfer data
to and from services like Amazon S3, AWS Glue, and Amazon EMR for data processing and
analytics.
9. Concurrency and Workload Management:
Redshift provides robust concurrency controls and workload management features, allowing
you to manage multiple concurrent queries efficiently. You can set query queues and allocate
resources based on your specific workload requirements.
10. Monitoring and Logging:
AWS Redshift provides tools for monitoring and logging, including Amazon CloudWatch
metrics, query and performance logging, and detailed system views. This helps you monitor
the health and performance of your data warehouse.
11. Data Loading Options:
Redshift supports various data loading options, including direct data injection from Amazon
S3, data streaming, and bulk data loading using the COPY command.
These features collectively make Amazon Redshift a powerful and flexible data warehousing
solution for businesses dealing with large-scale analytical workloads in the cloud.
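For reference, a hedged sketch of provisioning a small Redshift cluster from the AWS CLI; the identifier, node type, and credentials are placeholders.
# Sketch: two-node RA3 cluster with a default database named dev.
aws redshift create-cluster \
  --cluster-identifier analytics-cluster \
  --node-type ra3.xlplus \
  --number-of-nodes 2 \
  --master-username awsadmin \
  --master-user-password 'ChangeMe123!' \
  --db-name dev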
4]Data Security and Protection:
#Data Security:
-SSL (Secure Sockets Layer) encryption
-TLS (Transport Layer Security) encryption,
#Data Protection:
Data encryption:
1] Server-side encryption
2] Client-side encryption
Encryption at rest : AES-256 encryption
Encryption in Transit : SSL/TLS encryption
What is Workgroup?
-collection of compute resources from which an endpoint is created
-compute-related Container
-groups together compute resources like RPUs(Redshift Processing Units), VPC
subnet groups, security groups
What is Namespace?
-namespace is a collection of database objects and users
-storage-related
-groups together schemas, tables, users, or AWS Key Management Service keys for
encrypting data
==>> When using Redshift Serverless, you provision a Workgroup for compute
resources and a Namespace for storage resources.
You can create one or more namespaces and workgroups. Each namespace can have only one
workgroup associated with it. Conversely, each workgroup can be associated with only one
namespace.
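A hedged sketch of provisioning both from the AWS CLI; the names, credentials, and RPU base capacity are placeholders, and parameter names should be checked against the current redshift-serverless CLI reference.
# Sketch: storage (namespace) first, then compute (workgroup) bound to it.
aws redshift-serverless create-namespace \
  --namespace-name analytics-ns \
  --db-name dev \
  --admin-username awsadmin \
  --admin-user-password 'ChangeMe123!'
aws redshift-serverless create-workgroup \
  --workgroup-name analytics-wg \
  --namespace-name analytics-ns \
  --base-capacity 32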
================================================================
Evolution of Data Processing Frameworks:
1] ETL:
-Extract, Transform on Spark, then Load into the traditional DW
2] ELT:
-Extract, Load into the DW
-Transform inside the modern DW
3] EtLT:
-t --> a light transform on Spark: schema conversion, column truncation
-Load into the warehouse
-Transform in the modern DW: aggregation-related work
UPSERT operation --> Update + Insert (update the row if it already exists, otherwise insert it)
High Cardinality:
1. Unique Values:
High cardinality columns have a large number of distinct values.
Examples could include columns like "user_id" or "email_address" where each row has a
unique value.
2. Indexes:
High cardinality columns are good candidates for indexing because they provide efficient
access to specific rows in a table.
3. Filtering:
When querying on high cardinality columns, the database engine can quickly filter down the
result set as each value is likely to be unique.
4. Join Conditions:
High cardinality columns are often used in join conditions between tables.
5. Storage:
Storing high cardinality columns efficiently may require more storage space.
Low Cardinality:
Low cardinality columns have only a few distinct values relative to the number of rows.
EX: a "product" column in a table of 1000 records that contains only 10 unique values.
---------------------------------------------------------------------------------------------------------------
Query result caching(result set caching): involves storing the actual result sets of
executed queries for later retrieval
Query caching:
-involves caching the execution plan or metadata associated with a query, rather
than the actual data
-helps save on planning and optimization time
1] Copy Command ::
COPY table_name [column_list]
FROM data_source
[options]
explanation:
o table_name: name of the target table where you want to load data
o column_list: (Optional)
-A comma-separated list of columns in the target table.
-If not specified, Redshift assumes that the column order in the source
file matches the order of columns in the target table
o data_source:
-Specifies the source of the data you want to copy
-This can be an Amazon S3 bucket, an Amazon EMR cluster, a data
file on your local file system, or a remote host using SSH
1]FORMAT format_type: Specifies the file format of the source data. Supported formats
include CSV, JSON, AVRO, PARQUET, ORC, and more.
2]DELIMITER 'delimiter': Specifies the field delimiter used in the source data.
3]IGNOREHEADER n: Specifies the number of header lines to skip in the source data.
4]FILLRECORD: Adds null columns to match the target table's column count if the source
data has missing columns.
5]ENCRYPTED: Indicates that the source data is encrypted.
6]MAXERROR n: Sets the maximum number of allowed data load errors before the COPY
operation fails
7]CREDENTIALS 'aws_access_key_id=access-key-id;aws_secret_access_key=secret-
access-key': Specifies AWS access credentials when loading data from Amazon S3.
(Specifying an IAM_ROLE is the usual alternative to embedding access keys; psycopg is a
PostgreSQL driver often used to connect to Redshift from Python, since Redshift is
PostgreSQL-based.)
8]COMPUPDATE ON|OFF: Specifies whether to recalculate table statistics after the COPY
operation
9]GZIP: Indicates that the source data is in GZIP compressed format
10]TRUNCATECOLUMNS: Truncates data that exceeds the column length in the target table
11]REMOVEQUOTES: Removes surrounding quotation marks from data fields
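Putting the options above together, a hedged COPY example issued through the Redshift Data API from the CLI; the table, bucket, role ARN, and workgroup are placeholders, and IAM_ROLE is used here as the usual alternative to the CREDENTIALS access keys in option 7.
# Sketch: load gzipped CSV files from S3 into the sales table.
aws redshift-data execute-statement \
  --workgroup-name analytics-wg \
  --database dev \
  --sql "COPY sales FROM 's3://my-bucket/sales/' IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' FORMAT AS CSV IGNOREHEADER 1 GZIP;"
# For a provisioned cluster, use --cluster-identifier and --db-user instead of --workgroup-name.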
2]Redshift Spectrum:
-a feature of Amazon Redshift that enables you to run SQL queries directly against
data stored in your Amazon S3 buckets
-extends the data warehousing capabilities of Redshift by allowing you to analyze and
join data from multiple sources, both within your Redshift data warehouse and external data
stored in Amazon S3
#Features:
1]Data in Amazon S3
2]External Tables
-not stored in your Redshift cluster but act as metadata for querying the data in S3
3]Querying :
-can run SQL queries that join your internal Redshift tables with your external S3 data
4]Performance:
-optimizes query performance by pushing down filters to the S3 data, minimizing data
movement
4) Distribution Key:
-determines how a table's rows are distributed across the node slices (DISTKEY column)
create table your_table (
column1 int,
column2 varchar(50),
distribution_key_column int distkey);
5]Sort Keys ::
1)Compound :
-composed of one or more columns
-Data is initially sorted based on the first column in the sort key and
then within each of those groups, it is further sorted based on the second column and so on
-known access patterns that frequently filter, join, or aggregate data
based on multiple columns in a predictable order
-DDL:
create table your_table (
column1 int,
column2 varchar(50),
column3 date)
sortkey (column1, column3);
2)Interleaved ::
-also composed of one or more columns
-doesn't prioritize one column over the others ==>it interleaves the
data across all columns in the sort key evenly
-can help improve query performance for tables with unpredictable
query patterns(varying filtering and grouping conditions)
-DDL:
create table your_table (
column1 int,
column2 varchar(50),
column3 date)
interleaved sortkey (column1, column2, column3);
6]Redshift Workload Management(WLM):
-WLM helps you manage and prioritize queries in your Redshift cluster
-ensuring that different workloads and queries can coexist and perform
efficiently in a multi-user environment
-enables you to allocate resources, control concurrency, and manage query
performance by defining query queues and assigning query groups
1]VACUUM Command:
-used to optimize and reclaim storage space in database tables
-two main types:
1]VACUUM:
-reclaims space and re-sorts rows in the specified table
-when DML happens, Redshift does not immediately release the space ==> this can lead to
fragmented storage and decreased query performance
-running a VACUUM on a table consolidates the data, removes deleted rows, and sorts the
remaining rows
-Syntax: VACUUM table_name;
2]VACUUM FULL:
-performs a more aggressive vacuum operation
-it can be more resource-intensive and time-consuming
-Syntax: VACUUM FULL table_name;
2]Analyze Command :
-used to update and refresh statistics about the data in database tables
-these statistics are essential for the query planner to make informed
decisions about query execution plans
-Syntax : ANALYZE my_table;
-important to regularly run ANALYZE on your tables after significant
data changes, to ensure that the query planner has the most accurate information for
optimizing query performance
##Amazon Redshift best practices for designing tables::
1.Choose the best sort key
2.Choose the best distribution style
3.Use automatic compression
4.Define primary key and foreign key constraints between tables wherever
appropriate. Even though they are informational only, the query optimizer uses those
constraints to generate more efficient query plans
5. Use smallest possible column size
4. Why use an AWS Data Pipeline to load CSV into Redshift? And How?
AWS Data Pipeline facilitates the extraction and loading of CSV(Comma Separated Values)
files. Using AWS Data Pipelines for CSV loading eliminates the stress of putting together a
complex ETL system. It offers template activities to perform DML(data manipulation) tasks
efficiently.
To load the CSV file, we must copy the CSV data from the host source and paste that into
Redshift via RedshiftCopyActivity.
14. What is Redshift Spectrum? What data formats does Redshift Spectrum support?
Redshift Spectrum was released by AWS (Amazon Web Services) as a companion to Amazon
Redshift. It lets you run SQL queries directly against data stored in Amazon Simple Storage
Service (Amazon S3), i.e. against a data lake. Redshift Spectrum can process gigabytes to
exabytes of structured and semi-structured data in Amazon S3, and no ETL or loading is
required in this process; the Redshift query planner produces and optimizes the query plan,
and Spectrum handles the scan of the data in S3. Redshift Spectrum supports various
structured and semi-structured data formats, including AVRO, TEXTFILE, RCFILE,
PARQUET, SEQUENCEFILE, RegexSerDe, JSON, Grok, Ion, and ORC. Amazon suggests
using columnar data formats like Apache Parquet to improve performance and reduce cost.
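A minimal sketch of a typical Spectrum setup, assuming the AWS Glue Data Catalog and placeholder names for the schema, database, IAM role, and S3 location:
create external schema spectrum_demo
from data catalog
database 'spectrum_db'
iam_role 'arn:aws:iam::123456789012:role/my-spectrum-role'
create external database if not exists;

create external table spectrum_demo.sales (
sale_id   bigint,
sale_date date,
amount    decimal(12,2))
stored as parquet
location 's3://my-bucket/sales/';

select sale_date, sum(amount)
from spectrum_demo.sales
group by sale_date;             -- runs against the S3 data, nothing is loaded into Redshift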
16. What are the key differences between SQL Server and Amazon Redshift?
Here are some key differences between SQL Server and Amazon Redshift:
SQL Server is a traditional, on-premises or cloud-based database management system with a
relational data model and a powerful database engine. It is fast and flexible, but can be
expensive to scale and may not integrate well with other AWS services.
Amazon Redshift is a fully managed, cloud-based data warehousing service with a relational
data model and a PostgreSQL-based database engine. It is designed for fast query
performance and low cost, and has strong integration with other AWS services such as
Amazon S3 and Amazon EMR. However, it may not have all of the features and capabilities of a
general-purpose relational database such as SQL Server.
18. What is a data warehouse and how does AWS Redshift help?
A data warehouse is a central repository where data generated by the organization's systems
and other sources is collected and processed for analysis.
A high-level data warehouse has three-tier architecture:
1. In the bottom tier, we have the tools which cleanse and collect the data.
2. In the middle level, we have tools to transform the data using the Online Analytical
Processing Server.
3. At the top level, we have different tools where data analysis and data mining are carried out
at the front end.
As data grows continuously in an organization, the company constantly has to upgrade its
expensive storage servers. AWS Redshift addresses this with a cloud-based data warehouse
offered by Amazon where businesses can store and analyze their data.
19. Is there any support for time zones in Redshift while storing data?
Time zones are not stored by Redshift.
All timestamp data is stored without time zone information and is assumed to be UTC.
When you insert a value into a TIMESTAMPTZ column, for example, the time zone offset is
applied to the timestamp to convert it to UTC, and the converted timestamp is saved. The
original time zone information is not kept.
If you wish to keep track of the time zone, you need an extra column to store it (see the
sketch below).
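A small sketch of that pattern, with assumed table and column names:
create table events (
event_id    bigint,
occurred_at timestamptz,        -- converted to UTC on insert
source_tz   varchar(40));       -- keep the original time zone separately

insert into events values
(1, '2024-03-01 10:00:00+05:30', 'Asia/Kolkata');

select event_id, occurred_at, source_tz
from events;                    -- occurred_at comes back as the UTC-normalized value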
24. What are clusters in Redshift? How do I create and delete a cluster in AWS
redshift?
Computing resources in an Amazon Redshift data warehouse are called nodes, which are
grouped together into a cluster.
Each cluster runs the Amazon Redshift engine and contains at least one database.
To create a cluster, follow these steps: –
Open the Amazon Redshift console.
Select the AWS Region you want to use from the navigation bar.
Choose Clusters in the navigation panel, then choose Create cluster on the Clusters page.
Finally, provide a cluster identifier, choose the node type and number of nodes, set the admin
user name and password, and choose Create cluster.
To delete a cluster in AWS, follow these steps: –
Open the Amazon Redshift console.
In the navigation panel, select the cluster that you want to remove.
On the Configuration tab of the cluster details page, choose Cluster and then choose the
Delete option.
In the Delete Cluster dialog box, complete one of the following final steps:
Choose YES to create a final snapshot before deleting the cluster, give the snapshot a name,
and then choose Delete.
Or choose NO to delete the cluster without taking a final snapshot and then choose Delete.
26. How do you query Amazon Redshift to show your table data?
Below is the command to list tables in the public schema :
SELECT DISTINCT tablename
FROM pg_table_def
WHERE schemaname = 'public'
ORDER BY tablename;
Below is the command to describe the columns of a table called employee :
SELECT *
FROM pg_table_def
WHERE tablename = 'employee' AND schemaname = 'public';
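Because pg_table_def only shows tables in schemas that are on the search_path, an alternative is the standard information_schema view (the schema name is assumed to be public):
select table_name
from information_schema.tables
where table_schema = 'public'
order by table_name;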
27. Why should I use Amazon Redshift over an on-premises data warehouse?
On-premises data warehouses require a considerable amount of time and resources to
manage, especially for large datasets. In addition, the financial costs of constructing,
maintaining, and scaling self-managed on-site data warehouses are very high.
As your data expands, you must continuously decide what data to load into your data
warehouse and what data to archive elsewhere in order to control costs, keep ETL complexity
low, and deliver good performance. Amazon Redshift not only greatly decreases the expense and
operating overhead of a data warehouse, but with Redshift Spectrum it also makes it easy to
analyze vast volumes of data in its native format without forcing you to load it.