AWS Notes-1
Cloud computing refers to the delivery of computing services, including servers, storage,
networking, databases, analytics, software, and intelligence, over the internet to offer faster
innovation, flexible resources, and economies of scale. In essence, cloud computing allows
individuals and businesses to access and use computing resources hosted on remote servers
instead of relying on their own local servers or personal computers.
On-Demand Self-Service: Users can provision and manage computing resources, such as
server time and network storage, as needed without requiring human intervention from the
service provider.
Broad Network Access: Cloud services are available over the internet and can be accessed
by users through various devices (e.g., laptops, smartphones, tablets).
Resource Pooling: Cloud providers use multi-tenant models, which means that resources are
pooled and used by multiple customers. The resources are dynamically allocated and
reassigned based on demand.
Rapid Elasticity: Cloud resources can be rapidly and elastically provisioned to quickly scale
up or down based on demand. This allows for flexibility and cost optimization.
Measured Service: Cloud systems automatically control and optimize resource use by
leveraging a metering capability at some level of abstraction. Resources are monitored,
controlled, and reported, providing transparency for both the provider and consumer.
Advantages Of Cloud
Cost-Efficiency: Cloud computing eliminates the need for organizations to invest in and
maintain physical hardware and infrastructure. Instead, they can use cloud services on a pay-
as-you-go or subscription basis, which can lead to significant cost savings. This includes
reduced expenses for hardware, software licenses, maintenance, and energy consumption.
Scalability: Cloud services offer the ability to easily scale resources up or down based on
demand. This allows organizations to quickly adapt to changing workloads, ensuring that
they have the right amount of computing power and storage at any given time. This flexibility
is particularly valuable for businesses with fluctuating resource needs.
Flexibility and Mobility: Cloud computing allows users to access applications and data from
anywhere with an internet connection and on a variety of devices. This provides greater
flexibility for remote work, collaboration among team members, and enables employees to be
productive even when they are not in the office.
Public Cloud: Resources are owned and operated by a third-party cloud service provider.
Available to the general public and multiple organizations.
Examples: Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP).
Private Cloud: Resources are used exclusively by a single organization.
Can be hosted on-premises or by a third-party provider.
Offers greater control, security, and customization.
Commonly used for sensitive data or compliance requirements.
Hybrid Cloud: Combination of public and private clouds, allowing data and applications to
be shared between them.
Provides flexibility to balance performance, security, and cost-efficiency.
Useful for organizations with varying workload demands.
Community Cloud: Shared infrastructure by several organizations with common interests or
requirements.
Built and managed by the organizations themselves or a third-party provider.
Can address specific regulatory, compliance, or security concerns of the community
members.
Types of Cloud: public, private, hybrid, and community (described above).
Maintenance and Upgrades: on-premises, the organization is responsible for all maintenance, updates, and upgrades; in the cloud, the provider handles maintenance and upgrades of the physical infrastructure.
IAM Features:
1. Shared access to your AWS account:
You can grant other people permission to administer and use resources in your AWS account
without having to share your access credentials.
2. Granular permissions:
You can grant different permissions to different people for different resources.
For instance, you can allow some users complete access to EC2, S3, DynamoDB, and Redshift, while for others you can allow read-only access to just some S3 buckets, permission to administer just some EC2 instances, or access to your billing information but nothing else.
3. Secure access to AWS resources for applications that run on Amazon EC2:
You can use IAM features to securely give applications that run on EC2 instances the credentials they need to access other AWS resources, for example S3 buckets and RDS or DynamoDB databases.
4. Identity federation:
You can allow users who already have passwords elsewhere, for example in your corporate network or with an internet identity provider, to get temporary access to your AWS account.
IAM Terms:
Following are the major terms which are used in an IAM account.
1. Principal
2. Request
3. Authentication
4. Authorization
5. Action/Operation
6. Resources
1. Principal:
A principal is a person or application that can make a request for an action or operation on an AWS resource.
Your administrative IAM user is your first principal.
You can allow users and services to assume a role.
IAM users, roles, federated users, and applications are all AWS principals.
2. Request:
When a principal tries to use the AWS Management Console, the AWS API, or the AWS CLI, that principal sends a request to AWS. The request includes the following information:
Actions: that the principal wants to perform.
Resources: upon which the actions are performed.
Principal information: including the environment from which the request was made.
Request context: before AWS can evaluate and authorize a request, AWS gathers the request information into a request context. This includes the principal (the requester), determined from the authentication data, together with the aggregate permissions associated with that principal.
Environment data: such as IP address, user agent, SSL enabled status, or the time of day.
Resource data: data related to the resource that is being requested.
3. Authentication:
A principal sending a request must be authenticated (signed in to AWS) to send a request to AWS.
Some AWS services, like Amazon S3, allow requests from anonymous users; they are the exception to the rule.
To authenticate from the console as a root user, you must sign in with your user name and password.
To authenticate from the API or CLI, you must provide your access key and secret key.
You might also be required to provide additional security information, like MFA (e.g., Google Authenticator).
4. Authorization:
To authorize a request, IAM uses values from the request context to check for matching policies and determine whether to allow or deny the request.
IAM policies are stored in IAM as JSON documents and specify the permissions that are allowed or denied.
5. Actions:
Actions are defined by a service and are the things that you can do to a resource, such as viewing, creating, editing, and deleting that resource.
IAM supports approximately 40 actions for a user resource, including create user, delete user, etc.
Any actions or resources that are not explicitly allowed are denied by default.
After your request has been authenticated and authorized, AWS approves the actions in your
request.
6. Resource:
After AWS approves the actions in your request, those actions are performed on the related resources, such as an EC2 instance, an IAM user, or an S3 bucket.
IAM Identities:
IAM identities are what you create under your AWS account to provide authentication for people, applications, and processes in your AWS account.
An identity represents a user and can be authenticated and then authorized to perform actions in AWS.
Each of these can be associated with one or more policies to determine what actions a user, role, or member of a group can do, with which resources, and under what conditions.
An IAM group is a collection of IAM users.
An IAM role is similar to an IAM user but has no long-term credentials of its own.
A. IAM Users:
An IAM user is an entity that you create in AWS. It represents the person or service that uses the IAM user to interact with AWS.
You can create up to 5 users at a time.
An IAM user can represent an actual person or an application that requires AWS access to perform actions on AWS resources.
A primary use of IAM users is to give people the ability to sign in to the AWS Management Console for interactive tasks and to make programmatic requests to AWS services using the API or CLI.
For any user you can assign:
A username and password to access the AWS console.
An access key ID and secret access key that can be used for programmatic access.
A newly created IAM user has no password and no access key; you need to create them for the user.
Each IAM user is associated with one and only one AWS account.
Users are defined within your account, so users do not pay separately; the bill is paid by the parent account.
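As a rough illustration of these two access paths, here is a minimal boto3 (Python) sketch that creates a user, a console password, and an access key; the user name and password are placeholders, not values from these notes.

    import boto3

    iam = boto3.client("iam")

    # Create a new IAM user (the user name is illustrative).
    iam.create_user(UserName="dev-alice")

    # Console access: set an initial password the user must change at first sign-in.
    iam.create_login_profile(
        UserName="dev-alice",
        Password="TempPassw0rd!",        # placeholder value
        PasswordResetRequired=True,
    )

    # Programmatic access: generate an access key ID and secret access key.
    keys = iam.create_access_key(UserName="dev-alice")
    print(keys["AccessKey"]["AccessKeyId"])
    # The secret access key is returned only once; store it securely.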
B. IAM Groups:
An IAM group is a collection of IAM users. Policies attached to a group apply to all users in the group, which makes it easier to manage permissions for many users at once.
C. IAM Roles:
An IAM role is very similar to a user, in that it is an identity with permission policies that
determine what the identity can and cannot do in AWS.
An IAM role does not have any credentials (password or access key) associated with it.
Instead of being associated with one person, a role is intended to be assumable by anyone who
needs it.
An IAM user can assume a role to temporarily take on different permissions for a specific task.
An IAM role can be assigned to a federated user who signs in by using an external identity provider instead of IAM.
D.IAM Policies:
IAM (Identity and Access Management) policies in the AWS Management Console are a set
of permissions that define what actions are allowed or denied for different AWS resources.
These policies help you control who can do what within your AWS account, whether it's
accessing services, creating or modifying resources, or performing other operations.
IAM policies are written in JSON (JavaScript Object Notation) format and they can be
attached to IAM users, groups, or roles. Policies can be as broad or as specific as needed,
allowing you to grant or restrict access to individual services, actions, or even specific
resources.
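To make the JSON shape concrete, below is a hedged boto3 sketch that creates a small read-only policy and attaches it to a user; the policy name, user name, and bucket ARN are illustrative assumptions.

    import json
    import boto3

    iam = boto3.client("iam")

    # Identity-based policy: read-only access to a single (hypothetical) bucket.
    policy_doc = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-reports-bucket",
                "arn:aws:s3:::example-reports-bucket/*",
            ],
        }],
    }

    # Create a managed policy and attach it to a user.
    resp = iam.create_policy(
        PolicyName="S3ReadOnlyExampleBucket",
        PolicyDocument=json.dumps(policy_doc),
    )
    iam.attach_user_policy(
        UserName="dev-alice",
        PolicyArn=resp["Policy"]["Arn"],
    )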
Difference Between Roles and Policies
Purpose: a role delegates permissions to entities that need to access AWS resources; a policy defines what actions are allowed or denied on which resources.
Ownership: a role is not associated with a specific user or group of users; a policy is associated with specific users, groups, or roles.
Credential Usage: entities assume a role to obtain temporary security credentials; policies are not related to the acquisition of temporary credentials.
Type: a role is a single entity, assumed by other entities for a defined duration; policies are JSON documents that define permissions and can be attached to multiple entities.
To delegate permission to access a resource you create an IAM role that has two policies
attached.
i. The Trust Policy
ii. The Permission Policy
The trusted entity is included in the policy as the principal element in the document.
When you create a trust policy, you cannot specify a wildcard (*) as a principal.
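The two policies can be sketched with boto3 as follows; the role name, policy name, and bucket ARN are made up for illustration, and the trust policy's Principal element names the EC2 service rather than a wildcard.

    import json
    import boto3

    iam = boto3.client("iam")

    # Trust policy: the Principal element names who may assume the role.
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }

    iam.create_role(
        RoleName="ec2-app-role",                      # illustrative name
        AssumeRolePolicyDocument=json.dumps(trust_policy),
    )

    # Permission policy: what the role is allowed to do once assumed.
    iam.put_role_policy(
        RoleName="ec2-app-role",
        PolicyName="allow-s3-read",
        PolicyDocument=json.dumps({
            "Version": "2012-10-17",
            "Statement": [{"Effect": "Allow",
                           "Action": "s3:GetObject",
                           "Resource": "arn:aws:s3:::example-bucket/*"}],
        }),
    )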
Cross Account Permissions:
You might need to allow users from another AWS account to access resources in your AWS account. If so, don't share security credentials, such as access keys, between accounts. Instead, use IAM roles.
You can define a role in the trusted account that specifies what permissions the IAM users in the other account are allowed.
You can also designate which AWS accounts have the IAM users that are allowed to assume the role. We do not define individual users here, rather the AWS account.
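A minimal sketch of the consuming-account side, assuming a role with an illustrative ARN has already been created in the other account: the caller obtains temporary credentials from STS and uses them to build a client.

    import boto3

    # The ARN below is illustrative; the role must trust the caller's AWS account.
    sts = boto3.client("sts")

    resp = sts.assume_role(
        RoleArn="arn:aws:iam::111122223333:role/cross-account-readonly",
        RoleSessionName="audit-session",
    )

    creds = resp["Credentials"]
    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    print(s3.list_buckets()["Buckets"])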
Access Key ID vs. Secret Access Key
Purpose: the access key ID identifies the source of API requests; the secret access key signs API requests for authentication.
Format: the access key ID is a 20-character string of uppercase letters and numbers; the secret access key is a 40-character string of uppercase and lowercase letters and numbers.
Security: the access key ID is sensitive but not as critical as the secret access key; the secret access key is extremely sensitive, must be kept confidential, and should never be shared or exposed.
Instance Type:
The instance type determines the hardware of the host computer used for the instance. Different instance types have varying combinations of CPU, memory, storage, and networking capacity.
Examples of instance types include t2.micro, m5.large, c5.xlarge, etc.
Key Pairs:
When you launch an EC2 instance, you can associate it with a key pair. This is a set of public and
private keys used for secure SSH (Linux) or RDP (Windows) access to the instance.The private key is
kept secure by the user, and the public key is placed on the instance. When you connect to the
instance, you use the private key to authenticate.
Security Groups:
Security groups act as a virtual firewall for the instance, controlling inbound and outbound traffic.
They can be configured to allow or deny specific types of traffic based on rules defined by the user.
Elastic IP Addresses: An Elastic IP address is a static, public IPv4 address that you can allocate to
your AWS account. It can be associated with an EC2 instance, providing a fixed public IP that can be
remapped to different instances.
Storage (EBS Volumes): EC2 instances can be associated with Elastic Block Store (EBS) volumes
for persistent storage. These volumes can be attached and detached from instances, providing a way to
store data independently of the instance's lifecycle.
Instance Metadata: EC2 instances have access to instance metadata, which provides information
about the instance's configuration and environment. This information can be accessed from within the
instance.
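A small sketch of reading instance metadata with IMDSv2 from inside a running instance; the token-then-read flow below uses only the Python standard library and the well-known 169.254.169.254 endpoint.

    import urllib.request

    # IMDSv2: fetch a session token first, then use it to read metadata.
    # This only works from inside a running EC2 instance.
    token_req = urllib.request.Request(
        "http://169.254.169.254/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
    )
    token = urllib.request.urlopen(token_req).read().decode()

    meta_req = urllib.request.Request(
        "http://169.254.169.254/latest/meta-data/instance-id",
        headers={"X-aws-ec2-metadata-token": token},
    )
    print(urllib.request.urlopen(meta_req).read().decode())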
Tags: Tags are key-value pairs that you can assign to EC2 instances (and other AWS resources). They
are useful for organizing and managing your resources, and they can be used for cost allocation and
billing purposes.
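Pulling the pieces above together (AMI, instance type, key pair, security group, tags), here is a hedged boto3 sketch of launching an instance; every ID and name in it is a placeholder.

    import boto3

    ec2 = boto3.client("ec2")

    # Launch a single t2.micro instance; the AMI ID, key pair, and security group
    # are placeholders for resources that would already exist in the account.
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",
        InstanceType="t2.micro",
        KeyName="my-keypair",
        SecurityGroupIds=["sg-0123456789abcdef0"],
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "Name", "Value": "demo-web-server"},
                     {"Key": "CostCenter", "Value": "training"}],
        }],
    )
    print(resp["Instances"][0]["InstanceId"])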
Types of Instance
1. General purpose
2. Compute optimized
3. Memory optimized
4. Storage optimized
5. Accelerated computing or GPU
6. High memory optimized
1. General purpose:
General purpose instances provide a balance of compute, memory and networking resources and
can be used for a variety of workloads.
There are 3 series available in the general purpose family:
a. A series: A1
b. M series: M4, M5, M5a, M5d, M5ad (large)
c. T series: T2 (free tier eligible), T3, T3a
A1 instances are ideally suited for scale-out workloads that are supported by the ARM ecosystem.
These instances are well suited for the following applications:
1. Web servers
2. Containerized microservices
3. Caching fleets
4. Distributed data stores
5. Applications that require the ARM instruction set
M4 instance:
The M4 instances feature a custom Intel Xeon E5-2676 v3 (Haswell) processor optimized specifically for EC2.
vCPU- 2 to 40 (max)
RAM- 8GB to 160GB(max)
Instance storage: EBS only (root volume storage)
M5, M5a, M5d and M5ad instances:
These instances provide an ideal cloud infrastructure, offering a balance of compute, memory, and networking resources for a broad range of applications.
Used in: gaming servers, web servers, small and medium databases
vCPU- 2 to 96(max)
RAM- 8 to 384(max)
Instance storage- EBS and NVMe SSD
c. T series: T2, T3, T3a instances:
These instances provide a baseline level of CPU performance with the ability to burst to a higher level when required by your workload. T Unlimited instances can sustain high CPU performance for any period of time whenever required.
vCPU- 2 to 8
RAM- 0.5 to 32 GB
Used for:
i. Website and web app
ii. Code repositories
iii. Development, build, test
iv. Micro services
2. Compute optimized:
Compute optimized instances are ideal for compute-bound applications that benefit from high-performance processors.
C Series: three types are available: C4, C5, C5n [C3 is the previous generation]
C4:
C4 instances are optimized for compute-intensive workloads and deliver very cost-effective high performance at a low price-per-compute ratio.
vCPU- 2 to 36
RAM- 3.75 to 60GB
Storage- EBS only
Network BW- 10 Gbps
Usecase: web server, batch processing, MMO gaming, Video encoding
3. Memory Optimized:
Memory optimized instances are designed to deliver fast performance for workloads that process large data sets in memory.
There are 3 series available:
R series, X series, Z series
A. R Series:
R4, R5, R5a, R5d, R5ad
High performance relational (MySQL) and NoSQL (MongoDB, Cassandra) databases.
Distributed web-scale cache stores that provide in-memory caching of key-value type data.
Used in financial services, Hadoop
vCPU- 2 to 96
RAM- 16 to 768GB
Instance storage- EBS and NVMe SSD
B. X Series:
X1, X1e instances:
Well suited for high performance databases, memory-intensive enterprise applications, relational database workloads, and SAP HANA.
Electronic design automation
vCPU- 4 to 128
RAM- 122 to 3904GB ,Instance storage- SSD
C. Z1d instance:
High frequency Z1d delivers a sustained all core frequency of up to 4.0 GHz, the fastest of any
cloud instances.
AWS Nitro System, Xeon processor, up to 1.8 TB of instance storage.
vCPU- 2 to 48
RAM- 16 to 384 GB
Storage- NVMe SSD
Use case: electronic design automation and certain database workloads with high per-core
licensing cost.
4. Storage optimized:
Storage optimized instances are designed for workloads that require high, sequential read and write access to very large data sets on local storage. They are optimized to deliver tens of thousands of low-latency, random I/O operations per second (IOPS) to applications.
It is of three types:
A. D series- D2 instance
B. H series- H1 instance
C. I series- I3 and I3en instance
A. D2 instance:
Massively parallel processing (MPP) data warehouses.
MapReduce and Hadoop distributed computing.
Log or data processing apps
vCPU- 4 to 36
RAM- 30.5 to 244GB
Storage- HDD
B. H series- H1 instance:
This family features up to 16TB of HDD-based local storage, high disk throughput, and a balance of compute and memory.
Well suited for app requiring sequential access to large amounts of data on direct attached
instance storage.
Application that requires high throughput access to large quantities of data.
vCPU- 8 to 64
RAM- 32 to 256GB
Storage- HDD
C. I series- I3 and I3en instances:
NoSQL databases
Distributed file system
Data warehousing application
vCPU- 2 to 96
RAM- 16 to 768GB
Local storage- NVMe SSD
Networking performance- 25 Gbps to 100 Gbps
Sequential throughput: Read- 16GBps Write- 6.4 GBps (I3)
5. Accelerated computing or GPU:
a. F1 instance:
F1 instances offer customizable hardware acceleration with field programmable gate arrays (FPGAs).
Each FPGA contains 2.5 million logic elements and 6,800 DSP (digital signal processing) engines.
Designed to accelerate computationally intensive algorithms such as data flow or highly
parallel operations.
Used in- genomics research, financial analytics, real time video processing and big data
search.
b. P2 and P3 Instance:
It uses NVIDIA Tesla GPUs.
Provide high bandwidth networking.
Up to 32GB of memory per GPU, which makes them ideal for deep learning and computational fluid dynamics.
Used in- machine learning, databases, seismic analysis, genomics, molecular modeling, AI,
deep learning
c. G2 and G3 instances:
Optimized for graphics intensive application.
Well suited for app like 3D visualization.
G3 instances use NVIDIA Tesla M60 GPU and provide a cost effective, high performance
platform for graphics applications.
Used in: video creation service, 3D visualization, streaming, graphic intensive application
6. High memory optimized:
High memory instances are purpose built to run large in-memory databases, including production deployments of SAP HANA in the cloud.
It has only one series, i.e., the U series.
Features:
Latest generation Intel Xeon Platinum 8176M processor.
6, 9, 12 TB of instance memory, the largest of any EC2 instance.
Powered by the AWS Nitro System, a combination of dedicated hardware and a lightweight hypervisor.
Bare metal performance with direct access to host hardware.
EBS-optimized by default at no additional cost.
Model numbers- u-6tb1.metal, u-9tb1.metal, u-12tb1.metal
Network performance- 25 Gbps
Dedicated EBS bandwidth- 14 Gbps
EC2 PURCHASING OPTIONS:
There are 6 purchasing options available for AWS EC2 instances, but there are 3 ways to pay for an Amazon EC2 instance, i.e., On-Demand, Reserved Instances, and Spot Instances.
You can also pay for Dedicated Hosts, which provide you with EC2 instance capacity on physical servers dedicated for your use.
1. On-Demand
2. Dedicated instance
3. Dedicated Host
4. Spot instance
5. Scheduled instance
6. Reserved instance
1. On-Demand Instance:
AWS On-Demand instances are virtual servers that run in AWS EC2 or AWS Relational Database Service (RDS) and are purchased at a fixed rate per hour.
AWS recommends using On-Demand instances for applications with short-term, irregular workloads that cannot be interrupted.
They are also suitable for use during testing and development of applications on EC2.
With On-Demand instances you only pay for the EC2 instances you use.
The use of On-Demand instances frees you from the cost and complexities of planning, purchasing, and maintaining hardware, and transforms what are commonly large fixed costs into much smaller variable costs.
Pricing is per instance hour consumed for each instance, from the time an instance is launched until it is terminated or stopped.
Each partial instance hour consumed is billed per second for Linux instances and as a full hour for all other instance types.
2. Dedicated Instance:
Dedicated instances run in a VPC on hardware that is dedicated to a single customer.
Your dedicated instances are physically isolated at the host hardware level from instances that belong to other AWS accounts.
Dedicated instances may share hardware with other instances from the same account that are not dedicated instances.
Pay for dedicated instances on demand, save up to 70% by purchasing reserved instances, or save up to 90% by purchasing spot instances.
3. Dedicated Host:
An Amazon EC2 dedicated host is a physical server with EC2 instance capacity fully dedicated to your use.
Dedicated hosts can help you address compliance requirements and reduce costs by allowing you to use your existing server-bound software licenses.
Pay for a physical host that is fully dedicated to running your instances and bring your existing per-socket, per-core, or per-VM software licenses to reduce costs.
Dedicated hosts give you additional visibility and control over how instances are placed on a physical server, and you can consistently deploy your instances to the same server over time.
As a result, dedicated hosts enable you to use your existing server-bound software licenses and address corporate compliance and regulatory requirements.
Instances that run on a dedicated host are the same virtualized instances that you get with traditional EC2 instances that use the Xen hypervisor.
Each dedicated host supports a single instance size and type (e.g., c3.xlarge).
Only BYOL, Amazon Linux, and AWS Marketplace AMIs can be launched onto dedicated hosts.
4. Spot Instances:
Amazon EC2 Spot instances let you take advantage of unused EC2 capacity in the AWS cloud.
Spot instances are available at up to a 90% discount compared to On-Demand prices.
You can use Spot instances for various test and development workloads.
You also have the option to hibernate, stop, or terminate your Spot instances when EC2 reclaims the capacity back with two minutes of notice.
Spot instances are spare EC2 capacity that can save you up to 90% off On-Demand prices, and that AWS can interrupt with a 2-minute notification. Spot uses the same underlying EC2 instances as On-Demand and Reserved instances, and is best suited for flexible workloads.
You can request Spot instances up to your Spot limit for each region.
You can determine the status of your Spot request via the Spot request status code and message.
You can access Spot request status information on the Spot Instances page of the EC2 console in the AWS Management Console.
In the case of hibernate, your instance gets hibernated and the RAM data is persisted. In the case of stop, your instance gets shut down and the RAM is cleared.
With hibernate, Spot instances pause and resume around any interruptions so your workloads can pick up exactly where they left off.
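One way to request Spot capacity is through run_instances with InstanceMarketOptions, sketched below with placeholder IDs; this is an illustrative example, not a prescribed workflow.

    import boto3

    ec2 = boto3.client("ec2")

    # Request a one-time Spot instance; AMI ID and instance type are placeholders.
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",
        InstanceType="c5.large",
        MinCount=1,
        MaxCount=1,
        InstanceMarketOptions={
            "MarketType": "spot",
            "SpotOptions": {
                "SpotInstanceType": "one-time",
                "InstanceInterruptionBehavior": "terminate",
            },
        },
    )
    print(resp["Instances"][0]["InstanceId"])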
5. Scheduled Instance:
Scheduled Reserved Instances enable you to purchase capacity reservations that recur on a daily, weekly, or monthly basis, with a specified start time and duration, for a one-year term.
You reserve the capacity in advance so that you know it is available when you need it.
You pay for the time that the instances are scheduled even if you do not use them.
Scheduled instances are a good choice for workloads that do not run continuously but do run on a regular schedule.
Purchase instances that are always available on the specified recurring schedule for a one-year term.
For example, you can use scheduled instances for an application that runs during business hours or for batch processing that runs at the end of the week.
6. Reserved Instances:
Amazon EC2 RIs provide a significant discount, up to 75% compared to On-Demand pricing, and provide a capacity reservation when used in a specific availability zone.
Reserved instances give you the option to reserve a DB instance for a one- or three-year term and in turn receive a significant discount compared to the On-Demand instance pricing for the DB instance.
There are 3 types of RI available:
a. Standard RI: these provide the most significant discount, up to 75% off On-Demand, and are best suited for steady-state usage.
b. Convertible RI: these provide a discount of up to 54% and the capability to change the attributes of the RI, as long as the exchange results in the creation of reserved instances of greater or equal value.
c. Scheduled RI: these are available to launch within the time window you reserve.
There are two types of block storage devices available for EC2.
1. Elastic Block Store (persistent, network-attached virtual drive)
2. Instance Store-Backed EC2:
Basically the virtual hard drive on the host allocated to this EC2 instance.
Limited to 10GB per device.
Ephemeral storage (non-persistent storage).
The EC2 instance can't be stopped; it can only be rebooted or terminated. Terminating will delete the data.
EBS volumes behave like raw, unformatted, external block storage devices that you can attach to your EC2 instances.
EBS volumes are block storage devices suitable for database-style data that requires frequent reads and writes.
EBS volumes are attached to your EC2 instances through the AWS network, like virtual hard drives.
An EBS volume can be attached to only a single EC2 instance at a time.
Both the EBS volume and the EC2 instance must be in the same AZ.
EBS volume data is replicated by AWS across multiple servers in the same AZ to prevent data loss resulting from any single AWS component failure.
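A short boto3 sketch of creating a gp3 volume and attaching it to an instance in the same AZ; the AZ, size, and instance ID are assumptions for illustration.

    import boto3

    ec2 = boto3.client("ec2")

    # Create a 20 GiB gp3 volume in the same AZ as the target instance.
    vol = ec2.create_volume(
        AvailabilityZone="us-east-1a",
        Size=20,
        VolumeType="gp3",
    )
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])

    # Attach it to an instance as an additional block device.
    ec2.attach_volume(
        VolumeId=vol["VolumeId"],
        InstanceId="i-0123456789abcdef0",
        Device="/dev/sdf",
    )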
gp3 volumes provide single-digit millisecond latency and 99.8 percent to 99.9 percent
volume durability with an annual failure rate (AFR) no higher than 0.2 percent, which
translates to a maximum of two volume failures per 1,000 running volumes over a one-year
period. AWS designs gp3 volumes to deliver their provisioned performance 99 percent of the
time.
3-Magnetic Standard:
Lowest cost per GB of all EBS volume types that is bootable.
Magnetic (standard) volumes are previous generation volumes that are backed by magnetic
drives. They are suited for workloads with small datasets where data is accessed infrequently
and performance is not of primary importance. These volumes deliver approximately 100
IOPS on average, with burst capability of up to hundreds of IOPS, and they can range in size
from 1 GiB to 1 TiB.
Magnetic volumes are ideal for workloads where data is accessed infrequently and
applications where the lowest storage cost is important.
Price: $0.05/GB/month
Volume size: 1GB to 1TB
Max IOPS/volume: 40-200
Snapshots
In the context of AWS (Amazon Web Services) and EC2 (Elastic Compute Cloud), a snapshot
is a point-in-time copy of an Amazon Machine Image (AMI) or the data on an Amazon
Elastic Block Store (EBS) volume.
Snapshots are incremental, which means that only the blocks on the device that have changed
after the last snapshot are saved. This helps to reduce storage costs and minimize the time it
takes to create the snapshot.
Snapshots are typically stored in Amazon S3 (Simple Storage Service) and can be used to
create new volumes or restore existing ones. They are an important part of creating reliable
and durable backups in AWS.
Keep in mind that while snapshots are a crucial component of backup and disaster recovery
strategies, they do not replace the need for regular data backup and retention policies. It's
important to have a comprehensive backup strategy that includes both snapshots and regular
backups.
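A minimal boto3 sketch of taking a snapshot and restoring it into a new volume; the volume ID, AZ, and tags are placeholders.

    import boto3

    ec2 = boto3.client("ec2")

    # Take a point-in-time, incremental snapshot of an EBS volume.
    snap = ec2.create_snapshot(
        VolumeId="vol-0123456789abcdef0",
        Description="nightly backup of data volume",
        TagSpecifications=[{
            "ResourceType": "snapshot",
            "Tags": [{"Key": "Backup", "Value": "nightly"}],
        }],
    )
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

    # Restore by creating a new volume from the snapshot.
    ec2.create_volume(
        AvailabilityZone="us-east-1a",
        SnapshotId=snap["SnapshotId"],
        VolumeType="gp3",
    )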
Advantages of Snapshot:
Data Backup and Recovery: Snapshots provide a reliable way to back up and restore
your data.
Incremental Backups: They save only the changed data, reducing storage costs.
Cost-Effective: You pay only for the changed data blocks.
Efficient Disk Management: Easily manage and restore to specific states.
Disaster Recovery: Quickly recover from failures by creating new volumes or
instances from snapshots.
Cloning and Replication: Duplicate instances or volumes easily.
Migration and Data Transfer: Move data between AWS regions or Availability Zones.
Customized Environments (AMI Snapshots): Create custom machine images with
specific software and configurations.
Version Control: Keep track of changes and roll back if needed.
Security and Compliance: Snapshots can be encrypted for added security.
Tagging and Organization: Add tags to easily manage and track your backups.
Flexibility: Allows for various operations like creating, copying, sharing, and deleting
snapshots based on your needs.
2. AMI Snapshots:
An Amazon Machine Image (AMI) is a pre-configured virtual machine image, which
includes an operating system and any additional software or configurations you've installed.
It serves as the basis for launching EC2 instances.
Why Use AMI Snapshots:
Instance Replication: AMI snapshots allow you to replicate your EC2 instances, ensuring that
you can launch new instances with the same configurations as the original.
Customized Environments: You can create custom AMIs with specific software,
configurations, and data pre-installed. This allows you to easily deploy instances with a
known and consistent environment.
How AMI Snapshots Work:
Creating an AMI from an Instance:You can create an AMI from a running or stopped EC2
instance. This process involves specifying the source instance and then AWS takes a snapshot
of the root volume to create the image.
Launching an Instance from an AMI:When you launch an instance from an AMI, AWS
creates a new EBS volume from the snapshot and attaches it to the new instance. This means
that any data or changes made to the instance after the AMI snapshot was taken will not be
included in the new instance.
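The create-then-launch flow can be sketched with boto3 as below; the instance ID, image name, and instance type are illustrative.

    import boto3

    ec2 = boto3.client("ec2")

    # Create an AMI from an existing instance (the instance ID is a placeholder).
    image = ec2.create_image(
        InstanceId="i-0123456789abcdef0",
        Name="web-server-golden-image-v1",
        Description="Base image with web stack pre-installed",
        NoReboot=True,   # snapshot without stopping the instance (may be less consistent)
    )
    ec2.get_waiter("image_available").wait(ImageIds=[image["ImageId"]])

    # Launch a new, identical instance from the AMI.
    ec2.run_instances(
        ImageId=image["ImageId"],
        InstanceType="t2.micro",
        MinCount=1,
        MaxCount=1,
    )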
(Network & Security) Security Group :
In Amazon Web Services (AWS), a security group is a fundamental component of the
networking and security model for Amazon Elastic Compute Cloud (EC2) instances. It acts
as a virtual firewall that controls inbound and outbound traffic for one or more EC2
instances.
Security groups play a crucial role in controlling network access to your EC2 instances and
are an important part of your overall security posture in AWS. It's recommended to follow the
principle of least privilege when configuring security group rules, meaning that you should
only allow the minimum necessary access to accomplish a specific task.
Here are some key points about security groups:
Stateful Filtering: Security groups are stateful, meaning if you allow inbound traffic from a
specific IP address, the corresponding outbound traffic is automatically allowed, regardless of
outbound rules. This simplifies the process of managing network access.
Rule-Based: Security groups operate based on rules. Each rule specifies a type of traffic (e.g.,
HTTP, SSH), a protocol (TCP, UDP, ICMP), and a range of allowable IP addresses (CIDR
blocks) or specific IP addresses.
Allow Rules: Allow rules permit specific types of inbound or outbound traffic. For example,
you might have an allow rule that allows inbound traffic on port 80 (HTTP) to your web
server.
Deny Rules: Deny rules are not used in security groups. Instead, if a particular type of traffic
is not explicitly allowed, it is implicitly denied.
Bound to Instances: Security groups are associated with EC2 instances. When you launch an
instance, you can specify one or more security groups to be associated with it. You can also
modify the security groups associated with an existing instance.
Multiple Security Groups per Instance: An instance can be associated with multiple security
groups. The rules from all associated security groups are effectively combined.
Priority of Rules: If a traffic type is allowed by one security group but denied by another, the
"allow" rule takes precedence.
Basic Rules for Defining Security Group:
Rules for a security group in AWS EC2 define the type of inbound and outbound traffic that
is allowed or denied. Each rule specifies the following:
Type: This defines the type of traffic, such as SSH (for remote access via Secure Shell), HTTP,
HTTPS, etc.
Protocol: This specifies the network protocol used for the rule, which can be TCP, UDP, or
ICMP.
Port Range: For TCP and UDP, this indicates the range of ports that the rule applies to. For
example, for HTTP, you would use port 80.
Source/Destination: This is the source of inbound traffic or the destination of outbound
traffic. It can be specified as an IP address, an IP range (in CIDR notation), or another
security group.
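Here is a hedged boto3 sketch that creates a security group and adds two allow rules of this shape; the VPC ID and CIDR ranges are placeholders.

    import boto3

    ec2 = boto3.client("ec2")

    # Create a security group and allow inbound SSH (22) and HTTP (80).
    sg = ec2.create_security_group(
        GroupName="web-server-sg",
        Description="Allow SSH from the office and HTTP from anywhere",
        VpcId="vpc-0123456789abcdef0",
    )
    ec2.authorize_security_group_ingress(
        GroupId=sg["GroupId"],
        IpPermissions=[
            {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
             "IpRanges": [{"CidrIp": "203.0.113.0/24", "Description": "office"}]},
            {"IpProtocol": "tcp", "FromPort": 80, "ToPort": 80,
             "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "public HTTP"}]},
        ],
    )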
(Network & Security) Placement Groups
A placement group in AWS is like a special area where you can put your computer servers
(EC2 instances). Depending on the type of placement group, the servers will be arranged in a
way that helps them work together better.
Placement groups are used to influence the placement of instances to meet the needs of your
workload.
Cluster Placement Group: If you want your servers to talk to each other very quickly, this is
like putting them on the same team in a sports game. They'll be placed really close together to
reduce the time it takes for them to communicate.
Instance Types: Instances in a cluster placement group must be of the same instance type.
Availability Zone: All instances in a cluster placement group must be in the same Availability
Zone.
Limitations: There is a limit on the number of instances you can launch in a cluster placement
group, and you cannot move an existing instance into a cluster placement group.
Partition Placement Group: Servers are split into separate groups (partitions) that do not share the same underlying hardware, so a hardware failure affects only one partition.
Instance Types: Instances in a partition placement group can be of different instance types.
Availability Zone: Instances in a partition placement group can span multiple Availability
Zones in a region.
Limitations: There are limits on the number of partitions and instances you can have in a
partition placement group.
Spread Placement Group: Imagine each server is given its own special spot in a big room. This helps make sure that if something goes wrong with one server, it won't affect the others.
Instance Types: Instances in a spread placement group can be of different instance types.
Availability Zone: Instances in a spread placement group are placed on distinct hardware in a
single Availability Zone.
Limitations: There are limits on the number of instances you can have in a spread placement
group.
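A small boto3 sketch of creating a cluster placement group and launching an instance into it; the group name, AMI ID, and instance type are assumptions.

    import boto3

    ec2 = boto3.client("ec2")

    # Create a cluster placement group for low-latency, high-throughput networking.
    ec2.create_placement_group(GroupName="low-latency-group", Strategy="cluster")

    # Launch an instance into the placement group (the AMI ID is a placeholder).
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",
        InstanceType="c5.xlarge",
        MinCount=1,
        MaxCount=1,
        Placement={"GroupName": "low-latency-group"},
    )

    # Other strategies: Strategy="partition" or Strategy="spread" for placement
    # on distinct hardware.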
(Network & Security) Network Interface
A network interface is a virtual network interface that you can attach to an EC2 instance. It acts as a networking component for an EC2 instance, allowing it to communicate with other resources within a Virtual Private Cloud (VPC) or over the internet.
A network interface, in simple words, is like a virtual plug that allows a computer or server to
connect to a network. It's a way for your device to communicate with other devices, like other
computers, servers, or the internet.
Imagine your computer as a house with many rooms. Each room has a different door. The
network interface is like a special door that connects your house to the outside world. It lets
you send and receive information over a network, which could be a local network in your
home or a global network like the internet.
Security: Network interfaces can be associated with security groups and Network Access
Control Lists (NACLs), which allow you to control inbound and outbound traffic to and from
the instance.
Elastic Load Balancing: You can attach multiple network interfaces to an instance and
associate them with different subnets and security groups to distribute traffic using an Elastic
Load Balancer. Virtual Private Cloud (VPC): Network interfaces play a crucial role in
enabling communication between instances in a VPC, as well as allowing instances to
connect to the internet or other AWS services.
Elastic IP Addresses: You can associate Elastic IP addresses with a specific network interface,
allowing you to have a consistent public IP address even if you stop and start the associated
instance.
Multiple IP Addresses: You can assign multiple IP addresses to a single network interface,
which is useful in scenarios where an instance needs to have multiple IP addresses.
It's important to note that when you terminate an EC2 instance, all the associated network
interfaces are also terminated by default. However, you have the option to detach a network
interface from an instance, which keeps it alive even after the instance termination.
Overall, network interfaces in AWS EC2 provide flexibility and control over the networking
capabilities of your instances, allowing you to design and configure your network to suit your
specific requirements.
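A minimal boto3 sketch of creating a secondary network interface and attaching it to an instance; the subnet, security group, and instance IDs are placeholders.

    import boto3

    ec2 = boto3.client("ec2")

    # Create a secondary network interface in a subnet.
    eni = ec2.create_network_interface(
        SubnetId="subnet-0123456789abcdef0",
        Groups=["sg-0123456789abcdef0"],
        Description="secondary interface for admin traffic",
    )

    # Attach it to an existing instance.
    ec2.attach_network_interface(
        NetworkInterfaceId=eni["NetworkInterface"]["NetworkInterfaceId"],
        InstanceId="i-0123456789abcdef0",
        DeviceIndex=1,   # eth1; index 0 is the primary interface
    )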
(Load Balancing)Load Balancer
A load balancer in AWS (Amazon Web Services) EC2 is a service that helps distribute
incoming network traffic across multiple EC2 instances. This helps improve the
availability and fault tolerance of your application by ensuring that no single instance
becomes overwhelmed with too much traffic.
Classic Load Balancer (CLB): This is the original AWS load balancer that provides basic
load balancing across multiple EC2 instances.
The Classic Load Balancer is designed to work across multiple Availability Zones (AZs) to
increase fault tolerance. You can distribute your EC2 instances across different AZs, and the
load balancer will route traffic to healthy instances in each zone.
The load balancer regularly checks the health of registered instances. If an instance is
determined to be unhealthy, the load balancer stops sending traffic to it.
Application Load Balancer (ALB):
ALB operates at the application layer (Layer 7) and is ideal for routing HTTP/HTTPS traffic.
It can route requests based on content of the request like URL path or hostname, making it
suitable for modern web applications.
Application Load Balancers provide advanced routing and visibility features targeted at
application architectures, including microservices and containers.
Network Load Balancer (NLB):
Operating at the connection level (Layer 4), Network Load Balancers are capable of handling millions of requests per second securely while maintaining ultra-low latencies.
Gateway Load Balancer (GWLB) is a service provided by AWS that allows you to deploy,
scale, and manage third-party virtual appliances like firewalls, intrusion detection systems,
and other network security and monitoring tools in the cloud.
Gateway Load Balancer endpoints are also used by services such as AWS Network Firewall to help protect your VPC resources.
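As a concrete example of the Application Load Balancer pieces (load balancer, target group, listener), here is a hedged boto3 sketch; all names, subnets, and IDs are illustrative.

    import boto3

    elbv2 = boto3.client("elbv2")

    # Create an internet-facing Application Load Balancer across two subnets.
    lb = elbv2.create_load_balancer(
        Name="demo-alb",
        Subnets=["subnet-aaaa1111", "subnet-bbbb2222"],
        SecurityGroups=["sg-0123456789abcdef0"],
        Scheme="internet-facing",
        Type="application",
    )

    # Target group: where healthy EC2 instances will receive traffic.
    tg = elbv2.create_target_group(
        Name="demo-web-targets",
        Protocol="HTTP",
        Port=80,
        VpcId="vpc-0123456789abcdef0",
        TargetType="instance",
        HealthCheckPath="/health",
    )

    # Listener: forward HTTP on port 80 to the target group.
    elbv2.create_listener(
        LoadBalancerArn=lb["LoadBalancers"][0]["LoadBalancerArn"],
        Protocol="HTTP",
        Port=80,
        DefaultActions=[{"Type": "forward",
                         "TargetGroupArn": tg["TargetGroups"][0]["TargetGroupArn"]}],
    )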
Auto Scaling
Creating group of EC2 instances that scale up or down depending on the conditions you set.
Amazon EC2 Auto Scaling helps you ensure that you have the correct number of
Amazon EC2 instances available to handle the load for your application. You create
collections of EC2 instances, called Auto Scaling groups. You can specify the minimum
number of instances in each Auto Scaling group, and Amazon EC2 Auto Scaling ensures
that your group never goes below this size. You can specify the maximum number of
instances in each Auto Scaling group, and Amazon EC2 Auto Scaling ensures that your
group never goes above this size. If you specify the desired capacity, either when you
create the group or at any time thereafter, Amazon EC2 Auto Scaling ensures that your
group has this many instances. If you specify scaling policies, then Amazon EC2 Auto
Scaling can launch or terminate instances as demand on your application increases or
decreases.
Example:
For example, the following Auto Scaling group has a minimum size of one instance, a
desired capacity of two instances, and a maximum size of four instances. The scaling
policies that you define adjust the number of instances, within your minimum and
maximum number of instances, based on the criteria that you specify.
Enable elasticity by scaling horizontally through adding or terminating EC2 instances.
Auto Scaling ensures that you have the right number of AWS EC2 instances for your needs at all times.
Auto Scaling helps you save cost by cutting down the number of EC2 instances when they are not needed and scaling out to add more instances only when required.
Auto Scaling Components:
1-Auto Scaling Groups:
An Auto Scaling group contains a collection of EC2 instances that are treated as a
logical grouping for the purposes of automatic scaling and management
The Auto Scaling group continues to maintain a fixed number of instances even if an
instance becomes unhealthy. If an instance becomes unhealthy, the group terminates
the unhealthy instance and launches another instance to replace it.
When creating an Auto Scaling group, you can choose whether to launch On-Demand Instances, Spot Instances, or both. You can specify multiple purchase options for your Auto Scaling group only when you use a launch template.
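A hedged boto3 sketch of the minimum/desired/maximum idea from the example above (1/2/4), assuming a launch template named web-template already exists; the target-tracking policy at the end previews the scaling policies described next.

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Create an Auto Scaling group from an existing launch template
    # (the template name and subnet IDs are placeholders).
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="web-asg",
        LaunchTemplate={"LaunchTemplateName": "web-template", "Version": "$Latest"},
        MinSize=1,
        DesiredCapacity=2,
        MaxSize=4,
        VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",
    )

    # Target-tracking scaling policy: keep average CPU around 50%.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="web-asg",
        PolicyName="cpu-target-50",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"},
            "TargetValue": 50.0,
        },
    )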
2-Scaling Policies:
Scale manually
Manual scaling is the most basic way to scale your resources, where you specify only the change
in the maximum, minimum, or desired capacity of your Auto Scaling group. Amazon EC2 Auto
Scaling manages the process of creating or terminating instances to maintain the updated
capacity.
Predictive scaling:
Predictive Scaling is a feature within Amazon EC2 Auto Scaling that uses machine learning
to predict your application's future traffic and adjust the capacity of your Auto Scaling groups
accordingly. This helps ensure that you have the right amount of capacity to handle expected
changes in demand.
Cyclical traffic, such as high use of resources during regular business hours and low
use of resources during evenings and weekends
Recurring on-and-off workload patterns, such as batch processing, testing, or periodic
data analysis
Horizontal Scaling (multiple instances) vs. Vertical Scaling (single larger instance):
Load Balancing: required with multiple instances; not required with a single instance.
Management Complexity: may require more management due to multiple instances; simpler management with fewer instances.
Block Storage:
Block storage is suitable for transactional databases, random read/write loads, and structured database storage.
Block storage divides the data to be stored into evenly sized blocks (data chunks); for instance, a file can be split into evenly sized blocks before it is stored.
Data blocks stored in block storage do not contain metadata (date created, date modified, content type, etc.).
Block storage only keeps the address (index number) where the data blocks are stored; it does not care what is in a block, just how to retrieve it when required.
Object Storage:
Object storage stores files as a whole and does not divide them.
In object storage an object is: the file/data itself, its metadata, and an object global unique ID.
The object global unique ID is a unique identifier for the object (it can be the object name itself) and it must be unique so that the object can be retrieved regardless of where its physical storage location is.
Object storage cannot be mounted as a drive.
Examples of object storage solutions are Dropbox, AWS S3, and Facebook.
Availability & Durability: both block storage and object storage offer high availability and durability.
Simple Storage Service(S3)
Amazon S3, or Simple Storage Service, is a widely used object storage service provided by
Amazon Web Services (AWS). It allows you to store and retrieve data over the internet.
S3 is a storage for the internet. It has a simple web service interface for simple storing and
retrieving of any amount of data, anytime from anywhere on the internet.
S3 is object based storage.
S3 has a distributed data store architecture where objects are redundantly stored in multiple
locations. (minimum 3 locations in same region)
A bucket is a flat container of objects.
The maximum size of a single object is 5TB; there is no limit on total bucket capacity.
You cannot create nested buckets.
Bucket ownership is non-transferable.
An S3 bucket is region specific.
You can have up to 100 buckets per account (this limit can be raised on request).
S3 bucket names are globally unique across all AWS regions.
Bucket names cannot be changed after they are created.
If a bucket is deleted, its name becomes available again to you or another account to use.
Bucket names must be at least 3 and no more than 63 characters long.
Bucket names are part of the URL used to access a bucket.
A bucket name must be a series of one or more labels.
Bucket names can contain lowercase letters, numbers, and hyphens, but cannot use uppercase letters.
A bucket name must not be formatted as an IP address.
Each label must start and end with a lowercase letter or a number.
By default, buckets and their objects are private, and only the owner can access the bucket.
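A short boto3 sketch of creating a bucket and storing/retrieving an object; the bucket name and region are illustrative, and the name may already be taken since bucket names are globally unique.

    import boto3

    s3 = boto3.client("s3", region_name="eu-west-1")

    # Create a bucket in a specific region (outside us-east-1 the
    # CreateBucketConfiguration is required).
    s3.create_bucket(
        Bucket="example-notes-bucket-2024",
        CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
    )

    # Upload and download an object.
    s3.put_object(Bucket="example-notes-bucket-2024", Key="notes/aws.txt",
                  Body=b"hello from S3")
    obj = s3.get_object(Bucket="example-notes-bucket-2024", Key="notes/aws.txt")
    print(obj["Body"].read())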
S3 Storage Classes:
1. Amazon S3 Standard:
S3 Standard offers high durability, availability, and performance object storage for frequently accessed data.
Durability is 99.999999999%.
Designed for 99.99% availability over a given year.
Supports SSL for data in transit and encryption of data at rest.
The storage cost for the object is fairly high, but there is very little charge for accessing the objects.
The largest object that can be uploaded in a single PUT is 5GB.
2. Amazon S3 Standard-IA:
S3 Standard-IA is for data that is accessed less frequently but requires rapid access when needed.
The storage cost is much cheaper than S3 Standard, almost half the price, but you are charged more heavily for accessing your objects.
Durability is 99.999999999%.
Resilient against events that impact an entire AZ.
Availability is 99.9% in a year.
Supports SSL for data in transit and encryption of data at rest.
Data that is deleted from S3-IA within 30 days will be charged for a full 30 days.
Backed with the Amazon S3 service level agreement for availability.
3. Amazon S3 One Zone-IA:
S3 One Zone-IA is for data that is accessed less frequently but requires rapid access when needed.
Data is stored in a single AZ.
Ideal for those who want a lower cost option for IA data.
It is a good choice for storing secondary backup copies of on-premises data or easily re-creatable data.
You can use S3 lifecycle policies.
Durability is 99.999999999%.
Availability is 99.5%.
Because S3 One Zone-IA stores data in a single AZ, data stored in this storage class will be lost in the event of AZ destruction.
4. Amazon S3 Glacier:
S3 Glacier is a secure, durable, low-cost storage class for data archiving.
To keep costs low yet suitable for varying needs, S3 Glacier provides three retrieval options that range from a few minutes to hours.
You can upload objects directly to Glacier or use lifecycle policies.
Durability is 99.999999999%.
Data is resilient in the event of one entire AZ destruction.
Supports SSL for data in transit and encryption of data at rest.
You can retrieve 10GB of your Amazon S3 Glacier data per month for free with the free tier account.
S3 bucket versioning is a feature provided by Amazon S3 that allows you to keep multiple
versions of an object in the same S3 bucket. When versioning is enabled for a bucket, any
new version of an object that is uploaded will not overwrite the existing version. Instead, it
will create a new version of the object.
Bucket versioning is an S3 bucket sub-resource used to protect against accidental object/data deletion or overwrites.
Versioning can also be used for data retention and archiving.
Once you enable versioning on a bucket it cannot be disabled; however, it can be suspended.
When enabled, bucket versioning will protect existing and new objects and maintain their versions as they are updated.
Updating objects refers to PUT, POST, COPY, and DELETE actions on objects.
When versioning is enabled and you try to delete an object, a delete marker is placed on the object.
You can still view the object and the delete marker.
If you reconsider deleting the object, you can delete the delete marker and the object will be available again.
You will be charged the S3 storage cost for all object versions stored.
You can use versioning with S3 lifecycle policies to delete older versions, or you can move them to a cheaper S3 storage class (Glacier).
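A minimal boto3 sketch of enabling versioning and listing object versions; the bucket name and key are placeholders.

    import boto3

    s3 = boto3.client("s3")

    # Enable versioning on an existing bucket (use "Suspended" to pause it).
    s3.put_bucket_versioning(
        Bucket="example-notes-bucket-2024",
        VersioningConfiguration={"Status": "Enabled"},
    )

    # Overwriting the same key now creates a new version instead of replacing it.
    s3.put_object(Bucket="example-notes-bucket-2024", Key="notes/aws.txt", Body=b"v2")

    # List all versions (and any delete markers) of objects under a prefix.
    versions = s3.list_object_versions(Bucket="example-notes-bucket-2024",
                                       Prefix="notes/")
    for v in versions.get("Versions", []):
        print(v["Key"], v["VersionId"], v["IsLatest"])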
An S3 bucket lifecycle rule in AWS defines a set of actions that should be automatically
applied to objects stored in the bucket over time. These actions typically include transitioning
objects to different storage classes or deleting them altogether based on specified criteria.
This helps you manage your storage costs and optimize your data storage strategy.
A typical S3 bucket lifecycle rule consists of the following components:
Transition Actions:
Transition to S3 Standard-IA (Infrequent Access): You can specify a certain number of days
after which objects are automatically moved to the Standard-IA storage class. This class
offers lower storage costs compared to S3 Standard, but with a retrieval fee for accessing the
data.
Transition to S3 One Zone-IA (Infrequent Access): Similar to Standard-IA, but data is stored
in a single Availability Zone, providing a lower-cost option with a slight trade-off in
durability compared to Standard-IA.
Transition to S3 Glacier and Glacier Deep Archive: You can specify a certain number of days
after which objects are automatically moved to these archival storage classes. Glacier
provides even lower storage costs but with a longer retrieval time compared to S3 Standard-
IA.
Expiration Actions:
Expiration: You can set a number of days after which objects are automatically deleted from
the bucket. This is useful for managing data retention policies.
Noncurrent Version Expiration: If versioning is enabled on the bucket, you can set a
specific number of days after which non-current versions of objects are automatically deleted.
Object Size: You can specify minimum and maximum object sizes. For example, you could set up a rule to transition objects larger than a certain size to a different storage class after a specified number of days.
Object Tags: You can optionally specify object tags as a condition for applying the lifecycle
rule. This allows you to target specific objects based on their tags.
Status: You can enable or disable a lifecycle rule to control its current applicability.
Keep in mind that once a lifecycle rule is set up, AWS S3 will automatically manage the
transitions and deletions based on the specified criteria, which can help you optimize your
storage costs and data management processes.
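The rule components above can be expressed in one put_bucket_lifecycle_configuration call, sketched below with illustrative day counts, prefix, and bucket name.

    import boto3

    s3 = boto3.client("s3")

    # Lifecycle rule: move objects under "logs/" to Standard-IA after 30 days,
    # to Glacier after 90 days, and expire them after 365 days.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-notes-bucket-2024",
        LifecycleConfiguration={
            "Rules": [{
                "ID": "archive-then-expire-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
                "NoncurrentVersionExpiration": {"NoncurrentDays": 180},
            }],
        },
    )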
Replication in S3
Use Cases:
Data Redundancy: Replication provides redundancy by storing copies of your data in
multiple locations. This helps protect against data loss due to hardware failures or other
issues.
Disaster Recovery: Cross-Region Replication can be a critical part of a disaster recovery
strategy, ensuring that your data is stored in geographically separate locations.
Compliance: Same-Region Replication can be used to meet compliance requirements that
mandate data be kept within a specific region.
Global Access: Replication can improve access times for users located in different regions,
as they can retrieve objects from a bucket located closer to them.
Remember, while replication is a powerful tool, it is not a substitute for regular backups. It's
recommended to implement a comprehensive backup strategy alongside replication for
critical data.
Access Point:
Amazon S3 access points simplify data access for any AWS service or customer application
that stores data in S3. Access points are named network endpoints that are attached to buckets
that you can use to perform S3 object operations, such as GetObject and PutObject.
An access point in an S3 bucket is a way to manage access to your S3 storage resources. It
allows you to define specific access policies for individual applications or users.
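A brief boto3 sketch of creating a VPC-restricted access point (its advantages are listed next); the account ID, access point name, bucket, and VPC ID are placeholders.

    import boto3

    s3control = boto3.client("s3control")

    # Create an access point for a bucket, restricted to a specific VPC.
    s3control.create_access_point(
        AccountId="111122223333",
        Name="analytics-app-ap",
        Bucket="example-notes-bucket-2024",
        VpcConfiguration={"VpcId": "vpc-0123456789abcdef0"},
    )

    # Applications then address objects through the access point (and its policy)
    # instead of the bucket name directly.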
Here are some key advantages of access points in AWS S3:
Granular Access Control: Access Points allow you to define specific access policies for
individual applications, users, or resources. This enables fine-grained control over who can
access your S3 resources.
Network Isolation: You can associate an Access Point with a specific Virtual Private Cloud
(VPC). This ensures that data transfers stay within the VPC, providing an additional layer of
security.
Enhanced Security: By controlling access through Access Points, you can limit the exposure
of your S3 bucket to only the resources that need it, reducing the risk of unauthorized access.
Private Endpoint: Each Access Point provides a unique DNS name that can be used to
access your S3 bucket. This allows you to keep the access restricted and controlled.
Easier Multi-Tenancy: In scenarios where multiple teams or applications need access to the
same S3 bucket, Access Points make it easier to manage permissions separately for each
entity.
Cost Control with Requester Pays: You can configure an Access Point to enforce "requester
pays", which means the requester (the one making the API request) is responsible for data
transfer and request costs. This can be useful in scenarios where you want to share data but
not the cost.
Simplified Management: Access Points provide a clear and structured way to manage access
to your S3 resources. This can make it easier to understand and control who has access to
your data.
Resource-Level Permissions: Access Points can be associated with specific IAM roles,
allowing you to grant access to specific resources within your S3 bucket.
Scalability: As your applications and teams grow, Access Points can help you scale access
control without having to manage complex bucket policies.
Policy Flexibility: You can use both bucket policies and access point policies to control
access to your S3 resources, providing flexibility in how you define access rules.
Overall, Access Points provide a powerful tool for managing access to your S3 buckets with
increased granularity and security options. They are particularly valuable in scenarios where
you need fine-grained control over access permissions or when dealing with complex, multi-
tenant environments.
Multi-Region Access Points
Multi-Region Access Points in Amazon S3 provide a powerful tool for managing and optimizing access to your S3 data across multiple regions, improving both performance and availability for your applications.
When you create a Multi-Region Access Point, you specify a set of AWS Regions where you
want to store data to be served through that Multi-Region Access Point. You can use S3
Cross-Region Replication (CRR) to synchronize data among buckets in those Regions. You
can then request or write data through the Multi-Region Access Point global endpoint.
Advantages
Faster Access: Automatically routes requests to the nearest AWS region for improved
performance.
Simplified Setup: Provides a single access point, eliminating the need to manage separate
buckets in multiple regions.
Fine-Tuned Routing: Lets you define policies for which regions to use based on your needs.
Better Availability: Redirects requests to a backup region if the primary region is down.
Disaster Recovery: Works with Cross-Region Replication for data backup in separate
regions.
Consistency: Maintains strong read-after-write consistency, even across different regions.
Global Data Distribution: Enables easy distribution of data to serve a global audience.
Centralized Access Control: Inherits access policies from underlying buckets for fine-
grained control.
Cost Efficiency: Reduces manual data management efforts and optimizes costs.
Athena
Amazon Athena is an interactive query service provided by Amazon Web Services (AWS)
that allows you to analyze data stored in Amazon S3 using SQL. It's part of the AWS Big
Data and Analytics suite of services.
Here are some key details about AWS Athena:
Serverless and Managed Service: Athena is a serverless service, which means you don't
need to manage any underlying infrastructure. You don't have to provision or scale clusters.
You simply submit SQL queries, and Athena handles the rest.
Querying Data in Amazon S3: Athena is designed to work with data stored in Amazon S3,
which is AWS's object storage service. You can use it to analyze various types of data like
CSV, JSON, Parquet, ORC, and more.
Schema on Read: Athena uses a schema-on-read approach, meaning it doesn't require you to
define a schema before querying your data. Instead, it infers the schema from the data when
you run a query.
Supports Standard SQL: You can use standard SQL (Structured Query Language) to query
your data in Athena. This makes it accessible to a wide range of users who are already
familiar with SQL.
Integration with AWS Glue: Athena can leverage AWS Glue Data Catalog, which is a
managed metadata catalog that integrates with various AWS services. This makes it easier to
manage metadata and discover and query your data.
Partitioning and Compression: You can optimize your queries by partitioning your data in
S3 and using appropriate compression techniques. This can significantly improve query
performance.
Cost Model: With Athena, you pay only for the queries you run. You are billed based on the
amount of data scanned by your queries. This can be cost-effective for sporadic, ad-hoc
querying, but it's important to manage your data and queries efficiently to control costs.
Result Output and Export: Athena allows you to save query results in a variety of formats
including CSV, JSON, or Parquet. You can also integrate Athena with other AWS services
like Amazon QuickSight for data visualization and analysis.
Workgroup Management: Athena allows you to manage multiple workgroups, each with its
own set of users, queries, and settings. This can help you separate and manage workloads
effectively.
Geospatial Functions: Athena has support for geospatial functions, which allows you to
perform spatial queries on geospatial data.
Integration with Other AWS Services: Athena can be integrated with various AWS services
like AWS Glue, Amazon QuickSight, AWS Lambda, and more, allowing you to build
comprehensive data pipelines and analytical solutions (a minimal query-submission sketch follows this list).
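To make the query flow concrete, here is a minimal boto3 sketch that submits a query, waits for it to finish, and reads the results. The database name, SQL statement, and results bucket are assumptions for illustration only.

    import time
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Placeholder database, table, and output location.
    query_id = athena.start_query_execution(
        QueryString="SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status",
        QueryExecutionContext={"Database": "analytics_db"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
        print(rows[:5])   # first few rows, including the header row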
Types of Tables in Cluster Ecosystem:
In a cluster computing ecosystem, such as Apache Hadoop, there are mainly two types of tables:
Internal (Managed) Tables: Hive owns both the table metadata and the underlying data; dropping the table also removes the data.
External Tables: Hive manages only the metadata, while the data lives in an external location (for example, an HDFS path or Amazon S3); dropping the table leaves the data in place.
Use Cases:
Internal Tables are typically used for intermediate or temporary data, or when you want Hive
to have full control over the data's lifecycle and storage.
External Tables are useful when you want to maintain data independently from Hive, share
data across different systems or clusters, or work with data that's already stored in an external
location.
Both types have their own advantages and use cases, and the choice between them depends
on your specific requirements and workflows.
Metadata Storage: Internal tables: the Hive Metastore stores both the schema and the location information. External tables: the Hive Metastore stores the schema, but the location is external.
Data Location: Internal tables: tightly integrated with Hadoop storage. External tables: the location is external to the cluster.
Data Sharing: Internal tables: typically used for isolated, Hive-centric data. External tables: used for data accessible by multiple systems/clusters.
Data Updates: Internal tables: Hive manages the data and can perform updates. External tables: data must be maintained and updated externally.
DPU(Data Processing Unit)
In Amazon Athena, DPU stands for Data Processing Unit. It is a unit of measure for
the amount of resources consumed by a query when it runs in Athena. DPUs are used
to quantify the processing capacity needed to execute a query on your data.
Each DPU provides a certain amount of CPU, memory, and networking capacity. The
amount of processing power required for a query depends on factors like the
complexity of the query, the volume of data being scanned, and the types of
operations being performed.
When you run a query in Athena, the number of DPUs consumed is determined by the
amount of data scanned by the query. You are billed based on the total amount of data
scanned in your query, rounded up to the nearest megabyte, and the number of DPUs
used.
It's worth noting that DPUs in Athena are specific to the execution of a query and do
not directly correlate to any specific hardware or instance type. They are a logical unit
used for billing purposes. Different types of queries will use different amounts of
DPUs depending on their resource requirements.
Keep in mind that pricing details and DPU values may change over time, so it's a
good idea to check the official Amazon Athena pricing documentation for the most
up-to-date information.
S3 Query Editor Vs Athena Query Editor
Query Language: The S3 Query Editor offers SQL-like querying limited to basic capabilities; Athena supports standard SQL with advanced capabilities (joins, aggregations, etc.).
Data Partitioning: The S3 Query Editor has limited support (may require manual handling); Athena supports partitioning for improved performance and cost efficiency.
Data Compression: The S3 Query Editor has limited support (may require manual handling); Athena supports compression for improved performance and cost efficiency.
Data Catalog Integration: The S3 Query Editor does not have its own data catalog; Athena integrates with the AWS Glue Data Catalog.
Performance Optimization: The S3 Query Editor is less optimized for complex queries and large datasets; Athena is optimized for performance and can handle large datasets and complex queries.
Scalability: The S3 Query Editor has limited scalability for large datasets and complex queries; Athena is designed to handle large volumes of data and complex queries.
Data Security: The S3 Query Editor controls access through AWS Identity and Access Management (IAM) policies; Athena controls access through IAM policies and integrates with AWS Lake Formation for fine-grained access control.
Integration with Other AWS Services: The S3 Query Editor directly interacts with data stored in S3 buckets; Athena integrates with various AWS services for data ingestion, transformation, and visualization.
ETL (Extract, Transform, Load)
ETL stands for Extract, Transform, Load. It is a process used in data warehousing and
data integration to move data from various sources, transform it into a usable format,
and then load it into a target database or data warehouse for analysis and reporting.
ETL processes are commonly used in business intelligence, data warehousing, and
data integration projects to ensure that data is available and in the right format for
analysis and reporting.
Components of ETL
A. Extract
Data Extraction: Retrieving data from source systems, which can include databases, flat files, APIs, or other repositories (for example, Salesforce, data lakes, or a data warehouse).
Change Data Capture (CDC): Identifying and capturing only the data that has changed since the last extraction to optimize efficiency.
B. Transform
Data Cleaning: Removing or correcting errors, inconsistencies, or inaccuracies in the source data.
Data Transformation: Restructuring and converting data into a format suitable for the target system.
Data Enrichment: Enhancing data by adding additional information or attributes.
C. Load
Data Staging: Temporary storage of transformed data before loading it into the target system.
Data Loading: Inserting or updating data in the destination database or data warehouse (an UPSERT operation).
Error Handling: Managing and logging errors that may occur during the loading process.
In more detail, the Transformation and Loading phases typically involve:
B. Transformation Phase
Data Mapping: Creating a mapping between source and target data structures.
Data Cleansing: Identifying and correcting data quality issues.
Data Validation: Ensuring transformed data meets specified quality standards.
C. Loading Phase
Data Staging: Moving transformed data to a staging area for further processing.
Bulk Loading: Efficiently inserting large volumes of data into the target system.
Indexing: Creating indexes to optimize data retrieval in the target database.
Post-Load Verification: Confirming that the data has been loaded successfully.
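As a minimal sketch of the extract-transform-load flow described above, the following PySpark example reads raw CSV files, cleans and enriches them, and writes partitioned Parquet to a curated location. The bucket paths and column names are assumptions, and the same logic could equally be expressed with any of the ETL tools listed next.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("simple-etl").getOrCreate()

    # Extract: read raw CSV files from a placeholder S3 location.
    orders = spark.read.option("header", True).csv("s3://example-raw-bucket/orders/")

    # Transform: clean, convert types, and enrich.
    cleaned = (
        orders
        .dropDuplicates(["order_id"])                          # data cleaning
        .withColumn("order_ts", F.to_timestamp("order_ts"))    # data transformation
        .withColumn("order_year", F.year("order_ts"))          # data enrichment
        .filter(F.col("amount").isNotNull())                   # data validation
    )

    # Load: write the transformed data as partitioned Parquet to the target area.
    cleaned.write.mode("overwrite").partitionBy("order_year").parquet(
        "s3://example-curated-bucket/orders/")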
Cloud Based ETL Tools:
AWS Glue: AWS Glue is a fully managed ETL service provided by Amazon Web Services. It
automatically discovers, catalogs, and transforms your data, making it easier to prepare and
load it for analytics.
AWS EMR:
Amazon EMR (Elastic MapReduce) is a cloud-based big data processing service provided by
Amazon Web Services (AWS). It allows for the processing of large amounts of data using
popular open-source frameworks like Apache Spark, Hadoop, Hive, and Presto, among
others.
Microsoft Azure Data Factory: Azure Data Factory is a cloud-based ETL and data
integration service provided by Microsoft Azure. It allows you to create, schedule, and
manage data pipelines.
Google Cloud Dataflow: Dataflow is a fully managed stream and batch data processing
service provided by Google Cloud. It allows for building data pipelines using both batch and
streaming data.
Talend Cloud: Talend also offers a cloud-based version of their ETL platform, which
provides data integration and transformation capabilities in a cloud environment.
Matillion: Matillion is a cloud-native ETL platform that is purpose-built for popular cloud
data platforms like AWS, Google Cloud, and Snowflake. It offers native connectors to
various cloud data sources.
7) Maintenance/Ease of Use: ETL processes that involve an on-premises server require frequent maintenance by IT, given their fixed tables, fixed timelines, and the requirement to repeatedly select data to load and transform; newer automated, cloud-based ETL solutions require little maintenance. The ELT process typically requires low maintenance, given that all data is always available and the transformation process is usually automated and cloud-based.
8) Cost: ETL can be cost-prohibitive for many small and medium businesses. ELT benefits from a robust ecosystem of cloud-based platforms that offer much lower costs and a variety of plan options to store and process data.
9) Hardware: The traditional, on-premises ETL process requires expensive hardware; newer, cloud-based ETL solutions do not require hardware. Given that the ELT process is inherently cloud-based, no additional hardware is required.
10) Compliance: ETL is better suited for compliance with GDPR, HIPAA, and CCPA standards, given that users can omit any sensitive data prior to loading it into the target system. ELT carries more risk of exposing private data and not complying with those standards, given that all data is loaded into the target system.
AWS Glue
AWS Glue is a serverless data integration service that makes it easy for analytics users to
discover, prepare, move, and integrate data from multiple sources.
With AWS Glue, you can discover and connect to more than 70 diverse data sources and
manage your data in a centralized data catalog. You can visually create, run, and monitor
extract, transform, and load (ETL) pipelines to load data into your data lakes. Also, you can
immediately search and query cataloged data using Amazon Athena, Amazon EMR, and
Amazon Redshift Spectrum.
AWS Glue consolidates major data integration capabilities into a single service. These
include data discovery, modern ETL, cleansing, transforming, and centralized cataloging. It's
also serverless, which means there's no infrastructure to manage. With flexible support for all
workloads like ETL, ELT, and streaming in one service, AWS Glue supports users across
various workloads and types of users.
Also, AWS Glue makes it easy to integrate data across your architecture. It integrates with
AWS analytics services and Amazon S3 data lakes
Unify and search across multiple data stores – Store, index, and search across
multiple data sources and sinks by cataloging all your data in AWS.
Automatically discover data – Use AWS Glue crawlers to automatically infer
schema information and integrate it into your AWS Glue Data Catalog.
Manage schemas and permissions – Validate and control access to your databases
and tables.
Connect to a wide variety of data sources – Tap into multiple data sources, both on
premises and on AWS, using AWS Glue connections to build your data lake.
Transform, prepare, and clean data for analysis
You define jobs in AWS Glue to accomplish the work that's required to extract, transform,
and load (ETL) data from a data source to a data target. You typically perform the following
actions:
For data store sources, you define a crawler to populate your AWS Glue Data Catalog
with metadata table definitions. You point your crawler at a data store, and the
crawler creates table definitions in the Data Catalog. For streaming sources, you
manually define Data Catalog tables and specify data stream properties.
In addition to table definitions, the AWS Glue Data Catalog contains other metadata
that is required to define ETL jobs. You use this metadata when you define a job to
transform your data.
AWS Glue can generate a script to transform your data. Or, you can provide the script
in the AWS Glue console or API.
You can run your job on demand, or you can set it up to start when a
specified trigger occurs. The trigger can be a time-based schedule or an event.
When your job runs, a script extracts data from your data source, transforms the data, and
loads it to your data target. The script runs in an Apache Spark environment in AWS Glue.
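The following is a minimal sketch of what such a generated or hand-written Glue Spark script can look like: it reads a catalog table populated by a crawler, applies a simple mapping, and writes Parquet to S3. The database, table, column, and bucket names are placeholders; transformation_ctx and job.commit() are included because they enable job bookmarks.

    import sys
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read from a Data Catalog table populated by a crawler (placeholder names).
    source = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="raw_orders",
        transformation_ctx="source")            # enables job bookmarks

    # Rename and cast columns (placeholder mappings).
    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[("order id", "string", "order_id", "string"),
                  ("amount", "string", "amount", "double")],
        transformation_ctx="mapped")

    # Write the result to S3 as Parquet.
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://example-curated-bucket/orders/"},
        format="parquet",
        transformation_ctx="sink")

    job.commit()   # persists the bookmark state for incremental runs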
Crawlers and Data Catalog
Crawlers:
Crawlers are used to automatically discover and catalog metadata about your data
sources, which can be stored in various formats and locations, such as Amazon S3,
Amazon RDS, Amazon Redshift, and more.
Crawlers automatically scan and discover the schema and metadata of your data
sources. This includes information like column names, data types, and relationships
between tables.
Catalog Creation: Once the crawler has scanned the data, it creates or updates a
metadata catalog in the AWS Glue Data Catalog. The Data Catalog is a central
repository where information about your data sources is stored. This catalog can be
used by other AWS services and applications for tasks like data querying and
transformation.
Schema Evolution: Crawlers are capable of detecting changes in the underlying data,
such as new columns or modified data types. They can update the catalog to reflect
these changes.
Support for Multiple Data Formats: AWS Glue Crawlers support a wide range of
data formats including CSV, JSON, Parquet, Avro, and more.
Integration with Other AWS Services: The metadata catalog created by AWS Glue
Crawlers can be used by other AWS services like Amazon Athena (for querying data),
Amazon Redshift Spectrum, and Amazon EMR.
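For illustration, a crawler can also be created and started programmatically with boto3, as in the sketch below. The IAM role, database name, and S3 path are placeholders, and the SchemaChangePolicy shown is just one way to handle schema evolution.

    import boto3

    glue = boto3.client("glue")

    glue.create_crawler(
        Name="raw-orders-crawler",
        Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",   # placeholder role ARN
        DatabaseName="sales_db",                                 # target Data Catalog database
        Targets={"S3Targets": [{"Path": "s3://example-raw-bucket/orders/"}]},
        SchemaChangePolicy={
            "UpdateBehavior": "UPDATE_IN_DATABASE",   # pick up new or changed columns
            "DeleteBehavior": "LOG",                  # log removals instead of deleting tables
        },
    )
    glue.start_crawler(Name="raw-orders-crawler")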
Data Catalog
The AWS Glue Data Catalog is a centralized metadata repository provided by
Amazon Web Services (AWS) as part of the AWS Glue service.
It stores and organizes metadata about your data sources, transformations, and targets.
The AWS Glue Data Catalog is an index to the location, schema, and runtime metrics
of your data.
You use the information in the Data Catalog to create and monitor your ETL jobs.
Information in the Data Catalog is stored as metadata tables, where each table
specifies a single data store.
Typically, you run a crawler to take inventory of the data in your data stores, but there
are other ways to add metadata tables into your Data Catalog
The following workflow diagram shows how AWS Glue crawlers interact with data
stores and other elements to populate the Data Catalog.
The following is the general workflow for how a crawler populates the AWS Glue Data
Catalog:
A crawler runs any classifiers that you choose to infer the format and schema of your
data. You provide the code for custom classifiers, and they run in the order that you
specify.
The first custom classifier to successfully recognize the structure of your data is used
to create a schema. Custom classifiers lower in the list are skipped.
If no custom classifier matches your data's schema, built-in classifiers try to recognize
your data's schema. An example of a built-in classifier is one that recognizes JSON.
The crawler connects to the data store. Some data stores require connection properties
for crawler access.
The inferred schema is created for your data.
The crawler writes metadata to the Data Catalog. A table definition contains metadata
about the data in your data store. The table is written to a database, which is a
container of tables in the Data Catalog. Attributes of a table include classification,
which is a label created by the classifier that inferred the table schema.
Classifiers
In AWS Glue, classifiers are components that help identify the schema and structure
of your data when it's not immediately apparent. They're particularly useful for
handling data in formats that may not have explicit schema information or for custom
data formats.
A classifier reads the data in a data store. If it recognizes the format of the data, it
generates a schema. The classifier also returns a certainty number to indicate how
certain the format recognition was.
AWS Glue provides a set of built-in classifiers, but you can also create custom
classifiers. AWS Glue invokes custom classifiers first, in the order that you specify in
your crawler definition. Depending on the results that are returned from custom
classifiers, AWS Glue might also invoke built-in classifiers. If a classifier
returns certainty=1.0 during processing, it indicates that it's 100 percent certain that it
can create the correct schema. AWS Glue then uses the output of that classifier.
If no classifier returns certainty=1.0, AWS Glue uses the output of the classifier that
has the highest certainty. If no classifier returns a certainty greater than 0.0, AWS
Glue returns the default classification string of UNKNOWN.
Types of Classifiers:
AWS Glue provides several types of classifiers to handle different data formats:
Built-in Classifiers:
If AWS Glue doesn't find a custom classifier that fits the input data format with 100 percent
certainty, it invokes the built-in classifiers in a fixed order. The built-in classifiers return a
result to indicate whether the format matches (certainty=1.0) or does not match
(certainty=0.0). The first classifier that has certainty=1.0 provides the classification string and
schema for a metadata table in your Data Catalog.
These are pre-defined classifiers provided by AWS Glue for common data formats like CSV,
JSON, Avro, and others. AWS Glue uses these classifiers out-of-the-box to recognize these
formats.
XML Classifier: This classifier is used for processing XML files. It extracts
information about the structure of XML files, including elements, attributes, and
namespaces.
JSON Classifier: This classifier is used for JSON files. It helps identify the structure
of JSON data, including nested objects and arrays.
Grok Classifier: This classifier is used for log files that follow a specific pattern
known as "Grok patterns." Grok is a pattern-matching syntax used for parsing log
data.
Avro Classifier: This classifier is used for Apache Avro data serialization format. It
helps AWS Glue understand the schema of Avro data.
OpenCSVSerDe Classifier: This classifier is used for comma-separated values (CSV)
data. It helps identify the columns and data types in CSV files.
Custom Classifiers:
Custom Classifiers in AWS Glue are user-defined rules or patterns that you can create to help
AWS Glue understand the schema and structure of your data. They are particularly useful
when working with data formats that may not be covered by the built-in classifiers provided
by AWS Glue or when you have specific patterns in your data that require custom handling.
Here's how Custom Classifiers work in AWS Glue:
User-Defined Patterns: With Custom Classifiers, you can define your own patterns using
regular expressions or custom code. These patterns are used to identify the structure of your
data.
Handling Non-Standard Formats: If your data is in a format that doesn't conform to
commonly recognized standards (like CSV, JSON, Avro, etc.), a Custom Classifier can be
invaluable in helping AWS Glue understand the data.
Specialized Data Formats: Custom Classifiers are especially useful for dealing with
proprietary or specialized data formats that may not have pre-defined classifiers available.
Usage in Crawlers: When setting up a Crawler in AWS Glue to scan and catalog your data,
you can specify which classifiers to use. If none of the built-in classifiers are suitable for your
data, you can select a Custom Classifier that you've defined.
Schema Inference: Once a Custom Classifier identifies the structure of your data, AWS Glue
can use this information to infer the schema, including details like column names and data
types.
Data Cataloging: The information inferred from the data by the Custom Classifier is then
used to create or update entries in the AWS Glue Data Catalog, which is a central repository
for metadata about your data sources.
In summary, Custom Classifiers in AWS Glue give you the flexibility to define your own
rules for recognizing the structure of your data. This is particularly valuable when working
with non-standard or specialized data formats, allowing you to effectively catalog and process
your data in AWS Glue workflows.
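As a sketch, the boto3 calls below register a custom Grok classifier and attach it to a crawler so that it is tried before the built-in classifiers. The Grok pattern, names, role, and S3 path are hypothetical and only illustrate the shape of the API.

    import boto3

    glue = boto3.client("glue")

    # Hypothetical Grok pattern for a simple application log line.
    glue.create_classifier(
        GrokClassifier={
            "Name": "app-log-classifier",
            "Classification": "app_logs",
            "GrokPattern": "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:message}",
        }
    )

    # Crawlers invoke the listed custom classifiers before the built-in ones.
    glue.create_crawler(
        Name="app-log-crawler",
        Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",   # placeholder role ARN
        DatabaseName="logs_db",
        Targets={"S3Targets": [{"Path": "s3://example-log-bucket/app/"}]},
        Classifiers=["app-log-classifier"],
    )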
Options to create a job from scratch
Visual ETL – author in a visual interface focused on data flow
Code with a script editor – For those familiar with programming and writing ETL
scripts, choose this option to create a new Spark ETL job. Choose the engine (Python
shell, Ray, Spark (Python), or Spark (Scala)), then choose Start fresh or Upload script
to upload an existing script from a local file. If you choose to use the script editor, you
can't use the visual job editor to design or edit your job.
A Spark job is run in an Apache Spark environment managed by AWS Glue. By
default, new scripts are coded in Python.
Trigger
When fired, a trigger can start specified jobs and crawlers. A trigger fires on demand,
based on a schedule, or based on a combination of events.
Triggers allow you to schedule and automate the execution of your ETL (Extract,
Transform, Load) jobs or development endpoints based on specified criteria.
Triggers can be set up to run jobs at specific times, in response to events, or based on
a predefined schedule. For example, you can create a trigger to run a Glue job every
day at a certain time, or you can set up a trigger to launch a job when a new file is
added to an Amazon S3 bucket.
Only two crawlers can be activated by a single trigger. If you want to crawl multiple
data stores, use multiple sources for each crawler instead of running multiple crawlers
simultaneously.
A trigger can exist in one of several states. A trigger is
either CREATED, ACTIVATED, or DEACTIVATED. There are also transitional
states, such as ACTIVATING. To temporarily stop a trigger from firing, you can
deactivate it. You can then reactivate it later.
You can create a trigger for a set of jobs or crawlers based on a schedule. You can
specify constraints, such as the frequency that the jobs or crawlers run, which days
of the week they run, and at what time. These constraints are based on cron. When
you're setting up a schedule for a trigger, consider the features and limitations of
cron. For example, if you choose to run your crawler on day 31 each month, keep
in mind that some months don't have 31 days
Conditional
A trigger that fires when a previous job or crawler or multiple jobs or crawlers
satisfy a list of conditions.
When you create a conditional trigger, you specify a list of jobs and a list of
crawlers to watch. For each watched job or crawler, you specify a status to watch
for, such as succeeded, failed, timed out, and so on. The trigger fires if the watched
jobs or crawlers end with the specified statuses. You can configure the trigger to
fire when any or all of the watched events occur.
On-demand
A trigger that fires when you activate it. On-demand triggers never enter
the ACTIVATED or DEACTIVATED state. They always remain in
the CREATED state.
So that they are ready to fire as soon as they exist, you can set a flag to activate scheduled
and conditional triggers when you create them.
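The sketch below shows one scheduled and one conditional trigger created with boto3; the job names are placeholders. StartOnCreation=True is the flag mentioned above that activates the trigger as soon as it exists.

    import boto3

    glue = boto3.client("glue")

    # Scheduled trigger: run a job every day at 02:00 UTC.
    glue.create_trigger(
        Name="nightly-orders-trigger",
        Type="SCHEDULED",
        Schedule="cron(0 2 * * ? *)",
        Actions=[{"JobName": "orders-etl-job"}],     # placeholder job name
        StartOnCreation=True,
    )

    # Conditional trigger: start a second job only after the first succeeds.
    glue.create_trigger(
        Name="orders-aggregate-trigger",
        Type="CONDITIONAL",
        Predicate={
            "Logical": "AND",
            "Conditions": [{
                "LogicalOperator": "EQUALS",
                "JobName": "orders-etl-job",
                "State": "SUCCEEDED",
            }],
        },
        Actions=[{"JobName": "orders-aggregate-job"}],   # placeholder job name
        StartOnCreation=True,
    )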
Important
Jobs or crawlers that run as a result of other jobs or crawlers completing are referred to
as dependent. Dependent jobs or crawlers are only started if the job or crawler that completes
was started by a trigger. All jobs or crawlers in a dependency chain must be descendants of a
single scheduled or on-demand trigger.
Workflow
AWS Glue workflow is a collection of jobs, triggers, and crawlers that are
orchestrated to perform a data ETL process. Triggers initiate the workflow based on a
schedule or event, jobs perform the actual data processing, and crawlers discover and
catalog the data sources. The Data Catalog provides metadata management, and
connections enable Glue jobs to interact with external data stores.
A workflow in AWS Glue is a directed acyclic graph (DAG) of Glue entities (such as
jobs, triggers, and crawlers) that you can execute on a schedule.
A workflow consists of one or more jobs that are orchestrated to execute in a specific order.
Triggers:Triggers in AWS Glue are events that initiate the execution of a workflow. They can
be scheduled to run at specific times or event based(like the completion of a previous job).
Triggers can be set to run once or on a recurring schedule.
Crawlers:
Crawlers are used to automatically discover the structure of your source data and create
metadata tables in the AWS Glue Data Catalog. This is especially useful when working with
semi-structured or unstructured data.
Crawlers can be used to scan data stored in various data stores, like Amazon S3, Amazon
RDS, etc.
Connections:
A connection in AWS Glue defines the connection information to an external data store. This
includes details like the endpoint, port, username, password, etc.
Connections are used by Glue jobs to connect to data sources and targets.
Data Catalog:
The AWS Glue Data Catalog is a centralized metadata repository that stores information
about the data sources, targets, transformations, and schema definitions used by Glue jobs.
It provides a unified view of your data, making it easier to manage and query.
Workflows:
A workflow in AWS Glue is a logical grouping of jobs, triggers, and crawlers that define the
ETL process. It's represented as a directed acyclic graph (DAG) where nodes represent Glue
entities and edges represent dependencies between them.
Dependencies:
In a Glue workflow, jobs and triggers can have dependencies on other jobs, triggers, or
crawlers. This means that a job or trigger will only execute once its dependencies have
completed successfully.
Schedulers:
AWS Glue provides a scheduling mechanism through triggers. Triggers can be set up to run
jobs and workflows at specific times or based on events.
You can use cron expressions or specific event conditions to trigger jobs.
Error Handling and Monitoring:
AWS Glue provides logging and monitoring capabilities to track the progress and status of
your jobs and workflows. You can view logs in Amazon CloudWatch and set up notifications
for job completion or failure events.
Glue Optimisation Technique
Data Partitioning:
Partitioning involves dividing large datasets into smaller, more manageable pieces
based on certain criteria (e.g., date, region).
This can significantly reduce the amount of data that needs to be processed during
each job run, resulting in faster execution times and lower costs.
AWS Glue supports data partitioning, and it's important to set up partitions
appropriately for your use case.
Dynamic Frame Optimizations:
AWS Glue uses DynamicFrames, which are similar to DataFrames in Apache Spark.
When working with large datasets, you can use the repartition method to control the
number of partitions in a DynamicFrame.
It's important to choose an appropriate number of partitions to balance parallelism and
memory consumption.
Choosing the Right Worker Type and Number:
AWS Glue allows you to choose between different worker types (Standard, G.1X,
G.2X) with varying CPU and memory configurations.
Depending on the nature of your ETL workload, you should select the appropriate
worker type and number to ensure optimal performance.
Tuning DataFrames and DynamicFrames:
When working with DataFrames or DynamicFrames, consider using operations like
select, filter, and join selectively to limit the amount of data that needs to be
processed.
Avoid unnecessary transformations and apply filters early in the pipeline to reduce the
volume of data being processed.
Using Custom Classifiers:
AWS Glue allows you to define custom classifiers for your data sources. These
classifiers can help improve the accuracy and efficiency of schema detection, which is
important for understanding the structure of your data.
Optimizing Data Storage:
Consider using columnar storage formats like Parquet or ORC, which can
significantly reduce storage costs and improve query performance.
Compressing data can also save on storage costs and improve read/write performance.
Cost Management:
AWS Glue can be cost-effective, but it's important to monitor and manage costs.
Consider factors like worker type, number of workers, and data storage options to
optimize costs.
Indexing and Optimization for Data Stores:
If you're loading data into a data store like Amazon Redshift or Amazon RDS, make
sure to apply best practices for indexing and optimizing queries to improve
performance.
Optimize Data Access and Storage:
Partitioning: Divide large datasets into smaller partitions based on relevant columns to
reduce unnecessary data processing.
Columnar File Formats: Use columnar file formats like Parquet or ORC, which enable
efficient data filtering and querying.
Compression: Compress data to reduce storage costs and network bandwidth usage.
S3 Optimized Committers: Utilize EMRFS S3-optimized committer for faster S3
writes and reduced metadata overhead.
Job Bookmarks: Leverage job bookmarks to process only new or updated data,
avoiding repeated processing of unchanged data.
Push-down Predicates: Push filter conditions down to the data source to reduce the
amount of data transferred to Spark.
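As a short sketch of the partitioning and push-down ideas above, the Glue snippet below reads only the partitions selected by a push-down predicate and writes the result back as partitioned Parquet. The database, table, partition columns, and bucket path are assumptions.

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Prune partitions at the source instead of filtering after a full scan.
    filtered = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db",
        table_name="raw_orders",
        push_down_predicate="year == '2024' and month == '06'",
    )

    # Write columnar, partitioned output to reduce storage and future scan costs.
    glue_context.write_dynamic_frame.from_options(
        frame=filtered,
        connection_type="s3",
        connection_options={
            "path": "s3://example-curated-bucket/orders/",
            "partitionKeys": ["year", "month"],
        },
        format="parquet",
    )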
Optimize Memory Usage:
Tune Spark Memory Allocation: Adjust Spark memory allocation based on job
requirements to avoid Out-of-Memory (OOM) errors.
Utilize Off-Heap Memory: Leverage off-heap memory for large datasets and UDFs to
reduce memory pressure on the driver node.
Optimize PySpark UDFs: Avoid buffering large records in off-heap memory with
PySpark UDFs by moving select and filter operations upstream.
Cache Frequently Accessed Data: Cache frequently accessed data in memory to
improve performance.
Optimize Execution Plan:
Shuffle Optimization: Minimize shuffle operations by partitioning data effectively and
using broadcast joins for small join datasets.
Repartitioning: Optimize data distribution across partitions to balance workload and
improve parallelism.
Coalesce: Combine smaller partitions into larger ones to reduce overhead and
improve performance for tasks that operate on larger blocks of data.
Utilize REPARTITION or COALESCE for Spark SQL: Use these query hints to
control data partitioning and improve query execution.
Optimize Workload Management:
Workload Partitioning: Divide complex ETL pipelines into smaller, independent jobs
to improve parallelism and reduce job execution time.
Job Scheduling: Schedule jobs to avoid overloading the cluster and optimize resource
utilization.
Monitoring and Alerting: Set up monitoring and alerting mechanisms to identify
performance bottlenecks and resource usage issues.
Autoscaling: Utilize autoscaling to automatically adjust cluster capacity based on
workload demands.
Leverage AWS Glue Features:
Custom Workflows: Design custom workflows to orchestrate complex data processing
pipelines efficiently.
AWS Glue Spark Libraries: Utilize AWS Glue Spark libraries for common data
processing tasks, such as data quality checks and data deduplication.
AWS Glue Studio: Use AWS Glue Studio for visual job development, debugging, and
optimization.
Glue DataBrew
AWS Glue DataBrew is a visual data preparation tool that enables users to clean and
normalize data without writing any code. Using DataBrew helps reduce the time it takes to
prepare data for analytics and machine learning (ML) by up to 80 percent, compared to
custom developed data preparation.
You can choose from over 250 built-in transformations to combine, pivot, and transpose the
data without writing code. AWS Glue DataBrew also automatically recommends
transformations such as filtering anomalies, correcting invalid, incorrectly classified, or
duplicate data, normalizing data to standard date and time values, or generating aggregates
for analyses. For complex transformations, such as converting words to a common base or
root word, DataBrew provides transformations that use advanced machine learning
techniques such as Natural Language Processing (NLP).
You can group multiple transformations together, save them as recipes, and apply the recipes
directly to newly incoming data. For input data, AWS Glue DataBrew supports commonly
used file formats, such as comma-separated values (.csv), JSON and nested JSON, Apache
Parquet and nested Apache Parquet, and Excel sheets. For output data,
AWS Glue DataBrew supports comma-separated values (.csv), JSON, Apache Parquet,
Apache Avro, Apache ORC and XML
What is AWS Glue?
Describe the AWS Glue Architecture
What are the primary benefits of using AWS Data Brew?
Describe the four ways to create AWS Glue jobs
How does AWS Glue support the creation of no-code ETL jobs?
What is a connection in AWS Glue?
What are the Features of AWS Glue?
When to use a Glue Classifier?
What are the main components of AWS Glue?
What Data Sources are supported by AWS Glue?
What is AWS Glue Data Catalog?
The AWS Glue Schema Registry assists you by allowing you to validate and regulate the lifecycle
of streaming data using registered Apache Avro schemas at no cost. Apache Kafka, Amazon
Managed Streaming for Apache Kafka (MSK), Amazon Kinesis Data Streams, Apache Flink,
Amazon Kinesis Data Analytics for Apache Flink, and AWS Lambda benefit from Schema
Registry.
Validate schemas: Schemas used for data production are checked against schemas in a
central registry when data streaming apps are linked with AWS Glue Schema
Registry, allowing you to regulate data quality centrally.
Safeguard schema evolution: One of eight compatibility modes can be used to specify
criteria for how schemas can and cannot grow.
Improve data quality: Serializers compare data producers' schemas to those in the
registry, enhancing data quality at the source and avoiding downstream difficulties
caused by random schema drift.
Save costs: Serializers transform data into a binary format that can be compressed
before transferring, lowering data transfer and storage costs.
AWS Batch enables you to conduct any batch computing job on AWS with ease and
efficiency, regardless of the work type. AWS Batch maintains and produces computing
resources in your AWS account, giving you complete control over and insight into the
resources in use. AWS Glue is a fully-managed ETL solution that runs your ETL tasks in a
serverless Apache Spark environment. We recommend using AWS Glue for your ETL use
cases. AWS Batch might be a better fit for some batch-oriented use cases, such as ETL use
cases.
A tag is a label you apply to an Amazon Web Services resource. Each tag has a key and an
optional value, both of which are defined by you.
In AWS Glue, you may use tags to organize and identify your resources. Tags can be used to
generate cost accounting reports and limit resource access. You can restrict which users in
your AWS account have authority to create, update, or delete tags if you use AWS Identity
and Access Management.
How do I query metadata in Athena? What is the general workflow for how a
Crawler populates the AWS Glue Data Catalog?
AWS Glue Elastic Views makes it simple to create materialized views that integrate and
replicate data across various data stores without writing proprietary code. AWS Glue Elastic
Views can quickly generate a virtual materialized view table from multiple source data stores
using familiar Structured Query Language (SQL). AWS Glue Elastic Views moves data from
each source data store to a destination datastore and generates a duplicate of it. AWS Glue
Elastic Views continuously monitors data in your source data stores, and automatically
updates materialized views in your target data stores, ensuring that data accessed through the
materialized view is always up-to-date.
Use AWS Glue Elastic Views to aggregate and continuously replicate data across several
data stores in near-real-time. This is frequently the case when implementing new application
functionality requiring data access from one or more existing data stores. For example, a
company might use a customer relationship management (CRM) application to keep track of
customer information and an e-commerce website to handle online transactions. The data
would be stored in these apps or more data stores. The firm is now developing a new custom
application that produces and displays special offers for active website visitors.
AWS Glue DataBrew is a visual data preparation solution that allows data analysts and data
scientists to prepare data without writing code, using an interactive, point-and-click graphical
interface. You can simply visualize, clean, and normalize terabytes, even petabytes, of data
directly from your data lake, data warehouses, and databases, including Amazon S3, Amazon
Redshift, Amazon Aurora, and Amazon RDS, with Glue DataBrew.
Who can use AWS Glue DataBrew?
AWS Glue DataBrew is designed for users that need to clean and standardize data before
using it for analytics or machine learning. The most common users are data analysts and data
scientists. Business intelligence analysts, operations analysts, market intelligence analysts,
legal analysts, financial analysts, economists, quants, and accountants are examples of
employment functions for data analysts. Materials scientists, bioanalytical scientists, and
scientific researchers are all examples of employment functions for data scientists.
You can combine, pivot, and transpose data using over 250 built-in transformations without
writing code. AWS Glue DataBrew also suggests transformations such as filtering anomalies,
rectifying erroneous, wrongly classified, duplicate data, normalizing data to standard date and
time values, or generating aggregates for analysis automatically. Glue DataBrew enables
transformations that leverage powerful machine learning techniques such as Natural
Language Processing for complex transformations like translating words to a common base
or root word (NLP). Multiple transformations can be grouped, saved as recipes, and applied
straight to incoming data.
AWS Glue DataBrew accepts comma-separated values (.csv), JSON and nested JSON,
Apache Parquet and nested Apache Parquet, and Excel sheets as input data types. Comma-
separated values (.csv), JSON, Apache Parquet, Apache Avro, Apache ORC, and XML are
all supported as output data formats in AWS Glue DataBrew.
Do we need to use AWS Glue Data Catalog or AWS Lake Formation to use AWS
Glue DataBrew?
No. Without using the AWS Glue Data Catalog or AWS Lake Formation, you can use AWS
Glue DataBrew. DataBrew users can pick data sets from their centralized data catalog using
the AWS Glue Data Catalog or AWS Lake Formation.
What is the best practice for managing the credentials required by a Glue
connection?
The best practice is for the credentials to be stored & accessed securely by
leveraging AWS Systems Manager (SSM), AWS Secrets Manager or Amazon Key
Management Service (KMS)
Which streaming data sources does AWS Glue support?
AWS Glue supports Amazon Kinesis Data Streams, Apache Kafka, and Amazon Managed
Streaming for Apache Kafka (Amazon MSK).
What is an interactive session in AWS Glue and what are its benefits?
Interactive sessions in AWS Glue are essentially on-demand serverless Spark runtime
environments that allow rapid build and test of data preparation and analytics
applications. Interactive sessions can be used via the visual interface, AWS command
line or the API.
Using interactive sessions, you can author and test your scripts as Jupyter notebooks.
Glue supports a comprehensive set of Jupyter magics allowing developers to develop
rich data preparation or transformation scripts.
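For example, a Glue interactive session notebook often starts with a few configuration magics followed by ordinary PySpark code, roughly as sketched below. The magic names reflect the Glue documentation at the time of writing and may change; the database and table names are placeholders.

    # Cell 1: configure the session before any code runs (values are examples).
    %idle_timeout 30
    %glue_version 4.0
    %worker_type G.1X
    %number_of_workers 2

    # Cell 2: regular PySpark/Glue code; running it starts the serverless session.
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glue_context = GlueContext(SparkContext.getOrCreate())
    df = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="raw_orders").toDF()
    df.show(5)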
Explain why and when you would use AWS Glue compared to other options to
set up data pipelines
AWS Glue makes it easy to move data between data stores and as such, can be used in a
variety of data integration scenarios, including:
Data lake build & consolidation: Glue can extract data from multiple sources and
load the data into a central data lake powered by something like Amazon S3. (Related
Reading: Building Data Lakes on AWS using S3 and Glue)
Data migration: For large migration and modernization initiatives, Glue can help
move data from a legacy data store to a modern data lake or data warehouse.
Data transformation: Glue provides a visual workflow to transform data using a
comprehensive built-in transformation library or custom transformation using
PySpark
Data cataloging: Glue can assist data governance initiatives since it supports
automatic metadata cataloging across your data sources and targets, making it easy to
discover and understand data relationships.
When compared to other options for setting up data pipelines, such as Apache NiFi or
Apache Airflow, AWS Glue is typically a good choice if:
You want a fully managed solution: With Glue, you don’t have to worry about
setting up, patching, or maintaining any infrastructure.
Your data sources are primarily in AWS: Glue integrates natively with many AWS
services, such as S3, Redshift, and RDS.
You are constrained by programming skills availability: Glue’s visual workflow
makes it easy to create data pipelines in a no-code or low-code code way.
You need flexibility and scalability: Glue can scale automatically to meet demand
and can handle petabyte-scale data.
Can you highlight the role of AWS Glue in big data environments?
AWS Glue plays a pivotal role in big data environments as it provides the ability to
handle, process and transform large data sets in distributed and parallel environments.
AWS Glue is engineered for large-scale data processing. It can scale horizontally,
providing the capability to process petabytes of data efficiently and quickly.
AWS Glue is highly beneficial in a big data environment due to its serverless
architecture and integration capabilities with other AWS services.
AWS Glue is a fully managed ETL (extract, transform, and load) service that
makes it easy for customers to prepare and load their data for analytics. AWS
EMR, on the other hand, is a service that makes it easy to process large
amounts of data quickly and efficiently.
AWS Glue and EMR are both used for data processing, but they differ in how
they process data and in their typical use cases.
AWS Glue can be easily used to process both structured as well as
unstructured data while AWS EMR is typically suited for processing
structured or semi-structured data.
AWS Glue can automatically discover and categorize the data. AWS EMR
does not have that capability.
AWS Glue can be used to process streaming data or data in near-real-time,
while AWS EMR is typically used for scheduled batch processing.
Usage of AWS Glue is charged per DPU hour while EMR is charged per
underlying EC2 instance hour.
AWS Glue is easier to get started than EMR as Glue does not require
developers to have prior knowledge of MapReduce or Hadoop.
What are some ways to orchestrate glue jobs as part of a larger ETL flow?
Glue Workflows and AWS Step Functions are two ways to orchestrate glue jobs as part of
large ETL flows.
Can AWS Glue be used to convert log files into structured data?
Yes, AWS Glue is suitable for converting log files into structured data. Using the
AWS Glue Visual Canvas or by defining a custom Glue job, we can define custom
data transformations to structure log file data.
Glue makes it possible to aggregate logs from various sources into a common data
lake that makes it easy to access and maintain these logs.
Our company’s spend on AWS Glue is increasing rapidly. How can we optimize
our AWS Glue spend?
Cost optimization is a critical aspect of running workloads in the cloud and leveraging cloud
services, including AWS Glue. Ongoing cost optimization ensures we are making the most of
our cloud investments while reducing waste. When optimizing AWS Glue spend, the
following factors should be considered:
1. Use Glue Development Endpoints sparingly as these can get costly quickly.
2. Choose the right DPU allocation based on job complexity and requirements.
3. Optimize job concurrency
4. Use Glue job bookmarks to track processed data, allowing Glue to skip
previously processed records during incremental runs, thus reducing cost for
recurring jobs.
5. Some additional factors such as leveraging Glue Data Catalog, minimizing
costly transformations, etc.
EMR (Elastic MapReduce)
Amazon EMR (Elastic MapReduce) is a cloud-based big data platform provided by Amazon
Web Services (AWS). It allows for the processing of vast amounts of data quickly and cost-
effectively using popular frameworks such as Apache Hadoop, Apache Spark, Presto, and
Hive.
Scalability: EMR allows you to easily scale your cluster up or down based on your
processing needs. You can start with a small cluster and scale it to thousands of nodes if
necessary.
Managed Service: AWS EMR is a fully managed service, which means that AWS handles
the underlying infrastructure for you. This includes provisioning, monitoring, and managing
the compute resources.
Supported Frameworks:
Hadoop: EMR supports Apache Hadoop, which is an open-source framework for
processing and storing large datasets.
Spark: Apache Spark is another popular framework for big data processing that is
supported on EMR. It provides a more flexible and faster alternative to Hadoop's
MapReduce.
Presto: EMR supports Presto, an open-source distributed SQL query engine designed
for interactive analytics.
Hive and Pig: EMR also supports Hive and Pig, which are high-level languages for
querying and analyzing data in Hadoop.
Cost Management:
EMR provides features for cost optimization, such as the ability to use spot instances (which
are spare AWS capacity at a lower price) for cost savings.
Application Ecosystem:
EMR supports a wide range of applications and libraries that can be used for data processing,
analysis, and visualization.
EMR Notebooks:
EMR provides a notebook interface that allows you to create and manage Jupyter notebooks
for interactive data analysis and exploration.
EMR Studio:
EMR Studio is an integrated development environment for EMR that makes it easy to
develop, visualize, and debug big data applications.
Overview of Amazon EMR
The central component of Amazon EMR is the cluster. A cluster is a collection of Amazon
Elastic Compute Cloud (Amazon EC2) instances. Each instance in the cluster is called
a node. Each node has a role within the cluster, referred to as the node type. Amazon EMR
also installs different software components on each node type, giving each node a role in a
distributed application like Apache Hadoop.
The following diagram represents a cluster with one primary node and four core nodes.
Master Node:
The master node is responsible for coordinating the overall workflow of the cluster.
It manages the distribution of tasks to the core and task nodes and monitors their
health and status.
The master node hosts the Hadoop Distributed File System (HDFS) NameNode,
which manages the metadata for the Hadoop cluster's data.
Core Nodes:
Core nodes store the Hadoop Distributed File System (HDFS) data blocks and
participate in data processing.
They run both task and storage services, contributing to both computation and data
storage.
The number of core nodes determines the amount of storage capacity in the cluster, as
well as the parallel processing capability.
Task Nodes:
Task nodes are responsible only for processing data and running tasks assigned by the
master node.
They do not store persistent data, and their primary purpose is to contribute
computational power to the cluster.
Task nodes are useful for handling transient workloads or for scaling the cluster's
processing capacity without increasing storage.
When you run a cluster on Amazon EMR, you have several options as to how you specify the
work that needs to be done.
Provide the entire definition of the work to be done in functions that you specify as
steps when you create a cluster. This is typically done for clusters that process a set
amount of data and then terminate when processing is complete.
Create a long-running cluster and use the Amazon EMR console, the Amazon EMR
API, or the AWS CLI to submit steps, which may contain one or more jobs
Create a cluster, connect to the primary node and other nodes as required using SSH,
and use the interfaces that the installed applications provide to perform tasks and
submit queries, either scripted or interactively.
Processing data
When you launch your cluster, you choose the frameworks and applications to install for your
data processing needs. To process data in your Amazon EMR cluster, you can submit jobs or
queries directly to installed applications, or you can run steps in the cluster.
Submitting jobs directly to applications
You can submit jobs and interact directly with the software that is installed in your Amazon
EMR cluster. To do this, you typically connect to the primary node over a secure connection
and access the interfaces and tools that are available for the software that runs directly on
your cluster.
You can submit one or more ordered steps to an Amazon EMR cluster. Each step is a unit of
work that contains instructions to manipulate data for processing by software installed on the
cluster.
Generally, when you process data in Amazon EMR, the input is data stored as files in your
chosen underlying file system, such as Amazon S3 or HDFS. This data passes from one step
to the next in the processing sequence. The final step writes the output data to a specified
location, such as an Amazon S3 bucket.
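For instance, a Spark step can be submitted to a running cluster with boto3 as sketched below; the cluster ID and script location are placeholders. ActionOnFailure controls what happens to the remaining steps if this one fails.

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.add_job_flow_steps(
        JobFlowId="j-EXAMPLECLUSTERID",              # placeholder cluster ID
        Steps=[{
            "Name": "daily-orders-aggregation",
            "ActionOnFailure": "CANCEL_AND_WAIT",    # cancel remaining steps on failure
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://example-scripts-bucket/aggregate_orders.py",  # placeholder script
                ],
            },
        }],
    )
    print(response["StepIds"])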
The following diagram represents the step sequence and change of state for the steps as they
are processed.
If a step fails during processing, its state changes to FAILED. You can determine what
happens next for each step. By default, any remaining steps in the sequence are set
to CANCELLED and do not run if a preceding step fails. You can also choose to ignore the
failure and allow remaining steps to proceed, or to terminate the cluster immediately.
The following diagram represents the step sequence and default change of state when a step
fails during processing.
The following diagram represents the lifecycle of a cluster, and how each stage of the
lifecycle maps to a particular cluster state.
In Amazon EMR (Elastic MapReduce), instance groups and instance fleets are concepts
related to managing and configuring the EC2 instances that make up your EMR cluster.
These features allow you to define and manage the composition and behavior of the instances
in your cluster.
1. Instance Groups:
An instance group is a collection of Amazon EC2 instances within an EMR
cluster that share the same instance type and the same configuration.
There are three types of instance groups in EMR:
Core Instance Group: Instances in this group store Hadoop Distributed File
System (HDFS) data and run data processing tasks.
Master Instance Group: This group contains the master node, which
manages the distribution of tasks across the core and task nodes.
Task Instance Group: These instances are used to perform tasks and
are dynamically added or removed based on the load.
Each instance group is associated with an Amazon EC2 instance type, which
defines the computing resources (CPU, memory) available to instances in that
group.
You can specify the number of instances, instance type, and other
configurations for each instance group when you create an EMR cluster.
2. Instance Fleets:
Instance fleets provide a more flexible and efficient way to provision and
manage the instances in your EMR cluster compared to instance groups.
An instance fleet is a set of EC2 instance types and weighted capacity that
define the desired composition of instances in your cluster. It allows you to
specify multiple instance types and their weights, and EMR automatically
provisions and manages the instances based on your specifications.
This helps in optimizing costs and improving fault tolerance by diversifying
across multiple instance types.
Unlike instance groups, instance fleets allow EMR to automatically provision
instances based on the target capacity and instance type weights you define.
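The boto3 sketch below launches a small cluster with instance fleets, mixing On-Demand and Spot capacity across two instance types for the core fleet. The release label, roles, and instance types are assumptions chosen only to illustrate the fleet configuration.

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    cluster = emr.run_job_flow(
        Name="fleet-demo-cluster",
        ReleaseLabel="emr-6.15.0",                 # assumed EMR release
        Applications=[{"Name": "Spark"}],
        ServiceRole="EMR_DefaultRole",             # assumed service role
        JobFlowRole="EMR_EC2_DefaultRole",         # assumed EC2 instance profile
        Instances={
            "KeepJobFlowAliveWhenNoSteps": True,
            "InstanceFleets": [
                {
                    "InstanceFleetType": "MASTER",
                    "TargetOnDemandCapacity": 1,
                    "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}],
                },
                {
                    "InstanceFleetType": "CORE",
                    "TargetOnDemandCapacity": 2,
                    "TargetSpotCapacity": 4,        # diversify with Spot capacity
                    "InstanceTypeConfigs": [
                        {"InstanceType": "m5.xlarge", "WeightedCapacity": 1},
                        {"InstanceType": "r5.xlarge", "WeightedCapacity": 1},
                    ],
                },
            ],
        },
    )
    print(cluster["JobFlowId"])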
In summary, instance groups and instance fleets in AWS EMR provide mechanisms to
manage the composition and behavior of EC2 instances within your cluster. Instance groups
offer a more traditional and fixed approach, while instance fleets provide a more dynamic and
flexible way to manage instances, allowing for better cost optimization and fault tolerance.
The choice between them depends on your specific requirements and preferences.
Features of EMR
Amazon Elastic MapReduce (EMR) is a cloud-based big data platform provided by Amazon
Web Services (AWS). It simplifies the processing of large amounts of data by using popular
frameworks such as Apache Hadoop, Apache Spark, Apache Hive, Apache HBase, and
others. Here are some of the key features of AWS EMR:
1. Ease of Use:
AWS EMR provides a web-based console that makes it easy to launch and
manage clusters.
It integrates with other AWS services, allowing seamless interaction with
storage, databases, and other resources.
2. Flexibility:
Supports a variety of popular big data frameworks, including Apache Hadoop,
Apache Spark, Apache Hive, Apache HBase, and more.
Allows you to run custom applications and frameworks on EMR clusters.
3. Cluster Configuration:
Provides the ability to customize the size and configuration of the cluster
based on your specific workload requirements.
Allows both on-demand and spot instances to optimize costs.
4. Security:
Offers integration with AWS Identity and Access Management (IAM) for
access control.
Supports encryption of data at rest and in transit using AWS Key Management
Service (KMS).
Enables fine-grained access controls for data stored on Amazon S3.
5. Managed Scaling:
Automatically adjusts the size of the cluster based on the workload, helping to
optimize performance and costs.
Supports manual resizing of clusters for specific use cases.
6. Integration with AWS Services:
Seamless integration with other AWS services, such as Amazon S3, Amazon
DynamoDB, Amazon RDS, and more.
EMR can read and write data directly to and from Amazon S3, making it easy
to work with data stored in S3.
7. Logging and Monitoring:
Provides detailed logging and monitoring capabilities through AWS
CloudWatch and other tools.
Allows you to configure and monitor application-specific metrics.
8. Data Lake and Data Catalog Integration:
Supports integration with AWS Glue Data Catalog, making it easier to
discover and manage metadata.
Allows you to build data lakes on Amazon S3 and query data using
EMR.
9. Application Ecosystem:
Supports a wide range of applications and libraries within the Hadoop
and Spark ecosystems.
Provides pre-configured Amazon Machine Images (AMIs) for popular
big data frameworks.
10. Cost Optimization:
Supports spot instances for cost-effective cluster provisioning.
Allows you to use Reserved Instances to reduce costs for long-running
clusters.
11. Multi-Step Workflows:
Supports the creation of multi-step workflows using Apache Oozie,
making it easy to manage complex data processing workflows.
Optimization techniques for AWS EMR:
1. Instance Types and Sizes:
Choose appropriate EC2 instance types and sizes based on the nature of your workload.
Instances with more CPU or memory might be more suitable for certain tasks.
Utilize spot instances for cost savings, but be aware of the possibility of interruption.
2. Cluster Sizing:
Adjust the number of instances in your EMR cluster based on the size of your data and the
complexity of your processing tasks.
Use AWS Auto Scaling to automatically adjust the size of your cluster based on demand.
3. Instance Groups:
Leverage instance groups to allocate and manage resources efficiently.
Use core and task instance groups appropriately. Core nodes store HDFS data, while task
nodes are for processing only.
4. Instance Fleets:
Consider using instance fleets for better control over instance types and pricing models.
5. Bootstrap Actions:
Use bootstrap actions to install additional software, libraries, or configurations needed for
your specific use case.
6. Spot Instances:
Take advantage of spot instances to reduce costs, but be prepared for potential interruptions.
Consider using a mix of on-demand and spot instances for a balance between cost and
reliability.
7. EMR Release Version:
Keep your EMR cluster up to date with the latest release versions to benefit from
performance improvements, bug fixes, and new features.
8. Storage Optimization:
Optimize storage configurations, including the choice of EBS volumes and S3 storage
options.
Use instance store volumes for temporary data to avoid unnecessary EBS costs.
9. Data Compression:
Compress your data to reduce storage costs and improve data transfer efficiency. Choose
appropriate compression codecs based on your data and processing requirements.
10. Task Instance Groups:
Use task instance groups for transient and ephemeral workloads to further reduce costs.
11. Custom AMIs (Amazon Machine Images):
Create custom AMIs with pre-installed software and configurations to reduce cluster startup
times.
12. Monitoring and Logging:
Use AWS CloudWatch to monitor cluster performance, set up alarms, and identify
bottlenecks.
Enable logging to Amazon S3 for EMR cluster logs to facilitate debugging and optimization.
13. Tune Spark and Hadoop Configurations:
Adjust Spark and Hadoop configurations based on your specific workloads to optimize
performance.
14. Use EMR Notebooks:
Consider using EMR notebooks for interactive data exploration and analysis.
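As a hedged illustration of two of the techniques above (spot task nodes and Spark tuning via configurations), the create-cluster fragment below adds a spot task instance group and a spark-defaults classification; the bid price, memory settings, and instance types are placeholders you would tune for your own workload.
# Sketch: spot task group plus Spark executor tuning at cluster creation.
aws emr create-cluster \
  --name "tuned-cluster" \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --instance-groups \
      InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge \
      InstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge \
      InstanceGroupType=TASK,InstanceCount=4,InstanceType=m5.xlarge,BidPrice=0.10 \
  --configurations '[{"Classification":"spark-defaults","Properties":{"spark.executor.memory":"4g","spark.executor.cores":"2"}}]' \
  --use-default-roles
# BidPrice marks the task group as Spot; omit it to keep the group On-Demand.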
EMR Serverless
Amazon EMR Serverless is a new deployment option for Amazon EMR. EMR Serverless
provides a serverless runtime environment that simplifies the operation of analytics
applications that use the latest open source frameworks, such as Apache Spark and Apache
Hive. With EMR Serverless, you don’t have to configure, optimize, secure, or operate
clusters to run applications with these frameworks.
EMR Serverless helps you avoid over- or under-provisioning resources for your data
processing jobs. EMR Serverless automatically determines the resources that the application
needs, gets these resources to process your jobs, and releases the resources when the jobs
finish. For use cases where applications need a response within seconds, such as interactive
data analysis, you can pre-initialize the resources that the application needs when you create
the application.
With EMR Serverless, you'll continue to get the benefits of Amazon EMR, such as open
source compatibility, concurrency, and optimized runtime performance for popular
frameworks.
EMR Serverless is suitable for customers who want ease in operating applications using open
source frameworks. It offers quick job startup, automatic capacity management, and
straightforward cost controls.
Concepts
In this section, we cover EMR Serverless terms and concepts that appear throughout our
EMR Serverless User Guide.
Release version
An Amazon EMR release is a set of open-source applications from the big data ecosystem.
Each release includes different big data applications, components, and features that you select
for EMR Serverless to deploy and configure so that they can run your applications. When you
create an application, you must specify its release version. Choose the Amazon EMR release
version and the open source framework version that you want to use in your application.
Application
With EMR Serverless, you can create one or more EMR Serverless applications that use open
source analytics frameworks. To create an application, you must specify the following
attributes:
The Amazon EMR release version for the open source framework version that you want to
use. To determine your release version, see Amazon EMR Serverless release versions.
The specific runtime that you want your application to use, such as Apache Spark or Apache
Hive.
After you create an application, you can submit data-processing jobs or interactive requests to
your application.
Each EMR Serverless application runs on a secure Amazon Virtual Private Cloud (VPC)
strictly apart from other applications. Additionally, you can use AWS Identity and Access
Management (IAM) policies to define which users and roles can access the application. You
can also specify limits to control and track usage costs incurred by the application.
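A hedged sketch of creating a Spark application from the AWS CLI follows; the application name and release label are placeholders and should match a supported EMR Serverless release.
# Sketch: create an EMR Serverless application for Spark workloads.
aws emr-serverless create-application \
  --name my-spark-app \
  --type SPARK \
  --release-label emr-6.15.0
# The response includes an applicationId that later job runs reference.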
Job run
A job run is a request submitted to an EMR Serverless application that the application
asynchronously executes and tracks through completion. Examples of jobs include a HiveQL
query that you submit to an Apache Hive application, or a PySpark data processing script that
you submit to an Apache Spark application. When you submit a job, you must specify a
runtime role, authored in IAM, that the job uses to access AWS resources, such as Amazon
S3 objects. You can submit multiple job run requests to an application, and each job run can
use a different runtime role to access AWS resources. An EMR Serverless application starts
executing jobs as soon as it receives them and runs multiple job requests concurrently. To
learn more about how EMR Serverless runs jobs, see Running jobs.
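As a rough sketch of submitting a job run from the AWS CLI (the application ID, runtime role ARN, and script location below are placeholders):
# Sketch: submit a PySpark job run with its IAM runtime role.
aws emr-serverless start-job-run \
  --application-id 00example123456 \
  --execution-role-arn arn:aws:iam::123456789012:role/EMRServerlessJobRole \
  --name nightly-etl \
  --job-driver '{"sparkSubmit": {"entryPoint": "s3://my-bucket/scripts/etl.py","sparkSubmitParameters": "--conf spark.executor.memory=4g"}}'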
Workers
An EMR Serverless application internally uses workers to execute your workloads. The
default sizes of these workers are based on your application type and Amazon EMR release
version. When you schedule a job run, you can override these sizes.
When you submit a job, EMR Serverless computes the resources that the application needs
for the job and schedules workers. EMR Serverless breaks down your workloads into tasks,
downloads images, provisions and sets up workers, and decommissions them when the job
finishes. EMR Serverless automatically scales workers up or down based on the workload
and parallelism required at every stage of the job. This automatic scaling removes the need
for you to estimate the number of workers that the application needs to run your workloads.
Pre-initialized capacity
EMR Serverless provides a pre-initialized capacity feature that keeps workers initialized and
ready to respond in seconds. This capacity effectively creates a warm pool of workers for an
application. To configure this feature for each application, set the initial-capacity parameter
of an application. When you configure pre-initialized capacity, jobs can start immediately so
that you can implement iterative applications and time-sensitive jobs. To learn more about
pre-initialized workers, see Configuring an application.
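A hedged sketch of setting pre-initialized capacity on an existing application; the worker counts and sizes are placeholders, and the exact JSON shape of the initial-capacity parameter should be checked against the EMR Serverless documentation for your CLI version.
# Sketch: keep one warm Spark driver and four warm executors ready.
aws emr-serverless update-application \
  --application-id 00example123456 \
  --initial-capacity '{"DRIVER": {"workerCount": 1,"workerConfiguration": {"cpu": "2vCPU","memory": "4GB"}},"EXECUTOR": {"workerCount": 4,"workerConfiguration": {"cpu": "4vCPU","memory": "8GB"}}}'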
EMR Studio
EMR Studio is the user console that you can use to manage your EMR Serverless
applications. If an EMR Studio doesn't exist in your account when you create your first EMR
Serverless application, we automatically create one for you. You can access EMR Studio
either from the Amazon EMR console, or you can turn on federated access from your identity
provider (IdP) through IAM or IAM Identity Center. When you do this, users can access
Studio and manage EMR Serverless applications without direct access to the Amazon EMR
console.
1. Explain the architecture of Amazon Elastic MapReduce (EMR) and how it enables
effective data processing and analysis?
Amazon EMR architecture consists of a cluster with one master node, core nodes, and task
nodes. The master node manages the cluster, while core nodes store data in Hadoop
Distributed File System (HDFS) and run tasks. Task nodes only execute tasks without storing
data.
EMR uses Apache Hadoop, an open-source framework that processes large datasets across
distributed clusters. It leverages MapReduce programming model for parallel processing,
enabling efficient data analysis. Additionally, EMR supports other frameworks like Spark,
Hive, and Presto for diverse analytical needs.
EMR integrates with AWS services such as S3, DynamoDB, and Redshift, facilitating
seamless data storage and retrieval. Autoscaling adjusts the number of nodes based on
workload, optimizing resource usage and cost. Spot Instances further reduce costs by utilizing
spare EC2 capacity.
Security is ensured through encryption, IAM roles, VPCs, and private subnets. Monitoring
and logging are available via Amazon CloudWatch and EMR Console, allowing performance
tracking and issue resolution.
2. How does Amazon EMR differ from traditional Hadoop and Spark clusters? What are
the specific advantages and limitations of using Amazon EMR?
Amazon EMR differs from traditional Hadoop and Spark clusters by providing a managed,
scalable, and cost-effective service for big data processing. It simplifies cluster setup,
management, and scaling while integrating with other AWS services.
Advantages of Amazon EMR include:
1. Easy setup: Quick cluster creation with pre-configured applications.
2. Scalability: Automatic resizing based on workload demands.
3. Cost-effectiveness: Pay-as-you-go pricing model and Spot Instances support.
4. Integration: Seamless integration with AWS ecosystem (S3, DynamoDB, etc.).
5. Security: Built-in security features like encryption and IAM roles.
Limitations of Amazon EMR are:
1. Vendor lock-in: Limited to the AWS environment.
2. Customization constraints: Less flexibility compared to self-managed clusters.
3. Latency: Potential latency when accessing data stored outside the cluster (for example, in
S3 via EMRFS) rather than in local HDFS.
3. How do you optimize the performance of an EMR job? What factors should be
considered?
To optimize EMR job performance, consider these factors:
1. Cluster Configuration: Choose appropriate instance types and sizes based on workload
requirements. Utilize Spot Instances for cost savings.
2. Data Storage: Use HDFS or Amazon S3 with consistent view enabled to store data.
Optimize S3 read/write operations using partitioning and compression techniques.
3. Task Distribution: Balance the number of mappers and reducers according to input data
size and processing complexity. Configure speculative execution to handle slow tasks.
4. Tuning Parameters: Adjust Hadoop/Spark configurations such as memory allocation,
garbage collection settings, and parallelism levels to improve resource utilization.
5. Monitoring & Logging: Enable CloudWatch metrics and logs to identify bottlenecks and
track performance improvements over time.
6. Code Optimization: Profile application code to find inefficiencies and use efficient
algorithms and data structures.
4. Describe the process of resizing an Amazon EMR cluster. What are some best practices
to maintain high availability and optimal performance while resizing a cluster?
Resizing an Amazon EMR cluster involves modifying the number of instances in the cluster
to meet changing workload demands. There are two types of resizing: scale-out (adding
instances) and scale-in (removing instances).
To resize a cluster, use the AWS Management Console, CLI, or SDKs. First, identify the
instance groups you want to modify, then change their target capacities accordingly.
Best practices for maintaining high availability and optimal performance while resizing
include:
1. Use Auto Scaling policies to automatically adjust cluster size based on
predefined CloudWatch metrics.
2. Resize during periods of low demand to minimize impact on running jobs.
3. Monitor key performance indicators (KPIs) like CPU utilization, memory
usage, and HDFS capacity to determine when resizing is necessary.
4. Opt for uniform instance groups with similar configurations to simplify
management and ensure consistent performance.
5. Test different cluster sizes and configurations to find the best balance
between cost and performance.
6. Consider using Spot Instances for cost savings but be prepared for potential
interruptions.
7. Implement data backup strategies to prevent data loss during resizing
operations.
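For reference, a manual resize from the AWS CLI looks roughly like this (the cluster and instance-group IDs are placeholders):
# Sketch: look up the instance group, then change its target instance count.
aws emr list-instance-groups --cluster-id j-2EXAMPLE1234
aws emr modify-instance-groups \
  --cluster-id j-2EXAMPLE1234 \
  --instance-groups InstanceGroupId=ig-EXAMPLE5678,InstanceCount=6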
5. Discuss the use of EMR File System (EMRFS) in Amazon EMR. What benefits does it
provide when compared to HDFS?
EMR File System (EMRFS) is an implementation of the Hadoop file system interface that
allows Amazon EMR clusters to utilize data stored in Amazon S3. It provides several
benefits compared to traditional HDFS:
1. Scalability: EMRFS can scale horizontally, allowing for increased storage and throughput
without impacting cluster performance.
2. Durability: Data stored in S3 has 11 nines durability, reducing the risk of data loss.
3. Cost-effectiveness: Users pay only for the storage they use in S3, avoiding over-
provisioning costs associated with HDFS.
4. Flexibility: EMRFS enables sharing of data across multiple EMR clusters or other AWS
services, simplifying data management.
5. Consistency: EMRFS offers consistent view feature, ensuring read-after-write consistency
for objects written by EMRFS or other S3 clients.
6. Security: Integration with AWS Identity and Access Management (IAM) allows granular
access control to data stored in S3.
6. Explain the role of spot instances in Amazon EMR, and how it can be used to achieve
cost-effective resource allocation.
Spot instances in Amazon EMR play a crucial role in cost-effective resource allocation by
allowing users to bid on unused EC2 capacity at a lower price than On-Demand instances.
When the bid price exceeds the current Spot price, the instances are provisioned and added to
the EMR cluster.
To achieve cost-effective resource allocation, users can specify a percentage of their core and
task nodes as spot instances during cluster creation or modify an existing cluster’s instance
groups configuration. By doing so, they leverage the cost savings from spot instances while
maintaining the stability of the cluster with On-Demand instances for critical components like
master nodes.
Additionally, users can set up instance fleets to define multiple instance types and bidding
strategies, enabling EMR to automatically provision the most cost-effective combination of
instances based on available capacity and user-defined constraints.
However, it is essential to consider that spot instances may be terminated when the Spot price
rises above the bid price or due to capacity constraints. To mitigate this risk, users should
implement checkpointing and data replication strategies to ensure minimal impact on job
progress and data integrity.
7. What are the different security configurations available in Amazon EMR, and how can
the security of an Amazon EMR cluster be improved?
Amazon EMR security configurations include:
1. Identity and Access Management (IAM): Define roles for EMR clusters, EC2 instances,
and service access control.
2. Encryption: Use AWS Key Management Service (KMS) to encrypt data at rest (HDFS, S3)
and in transit (Spark, MapReduce).
3. Network Isolation: Utilize Amazon VPCs, subnets, and security groups to isolate resources
and restrict traffic.
4. Logging and Monitoring: Enable CloudTrail, CloudWatch, and EMRFS audit logs for
tracking user activities and cluster performance.
5. Authentication: Integrate with Kerberos or LDAP for secure authentication of users and
services.
6. Authorization: Implement Apache Ranger or similar tools for fine-grained access control
over Hadoop components.
To improve the security of an Amazon EMR cluster:
– Regularly review and update IAM policies, ensuring least privilege access.
– Enforce encryption for sensitive data and communication channels.
– Limit network exposure by using private subnets and strict security group rules.
– Monitor logs for suspicious activity and set up alerts for potential threats.
– Keep software versions updated and apply security patches promptly.
8. Describe the different types of EMR clusters (transient and long-running) and their
appropriate use cases.
Transient and long-running EMR clusters serve distinct purposes in data processing.
Transient clusters are temporary, created for specific tasks like batch processing or ETL jobs.
They’re cost-effective as they auto-terminate upon job completion, minimizing resource
usage. Use cases include log analysis, recommendation engines, and data transformations.
Long-running clusters persist even after job completion, suitable for interactive analytics or
streaming applications. They enable continuous data ingestion and real-time processing. Use
cases encompass real-time fraud detection, IoT data processing, and ad-hoc querying using
tools like Apache Zeppelin or Jupyter notebooks.
9. Can you explain how Amazon EMR supports the use of custom machine learning (ML)
algorithms? What is the process for integrating custom ML libraries into an EMR cluster?
Amazon EMR supports custom ML algorithms by allowing users to install and configure
additional libraries, frameworks, or applications on the cluster. This flexibility enables
integration of custom ML libraries into an EMR cluster.
To integrate custom ML libraries, follow these steps:
1. Create a bootstrap action script that installs and configures the required dependencies for
your custom library.
2. Upload the script to Amazon S3.
3. Launch an EMR cluster with the specified bootstrap action using AWS Management
Console, CLI, or SDKs.
4. Develop your ML application using the custom library and upload it to S3.
5. Add a step in the EMR cluster to execute your ML application, referencing the uploaded
code in S3.
6. Monitor the progress of your application through the EMR console or logs stored in S3.
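Step 5 above can be done from the AWS CLI roughly as follows (the cluster ID and S3 path are placeholders):
# Sketch: add a Spark step that runs a custom ML training script from S3.
aws emr add-steps \
  --cluster-id j-2EXAMPLE1234 \
  --steps Type=Spark,Name="CustomMLTraining",ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,s3://my-bucket/ml/train_model.py]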
10. How does Amazon EMR differ from AWS Data Pipeline, and when would you choose one
over the other?
Amazon EMR is a managed Hadoop framework that simplifies big data processing, while
Amazon Data Pipeline is a web service for data movement and transformation. Key
differences include:
1. Purpose: EMR focuses on distributed data processing using Hadoop ecosystem tools,
whereas Data Pipeline orchestrates data workflows across various AWS services.
2. Scalability: EMR automatically scales underlying infrastructure, while Data Pipeline
requires manual scaling of resources.
3. Flexibility: EMR supports multiple programming languages and frameworks, but Data
Pipeline is limited to AWS-specific components.
Choose EMR when dealing with large-scale data processing tasks requiring complex
analytics or machine learning capabilities. Opt for Data Pipeline when orchestrating data
workflows between AWS services, focusing on data movement and simple transformations.
11. Discuss how Amazon EMR integrates with AWS Glue, AWS Lake Formation, and
Amazon Athena. How can these services complement each other?
Amazon EMR integrates with AWS Glue, AWS Lake Formation, and Amazon Athena to
create a comprehensive data processing ecosystem.
AWS Glue is a serverless ETL service that simplifies data extraction, transformation, and
loading tasks. It provides an interface for defining jobs and crawlers, which can discover and
catalog metadata in the AWS Glue Data Catalog. EMR can use this catalog as a central
repository for schema information, enabling seamless integration between various big data
applications.
AWS Lake Formation streamlines the process of setting up, securing, and managing data
lakes. It automates tasks like data ingestion, cleaning, and cataloging. EMR clusters can
access data stored in a lake created by Lake Formation, leveraging its security policies and
permissions for fine-grained access control.
Amazon Athena is an interactive query service that allows users to analyze data in Amazon
S3 using standard SQL. By integrating with the AWS Glue Data Catalog, Athena can utilize
the same metadata as EMR, ensuring consistency across queries. This enables users to run ad-
hoc analyses on data processed by EMR without needing to set up additional infrastructure.
12. Explain how to monitor the performance of an Amazon EMR cluster in real-time.
What metrics and tools are used for this purpose?
To monitor the performance of an Amazon EMR cluster in real-time, use Amazon
CloudWatch and Ganglia. CloudWatch provides metrics such as CPU utilization, memory
usage, and disk I/O operations. Enable detailed monitoring for more frequent data points.
Create custom alarms to notify you when specific thresholds are breached.
Ganglia is an open-source tool that offers a web-based interface for visualizing cluster
performance. Install it on your EMR cluster by adding the “Ganglia” application during
cluster creation or modifying an existing cluster. Access the Ganglia dashboard through the
EMR Console or directly via its URL.
Combine both tools for comprehensive monitoring: CloudWatch for metric collection and
alerting, and Ganglia for visualization and historical analysis.
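As one concrete example of the alerting mentioned above, a hedged sketch of a CloudWatch alarm on the EMR IsIdle metric (the cluster ID and SNS topic ARN are placeholders):
# Sketch: alert when the cluster has been idle for three consecutive 5-minute periods.
aws cloudwatch put-metric-alarm \
  --alarm-name emr-cluster-idle \
  --namespace AWS/ElasticMapReduce \
  --metric-name IsIdle \
  --dimensions Name=JobFlowId,Value=j-2EXAMPLE1234 \
  --statistic Average --period 300 --evaluation-periods 3 \
  --threshold 1 --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts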
13. How does Amazon EMR handle data durability and data loss prevention? Discuss the
various data backup options available.
Amazon EMR ensures data durability and loss prevention through replication, backup, and
recovery mechanisms. It leverages Hadoop’s HDFS for distributed storage, which replicates
data blocks across multiple nodes to prevent data loss due to node failures.
For additional data protection, Amazon EMR offers various backup options:
1. S3 DistCp: Use the S3 DistCp tool to copy data from HDFS to Amazon S3 periodically,
providing a durable backup of your data in case of cluster failure or termination.
2. EMR File System (EMRFS): EMRFS allows direct access to data stored in Amazon S3,
enabling you to use it as a persistent storage layer. This eliminates the need to store data on
local HDFS, reducing the risk of data loss.
3. AWS Glue Data Catalog: Integrate with AWS Glue Data Catalog to store metadata about
your data, making it easier to discover, search, and manage datasets.
4. Snapshots: Create snapshots of EBS volumes attached to EMR instances for point-in-time
backups. These can be used to restore data if needed.
5. Cross-Region Replication: Enable cross-region replication in Amazon S3 to automatically
replicate data across different regions, ensuring high availability and disaster recovery.
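Option 1 above (S3DistCp) is usually run as a step on the cluster itself; a hedged sketch with placeholder paths:
# Sketch: copy an HDFS directory to S3 using s3-dist-cp via command-runner.jar.
aws emr add-steps \
  --cluster-id j-2EXAMPLE1234 \
  --steps Type=CUSTOM_JAR,Name="BackupHdfsToS3",Jar=command-runner.jar,Args=["s3-dist-cp","--src","hdfs:///user/hadoop/data","--dest","s3://my-backup-bucket/data/"]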
14. Describe the bootstrap actions in Amazon EMR. What are their use cases, and how can
you create custom bootstrap actions?
Bootstrap actions in Amazon EMR are scripts executed on cluster nodes during the setup
phase before Hadoop starts. They enable customization of clusters, such as installing
additional software or configuring system settings.
Use cases include:
1. Installing custom applications
2. Modifying configuration files
3. Adjusting system parameters
To create a custom bootstrap action, follow these steps:
1. Write a script (e.g., Bash) to perform desired tasks.
2. Upload the script to an S3 bucket.
3. Specify the S3 location when creating the EMR cluster using AWS Management Console,
CLI, or SDKs.
Example: Install Python 3 and pip:
#!/bin/bash
# Bootstrap action script: runs on each node before applications start.
sudo yum install -y python3                     # install Python 3
curl -O https://bootstrap.pypa.io/get-pip.py            # download the pip installer
python3 get-pip.py --user                       # install pip for the current user
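Once the script is uploaded (to, say, the placeholder location s3://my-bucket/bootstrap/install-python3.sh), step 3 above looks roughly like this:
# Sketch: reference the bootstrap action when creating the cluster.
aws emr create-cluster \
  --name "cluster-with-bootstrap" \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge \
  --bootstrap-actions Path=s3://my-bucket/bootstrap/install-python3.sh,Name="InstallPython3" \
  --use-default-roles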
15. How does Amazon EMR handle instance failures? Explain the process of automatic
failover and recovery.
Amazon EMR handles instance failures through automatic failover and recovery
mechanisms. When a failure is detected, the system takes several steps to recover:
1. Identifies failed instances: EMR monitors cluster health and detects when an instance
becomes unresponsive or fails.
2. Reassigns tasks: Tasks running on the failed instance are reassigned to other available
instances in the cluster.
3. Launches replacement instances: EMR automatically provisions new instances to replace
the failed ones, maintaining the desired capacity.
4. Recovers data: If the failed instance was a core node with HDFS data, EMR recovers the
data by replicating it from other nodes.
5. Updates metadata: The system updates metadata about the cluster’s state, including
information about active instances and task assignments.
This process ensures minimal disruption to ongoing processing jobs and maintains data
integrity.
16. Can you discuss the concept of data locality in Amazon EMR and its impact on job
performance?
Data locality in Amazon EMR refers to the placement of data and computation tasks on
nodes within a cluster. It aims to minimize data movement across the network, thus
improving job performance. Hadoop Distributed File System (HDFS) stores data in blocks,
which are distributed across multiple nodes. When processing data, EMR attempts to
schedule tasks on nodes where the required data is already present.
There are three levels of data locality: node-local, rack-local, and off-switch. Node-local
means data and task reside on the same node; rack-local indicates they’re on different nodes
but within the same rack; off-switch implies they’re on separate racks. EMR prioritizes node-
local assignments, followed by rack-local, then off-switch.
Job performance benefits from data locality as it reduces network congestion and latency.
However, strict adherence to data locality may lead to resource underutilization or
imbalanced workloads. To address this, EMR uses delay scheduling, waiting for a short
period before assigning non-local tasks, allowing time for local resources to become
available.
17. What role does auto-scaling play in Amazon EMR, and how does it improve resource
usage and cost-efficiency?
Auto-scaling in Amazon EMR plays a crucial role in optimizing resource usage and cost-
efficiency. It dynamically adjusts the number of instances in an EMR cluster based on
workload demands, ensuring optimal performance while minimizing costs.
Auto-scaling helps improve efficiency by:
1. Automatically adding instances when demand increases, preventing bottlenecks and
maintaining high throughput.
2. Removing excess instances during low-demand periods, reducing unnecessary expenses.
3. Balancing workloads across instances to maximize utilization.
4. Allowing users to define scaling policies based on custom CloudWatch metrics or
predefined YARN metrics.
5. Supporting both core nodes (HDFS storage) and task nodes (processing only), enabling
fine-grained control over cluster resources.
6. Integrating with Spot Instances for additional cost savings without compromising
availability.
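Besides custom auto-scaling policies, EMR managed scaling can be attached to a running cluster; a hedged sketch with placeholder IDs and limits (the exact shorthand shape should be checked against the current CLI documentation):
# Sketch: let EMR managed scaling keep the cluster between 2 and 10 instances.
aws emr put-managed-scaling-policy \
  --cluster-id j-2EXAMPLE1234 \
  --managed-scaling-policy ComputeLimits='{MinimumCapacityUnits=2,MaximumCapacityUnits=10,UnitType=Instances}'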
18. Describe how Amazon EMR can be integrated with Amazon S3, and discuss the
benefits of storing input and output data in S3.
Amazon EMR integrates with Amazon S3 through the use of EMR File System (EMRFS),
which allows EMR clusters to access and process data stored in S3. This integration enables
seamless data transfer between EMR and S3, as well as efficient querying using tools like
Hive and Spark.
Storing input and output data in S3 offers several benefits:
1. Durability: S3 provides 99.999999999% durability, ensuring data reliability.
2. Scalability: S3 can store unlimited amounts of data, allowing for growth
without capacity constraints.
3. Cost-effectiveness: Pay-as-you-go pricing model reduces storage costs
compared to traditional Hadoop Distributed File System (HDFS).
4. Data accessibility: S3 data is accessible from multiple EMR clusters or
other AWS services, enabling parallel processing and reducing data silos.
5. Decoupling storage and compute: Separating storage (S3) from compute
resources (EMR) allows independent scaling and optimization of each
component.
6. Simplified data management: Lifecycle policies and versioning features in
S3 help manage data efficiently over time.
19. Explain the process of migrating an on-premises Hadoop cluster to Amazon EMR.
What are the key steps and considerations when performing this migration?
Migrating an on-premises Hadoop cluster to Amazon EMR involves several key steps and
considerations:
1. Assess the current Hadoop environment, including data size, processing requirements, and
dependencies.
2. Choose appropriate Amazon EMR instance types based on resource needs and cost
optimization.
3. Set up necessary AWS services such as S3 for storage, IAM for access control, and VPC
for networking.
4. Modify existing Hadoop applications to work with EMR, considering differences in file
systems (HDFS vs. S3) and APIs.
5. Migrate data from on-premises storage to S3 using tools like AWS Snowball or DataSync
for efficient transfer.
6. Test migrated applications on EMR, ensuring correctness and performance meet
expectations.
7. Plan a cutover strategy, minimizing downtime during migration.
Key considerations include:
– Ensuring data security and compliance during migration
– Optimizing costs by selecting suitable instances and leveraging spot pricing
– Monitoring and managing EMR clusters effectively
20. How can Amazon EMR be used for data warehousing and data analytics workloads?
Discuss some use cases and architectural patterns.
Amazon EMR is a managed Hadoop framework that simplifies running big data workloads,
enabling data warehousing and analytics. It supports various distributed processing engines
like Apache Spark, Hive, and Presto.
Use cases:
1. Log analysis: Analyze web server logs to gain insights into user behavior and improve
website performance.
2. ETL processing: Extract, transform, and load large datasets from multiple sources for
further analysis or storage in Amazon S3 or Redshift.
3. Machine learning: Train ML models on vast amounts of data using libraries like
TensorFlow or PyTorch.
4. Real-time analytics: Process streaming data with low-latency requirements using Apache
Flink or Kafka.
Architectural patterns:
1. Decoupling storage and compute: Store raw data in Amazon S3 and use EMR clusters for
processing, allowing independent scaling of storage and compute resources.
2. Data lake architecture: Ingest structured and unstructured data into an S3-based data lake,
then process and analyze it using EMR.
3. Lambda architecture: Combine batch and real-time processing by using EMR for batch
layer and Kinesis/Flink for the speed layer, merging results for final output.
4. Federated querying: Use EMR with Amazon Athena or Redshift Spectrum to query data
across different storage systems without moving it.
21. Discuss the best practices for cost optimization in Amazon EMR. What are the different
pricing models and billing options available?
To optimize costs in Amazon EMR, consider the following best practices:
1. Choose appropriate instance types: Select instances that provide optimal performance for
your workload at the lowest cost.
2. Use Spot Instances: Leverage Spot Instances to save up to 90% compared to On-Demand
pricing.
3. Utilize Reserved Instances: Purchase RIs for long-term workloads to achieve significant
savings over On-Demand rates.
4. Optimize cluster size: Scale clusters based on demand and use auto-scaling policies to
minimize costs.
5. Compress data: Reduce storage and processing costs by compressing input/output data.
6. Monitor usage: Track resource utilization with CloudWatch metrics and adjust
configurations accordingly.
Amazon EMR offers three pricing models:
1. On-Demand Instances: Pay-as-you-go model without upfront commitment.
2. Reserved Instances (RIs): Commitment-based model offering discounts for 1 or 3-year
terms.
3. Spot Instances: Bid-based model allowing you to purchase unused capacity at a lower
price.
Billing options include:
1. Per-second billing: Charges are calculated per second of usage.
2. Savings Plans: Flexible plans providing discounts for consistent usage across AWS
services.
22. Explain the role of AWS Identity and Access Management (IAM) policies in controlling
access to Amazon EMR resources.
AWS Identity and Access Management (IAM) policies play a crucial role in controlling
access to Amazon EMR resources by defining permissions for users, groups, and roles. These
policies determine the actions that can be performed on specific resources within an AWS
account.
IAM policies are JSON documents containing statements with elements like Effect, Action,
Resource, and Condition. The Effect specifies whether to allow or deny access, while Action
lists the operations allowed or denied. Resource identifies the target resource(s), and
Condition defines any constraints applied to the policy.
In the context of Amazon EMR, IAM policies help manage access to clusters, instances, and
other related services such as S3 buckets and EC2 instances. For example, you can create a
policy allowing certain users to launch EMR clusters but restrict them from terminating
existing ones.
Additionally, IAM policies can be used to control access to EMRFS data stored in S3,
ensuring only authorized users have read/write access to specific paths. This is achieved
through the use of EMRFS authorization rules, which map IAM policies to Hadoop user
accounts.
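A hedged sketch of the "can launch but cannot terminate" idea mentioned above, attached to a user with the AWS CLI; the user name, policy name, and action list are illustrative only.
# Sketch: allow launching and inspecting EMR clusters, explicitly deny termination.
cat > emr-launch-only.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow",
     "Action": ["elasticmapreduce:RunJobFlow",
                "elasticmapreduce:DescribeCluster",
                "elasticmapreduce:ListClusters",
                "elasticmapreduce:AddJobFlowSteps"],
     "Resource": "*"},
    {"Effect": "Deny",
     "Action": "elasticmapreduce:TerminateJobFlows",
     "Resource": "*"}
  ]
}
EOF
# In practice, launching clusters also needs iam:PassRole on the EMR service and instance roles.
aws iam put-user-policy --user-name data-engineer \
  --policy-name EMRLaunchOnly --policy-document file://emr-launch-only.json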
23. How can you use Amazon EMR and Lambda functions together for real-time data
processing? Provide an example use case.
Amazon EMR and Lambda functions can be used together for real-time data processing by
leveraging the strengths of both services. EMR is ideal for large-scale, distributed data
processing tasks, while Lambda excels at handling event-driven architectures with low
latency requirements.
In a typical use case, you could set up an Amazon Kinesis Data Stream to ingest real-time
data from various sources like IoT devices or social media feeds. Then, create a Lambda
function that processes incoming records in the stream and transforms them as needed. This
transformed data can then be written back to another Kinesis Data Stream or directly into an
S3 bucket.
Next, configure an EMR cluster to consume the processed data from the second Kinesis Data
Stream or S3 bucket. The EMR cluster can run Spark Streaming jobs to perform further
analysis, aggregation, or machine learning on the data. Finally, store the results in a persistent
storage system such as Amazon Redshift or DynamoDB for querying and visualization
purposes.
This architecture enables real-time data processing using Lambda functions for initial
transformations and EMR for more complex analytics, providing scalability and flexibility in
handling diverse workloads.
24. How to configure encryption options for data at rest and data in transit in Amazon
EMR?
To configure encryption options for data at rest and in transit in Amazon EMR, follow these
steps:
1. Enable server-side encryption (SSE) for S3 using AWS Key Management Service (KMS)
or SSE-S3 to protect data stored in input/output buckets.
2. Use HDFS Transparent Data Encryption (TDE) with KMS to encrypt intermediate data on
cluster nodes.
3. Configure local disk encryption by enabling LUKS (Linux Unified Key Setup) on the
instance store volumes of your EMR cluster instances.
4. For data in transit, enable TLS/SSL encryption for applications like Spark, Hive, and
Presto by setting up security configurations in EMR.
5. Create a security configuration JSON file specifying encryption settings for each
component (e.g., S3, HDFS, Local Disk, and TLS).
6. When creating an EMR cluster, specify the created security configuration using the
--security-configuration flag.
Example:
aws emr create-security-configuration --name MySecurityConfig --security-configuration '{
"EncryptionConfiguration": {
"AtRestEncryptionConfiguration": {...},
"InTransitEncryptionConfiguration": {...}
}
}'
aws emr create-cluster ... --security-configuration MySecurityConfig
25. Explain the process of creating and deploying Docker containers in an Amazon EMR
cluster. What benefits does containerization bring to the EMR environment?
To create and deploy Docker containers in an Amazon EMR cluster, follow these steps:
1. Set up the EMR cluster with a custom bootstrap action to install Docker.
2. Create a Dockerfile defining your container’s environment, dependencies, and application
code.
3. Build the Docker image using the docker build command and push it to a container
registry like Amazon ECR or Docker Hub.
4. Configure EMR step(s) to pull the Docker image and run the containerized application
using the docker run command.
Containerization benefits in EMR environment include:
– Isolation: Containers encapsulate applications and their dependencies, ensuring consistent
execution across environments.
– Versioning: Container images can be versioned, allowing easy rollback to previous versions
if needed.
– Portability: Containers can run on any platform supporting Docker, simplifying migration
between environments.
– Resource Efficiency: Containers share underlying OS resources, reducing resource
overhead compared to running separate VMs for each application.
– Scalability: Containers can be easily scaled horizontally by adding more instances to handle
increased workloads.
RELATIONAL DATABASE SERVICES (RDS)
What is DATABASE?
A database is a systematic collection of data. Databases support the storage and manipulation of data.
e.g.: Facebook, telecom companies, amazon.com
What is DBMS?
A DBMS is a collection of programs that enables its users to access a database, manipulate data,
and report/represent data.
Types of DBMS
1. Hierarchical
2. Network
3. Relational
4. Object oriented
Relational Database:
A relational database is a data store that organizes data into tables and allows you to link
information across different tables and different types of data.
Tables are related to each other.
All fields must be filled.
Best suited for OLTP (online transaction processing).
Relational DBs: MySQL, Oracle, Microsoft SQL Server, IBM DB2.
A row of a table is also called a record. It contains the specific information of each individual
entry in the table.
Each table has its own primary key.
A schema (the design of the database) is used to strictly define tables, columns, indexes, and the
relations between tables.
Relational DBs are usually used in enterprise applications/scenarios. An exception is MySQL,
which is also widely used for web applications.
Common applications for MySQL include PHP- and Java-based web applications that require a
database storage backend, e.g. Joomla.
Relational databases cannot scale out horizontally; they scale vertically.
Virtually all relational DBs use SQL.
1. Columnar Database:
A columnar database is a DBMS that stores data in columns instead of rows.
In a columnar DB, all the column-1 values are physically stored together, followed by all the
column-2 values.
In a row-oriented DBMS the data would be stored like this:
(1, bob, 30, 8000; 2, arun, 26, 4000; 3, vian, 39, 2000)
In a column-based DBMS the data would be stored like this:
(1, 2, 3; bob, arun, vian; 30, 26, 39; 8000, 4000, 2000)
The benefit is that because a column-based DBMS is self-indexing, it uses less disk space than a
row-oriented RDBMS containing the same data, and it easily performs operations like min, max,
and average.
2. Document Database:
Document DBs make it easier for developers to store and query data in a DB by using the
same document model format they use in their application code.
Document DBs are efficient for storing catalogues.
They store semi-structured data as documents, typically in JSON or XML format.
A document database is a great choice for content management applications such as blogs and
video platforms.
3. Key-Value Database:
A key-value DB is a simple DB that uses an associative array (like a dictionary) as its
fundamental model, where each key is associated with one and only one value in a collection.
It allows horizontal scaling.
Use cases: shopping carts, and session stores in apps like Facebook and Twitter.
They improve application performance by storing critical pieces of data in memory for low-
latency access.
Amazon ElastiCache is an in-memory key-value store.
RDS Limits:
Up to 40 DB instances per account.
10 of these 40 can be Oracle or MS SQL Server under the license-included model.
Or
Under the BYOL (bring your own license) model, all 40 can be any DB engine you need.
DB Instance Size:
a. Standard class
b. Memory-Optimized class
c. Burstable class
What is Multi-AZ in RDS:
You can select the Multi-AZ option during RDS DB instance launch.
The RDS service creates a standby instance in a different AZ in the same region and configures
“synchronous replication” between the primary and the standby.
You cannot read/write to the standby RDS DB instance.
You cannot select which AZ in the region will be chosen to create the standby DB instance.
You can, however, view which AZ was selected after the standby is created.
Depending on the instance class, it may take one to a few minutes to fail over to the standby
instance.
AWS recommends the use of Provisioned IOPS instances for Multi-AZ RDS instances.
A DB instance reboot is required for changes to take effect when you change the DB
parameter group or when you change a static DB parameter.
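A hedged sketch of launching a Multi-AZ MySQL instance from the AWS CLI; the identifier, instance class, storage size, and password are placeholders.
# Sketch: Multi-AZ deployment; RDS creates the synchronous standby in another AZ automatically.
aws rds create-db-instance \
  --db-instance-identifier mydb \
  --engine mysql \
  --db-instance-class db.m6g.large \
  --allocated-storage 50 \
  --master-username admin \
  --master-user-password 'ChangeMe123!' \
  --multi-az \
  --backup-retention-period 7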
There are two methods to back up and restore your RDS DB instances:
1. AWS RDS automated backups
2. User-initiated manual backups
You can take a backup of the entire DB instance or just the DB.
You can create and restore volume snapshots of your entire DB instance.
Automated backups by AWS back up your DB data to multiple AZs to provide data
durability.
Select automated backup in the AWS console.
Backups are stored in Amazon S3.
Multi-AZ automated backups will be taken from the standby instance.
The DB instance must be in the “ACTIVE” state for automated backup.
RDS automatically backs up the DB instance daily by creating storage volume snapshots
of your DB instance (full daily snapshots), including the DB transaction logs.
You can decide when you would like the backup to be taken (backup window).
There is no additional charge for RDS backing up your DB instance.
For Multi-AZ deployments, backups are taken from the standby DB instance (true for
MariaDB, MySQL, Oracle, PostgreSQL).
Automated backups are deleted when you delete your RDS DB instance.
An outage occurs if you change the backup retention period from zero to a non-zero value or
the other way around.
The retention period of automated backups is 7 days by default via the AWS console.
AWS Aurora is an exception; its default is 1 day.
Via CLI or API, the default is 1 day.
You can increase it up to 35 days.
If you don’t want backups, put zero “0” in the retention period.
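For example, the retention period can be changed from the CLI roughly as follows; remember from above that moving between zero and a non-zero value causes an outage.
# Sketch: extend automated backup retention to 14 days and set a backup window.
aws rds modify-db-instance \
  --db-instance-identifier mydb \
  --backup-retention-period 14 \
  --preferred-backup-window 02:00-03:00 \
  --apply-immediately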
In the case of a manual snapshot, point-in-time recovery is not possible.
Manual snapshots are also stored in S3.
They are not deleted automatically if you delete the RDS instance.
Take a final snapshot before deleting your RDS DB instance.
You can share a manual snapshot directly with another AWS account.
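A hedged sketch of taking a manual snapshot and sharing it with another account (the identifiers and target account ID are placeholders):
# Sketch: create a manual snapshot, then allow another account to restore from it.
aws rds create-db-snapshot \
  --db-instance-identifier mydb \
  --db-snapshot-identifier mydb-final-snapshot
aws rds modify-db-snapshot-attribute \
  --db-snapshot-identifier mydb-final-snapshot \
  --attribute-name restore \
  --values-to-add 210987654321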
When you restore a DB instance, only the default DB parameters and security groups are
associated with the restored instance.
You cannot restore a DB snapshot into an existing DB instance; instead, RDS creates a new
DB instance with a new endpoint.
Restoring from a backup or a DB snapshot therefore changes the RDS instance endpoint.
At the time of restoring, you can change the storage type (General Purpose or Provisioned IOPS).
You cannot encrypt an existing unencrypted DB instance.
To do that you need to create a new encrypted instance and migrate your data to it (from
unencrypted to encrypted), or you can restore from a backup/snapshot into a new encrypted RDS
instance.
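A hedged sketch of that snapshot route for encrypting an existing unencrypted instance; the snapshot names and KMS key alias are placeholders.
# Sketch: snapshot -> encrypted copy -> restore as a new, encrypted instance.
aws rds create-db-snapshot \
  --db-instance-identifier mydb \
  --db-snapshot-identifier mydb-unencrypted-snap
aws rds wait db-snapshot-available --db-snapshot-identifier mydb-unencrypted-snap
aws rds copy-db-snapshot \
  --source-db-snapshot-identifier mydb-unencrypted-snap \
  --target-db-snapshot-identifier mydb-encrypted-snap \
  --kms-key-id alias/my-rds-key
aws rds wait db-snapshot-available --db-snapshot-identifier mydb-encrypted-snap
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier mydb-encrypted \
  --db-snapshot-identifier mydb-encrypted-snap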
RDS supports encryption-at-rest for all DB engines using KMS.
What actually gets encrypted with encryption-at-rest:
a. All its snapshots.
b. Backups of the DB (S3 storage).
c. Data on the EBS volumes.
d. Read replicas created from the snapshots.
AMAZON REDSHIFT
Redshift Architecture Component 1: Leader Node
The Leader Node in an Amazon Redshift Cluster manages all external and internal
communication. It is responsible for preparing query execution plans whenever a
query is submitted to the Cluster. Once the query execution plan is ready, the Leader
Node distributes the query execution code to the Compute Nodes and assigns Slices
of data to each Compute Node for computation of results.
The Leader Node distributes query load to the Compute Nodes only when the query involves
accessing data stored on the Compute Nodes. Otherwise, the query is executed on the
Leader Node itself. There are several functions in the Redshift Architecture that are
always executed on the Leader Node; see SQL Functions Supported on the Leader Node
in the Redshift documentation for more information on these functions.
Redshift Architecture Component 2: Compute Nodes
Compute Nodes are responsible for the actual execution of queries and have data stored with
them. They execute queries and return intermediate results to the Leader Node which further
aggregates the results.
Dense Storage (DS): Dense Storage Nodes allow you to create large Data
Warehouses using Hard Disk Drives (HDDs) for a low price point.
Dense Compute (DC): Dense Compute nodes allow you to create high-performance
Data Warehouses using Solid-State Drives (SSDs).
Redshift Architecture Component 3: Node Slices
A Compute Node consists of Slices. Each Slice has a portion of Compute Node’s memory
and disk assigned to it where it performs query operations. The Leader Node is responsible
for assigning a query code and data to a slice for execution. Slices once assigned query load
work in parallel to generate query results.
Data is distributed among the Slices on the basis of the Distribution Style and Distribution
Key of a particular table. An even distribution of data enables Redshift to assign workload
evenly to Slices and maximizes the benefit of parallel processing.
The number of Slices per Compute Node is decided on the basis of the node type; see the
Redshift documentation on Clusters and Nodes for more information.
Amazon Redshift Architecture allows it to use Massively Parallel Processing (MPP) for fast
processing even for the most complex queries and a huge amount of data set. Multiple
compute nodes execute the same query code on portions of data to maximize Parallel
Processing.
Data in the Amazon Redshift Data Warehouse is stored in a Columnar fashion which
drastically reduces the I/O on disks. Columnar storage reduces the number of disk I/O
requests and minimizes the amount of data loaded into the memory to execute a query.
Reduction in I/O speeds up query execution and loading less data means Redshift can
perform more in-memory processing.
Redshift uses Sort Keys to sort columns and filter out chunks of data while
executing queries. Sort Keys are covered in more detail later in these notes.
Redshift Architecture Component 6: Data Compression
Data compression is one of the important factors in ensuring query performance. It reduces
the storage footprint and enables the loading of large amounts of data in the memory fast.
Owing to columnar storage, Redshift can apply compression encodings adapted to each
column's data type.
Redshift’s Query Optimizer generates query plans that are MPP-aware and take advantage
of columnar data storage. The Query Optimizer uses analyzed information about tables to
generate efficient query plans for execution; running ANALYZE regularly (covered later in
these notes) keeps that information up to date.
Amazon Redshift provides private and high-speed network communication between leader
node and compute nodes by leveraging high-bandwidth network connections and custom
communication protocols. The compute nodes run on an isolated network that can never be
accessed directly by Client Applications.
Performance Features:
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the
cloud. It is designed to analyze large datasets using SQL queries and provides high-
performance analysis of structured data. Here are some key features of Amazon Redshift:
1. Columnar Storage:
Amazon Redshift stores data in a columnar format, which is more efficient for analytical
queries. This allows for high compression rates and faster query performance.
2. Massively Parallel Processing (MPP):
Redshift uses a massively parallel processing architecture, distributing the workload across
multiple nodes to enable parallel query execution. This helps to handle large datasets and
complex queries efficiently.
3. Scalability:
Redshift is scalable, allowing you to easily add or remove nodes to accommodate changing
workloads. You can scale the cluster up or down based on your performance and storage
requirements.
4. Automated Backup and Restore:
Redshift automatically takes incremental snapshots of your data to enable point-in-time
recovery. You can also create manual snapshots for data backup and archiving purposes.
5. Data Compression:
Redshift uses various compression techniques to minimize storage requirements and improve
query performance. This includes columnar storage, run-length encoding, and dictionary
encoding.
6. Advanced Query Optimization:
Redshift provides features such as zone maps, query rewrites, and automatic table
optimization to enhance query performance. The query optimizer analyzes and chooses the
most efficient query plan for execution.
7. Security:
Amazon Redshift offers several security features, including encryption of data in transit and
at rest, Virtual Private Cloud (VPC) integration, AWS Identity and Access Management
(IAM) for access control, and support for Virtual Private Network (VPN) and Direct
Connect.
8. Integration with AWS Ecosystem:
Redshift seamlessly integrates with other AWS services, allowing you to easily transfer data
to and from services like Amazon S3, AWS Glue, and Amazon EMR for data processing and
analytics.
9. Concurrency and Workload Management:
Redshift provides robust concurrency controls and workload management features, allowing
you to manage multiple concurrent queries efficiently. You can set query queues and allocate
resources based on your specific workload requirements.
10. Monitoring and Logging:
AWS Redshift provides tools for monitoring and logging, including Amazon CloudWatch
metrics, query and performance logging, and detailed system views. This helps you monitor
the health and performance of your data warehouse.
11. Data Loading Options:
Redshift supports various data loading options, including direct data injection from Amazon
S3, data streaming, and bulk data loading using the COPY command.
These features collectively make Amazon Redshift a powerful and flexible data warehousing
solution for businesses dealing with large-scale analytical workloads in the cloud.
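For reference, a hedged sketch of provisioning a small Redshift cluster from the AWS CLI; the identifier, node type, and credentials are placeholders.
# Sketch: two-node RA3 cluster with a default database named dev.
aws redshift create-cluster \
  --cluster-identifier analytics-cluster \
  --node-type ra3.xlplus \
  --number-of-nodes 2 \
  --master-username awsadmin \
  --master-user-password 'ChangeMe123!' \
  --db-name dev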
4]Data Security and Protection:
#Data Security:
-SSL (Secure Sockets Layer) encryption
-TLS (Transport Layer Security) encryption,
#Data Protection:
Data encryption:
1] Server-side encryption
2] Client-side encryption
Encryption at rest : AES-256 encryption
Encryption in Transit : SSL/TLS encryption
What is Workgroup?
-collection of compute resources from which an endpoint is created
-compute-related Container
-groups together compute resources like RPUs(Redshift Processing Units), VPC
subnet groups, security groups
What is Namespace?
-namespace is a collection of database objects and users
-storage-related
-groups together schemas, tables, users, or AWS Key Management Service keys for
encrypting data
==>> When using Redshift Serverless, you provision a Workgroup for compute
resources and a Namespace for storage resources.
You can create one or more namespaces and workgroups. Each namespace can have only one
workgroup associated with it. Conversely, each workgroup can be associated with only one
namespace.
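A hedged sketch of provisioning both from the AWS CLI; the names, credentials, and RPU base capacity are placeholders, and parameter names should be checked against the current redshift-serverless CLI reference.
# Sketch: storage (namespace) first, then compute (workgroup) bound to it.
aws redshift-serverless create-namespace \
  --namespace-name analytics-ns \
  --db-name dev \
  --admin-username awsadmin \
  --admin-user-password 'ChangeMe123!'
aws redshift-serverless create-workgroup \
  --workgroup-name analytics-wg \
  --namespace-name analytics-ns \
  --base-capacity 32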
================================================================
Evolution of Data Processing Frameworks:
1] ETL:
-Extract, Transform on Spark, then Load into the traditional DW
2] ELT:
-Extract, Load into the DW
-Transform inside the modern DW
3] EtLT:
-t --> a light transform on Spark: schema conversion, column truncation
-Load into the warehouse
-Transform in the modern DW: aggregation-related work
UPSERT operation --> Update + Insert (update the row if it already exists, otherwise insert it)
High Cardinality:
1. Unique Values:
High cardinality columns have a large number of distinct values.
Examples could include columns like "user_id" or "email_address" where each row has a
unique value.
2. Indexes:
High cardinality columns are good candidates for indexing because they provide efficient
access to specific rows in a table.
3. Filtering:
When querying on high cardinality columns, the database engine can quickly filter down the
result set as each value is likely to be unique.
4. Join Conditions:
High cardinality columns are often used in join conditions between tables.
5. Storage:
Storing high cardinality columns efficiently may require more storage space.
Low Cardinality:
Low cardinality columns have only a few distinct values relative to the number of rows.
EX: a "product" column in a table of 1000 records that contains only 10 unique values.
---------------------------------------------------------------------------------------------------------------
Query result caching(result set caching): involves storing the actual result sets of
executed queries for later retrieval
Query caching:
-involves caching the execution plan or metadata associated with a query, rather
than the actual data
-helps save on planning and optimization time
1] Copy Command ::
COPY table_name [column_list]
FROM data_source
[options]
explanation:
o table_name: name of the target table where you want to load data
o column_list: (Optional)
-A comma-separated list of columns in the target table.
-If not specified, Redshift assumes that the column order in the source
file matches the order of columns in the target table
o data_source:
-Specifies the source of the data you want to copy
-This can be an Amazon S3 bucket, an Amazon EMR cluster, a data
file on your local file system, or a remote host using SSH
1]FORMAT format_type: Specifies the file format of the source data. Supported formats
include CSV, JSON, AVRO, PARQUET, ORC, and more.
2]DELIMITER 'delimiter': Specifies the field delimiter used in the source data.
3]IGNOREHEADER n: Specifies the number of header lines to skip in the source data.
4]FILLRECORD: Adds null columns to match the target table's column count if the source
data has missing columns.
5]ENCRYPTED: Indicates that the source data is encrypted.
6]MAXERROR n: Sets the maximum number of allowed data load errors before the COPY
operation fails
7]CREDENTIALS 'aws_access_key_id=access-key-id;aws_secret_access_key=secret-
access-key': Specifies AWS access credentials when loading data from Amazon S3.
(Specifying an IAM_ROLE is the usual alternative to embedding access keys; psycopg is a
PostgreSQL driver often used to connect to Redshift from Python, since Redshift is
PostgreSQL-based.)
8]COMPUPDATE ON|OFF: Specifies whether to recalculate table statistics after the COPY
operation
9]GZIP: Indicates that the source data is in GZIP compressed format
10]TRUNCATECOLUMNS: Truncates data that exceeds the column length in the target table
11]REMOVEQUOTES: Removes surrounding quotation marks from data fields
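Putting the options above together, a hedged COPY example issued through the Redshift Data API from the CLI; the table, bucket, role ARN, and workgroup are placeholders, and IAM_ROLE is used here as the usual alternative to the CREDENTIALS access keys in option 7.
# Sketch: load gzipped CSV files from S3 into the sales table.
aws redshift-data execute-statement \
  --workgroup-name analytics-wg \
  --database dev \
  --sql "COPY sales FROM 's3://my-bucket/sales/' IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' FORMAT AS CSV IGNOREHEADER 1 GZIP;"
# For a provisioned cluster, use --cluster-identifier and --db-user instead of --workgroup-name.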
2]Redshift Spectrum:
-a feature of Amazon Redshift that enables you to run SQL queries directly against
data stored in your Amazon S3 buckets
-extends the data warehousing capabilities of Redshift by allowing you to analyze and
join data from multiple sources, both within your Redshift data warehouse and external data
stored in Amazon S3
#Features:
1]Data in Amazon S3
2]External Tables
-not stored in your Redshift cluster but act as metadata for querying the data in S3
3]Querying :
-can run SQL queries that join your internal Redshift tables with your external S3 data
4]Performance:
-optimizes query performance by pushing down filters to the S3 data, minimizing data
movement
4) Distribution Key:
-determines how a table's rows are distributed across the node slices (DISTKEY column)
create table your_table (
column1 int,
column2 varchar(50),
distribution_key_column int distkey);
5]Sort Keys ::
1)Compound :
-composed of one or more columns
-Data is initially sorted based on the first column in the sort key and
then within each of those groups, it is further sorted based on the second column and so on
-known access patterns that frequently filter, join, or aggregate data
based on multiple columns in a predictable order
-DDL:
create table your_table (
column1 int,
column2 varchar(50),
column3 date)
sortkey (column1, column3);
2)Interleaved ::
-also composed of one or more columns
-doesn't prioritize one column over the others ==>it interleaves the
data across all columns in the sort key evenly
-can help improve query performance for tables with unpredictable
query patterns(varying filtering and grouping conditions)
-DDL:
create table your_table (
column1 int,
column2 varchar(50),
column3 date)
interleaved sortkey (column1, column2, column3);
6]Redshift Workload Management(WLM):
-WLM helps you manage and prioritize queries in your Redshift cluster
-ensuring that different workloads and queries can coexist and perform
efficiently in a multi-user environment
-enables you to allocate resources, control concurrency, and manage query
performance by defining query queues and assigning query groups
1]VACUUM Command:
-used to optimize and reclaim storage space in database tables
-two main types:
1]VACUUM:
-reclaims space and re-sorts rows in the specified table
-when DML happens, Redshift does not immediately release the space ==> this can lead to
fragmented storage and decreased query performance
-running a VACUUM on a table consolidates the data, removes deleted rows, and sorts the
remaining rows
-Syntax: VACUUM table_name;
2]VACUUM FULL:
-performs a more aggressive vacuum operation
-it can be more resource-intensive and time-consuming
-Syntax: VACUUM FULL table_name;
2]Analyze Command :
-used to update and refresh statistics about the data in database tables
-these statistics are essential for the query planner to make informed
decisions about query execution plans
-Syntax : ANALYZE my_table;
-important to regularly run ANALYZE on your tables after significant
data changes, to ensure that the query planner has the most accurate information for
optimizing query performance
##Amazon Redshift best practices for designing tables::
1.Choose the best sort key
2.Choose the best distribution style
3.Use automatic compression
4.Define primary key and foreign key constraints between tables wherever
appropriate. Even though they are informational only, the query optimizer uses those
constraints to generate more efficient query plans
5. Use smallest possible column size
4. Why use an AWS Data Pipeline to load CSV into Redshift? And How?
AWS Data Pipeline facilitates the extraction and loading of CSV(Comma Separated Values)
files. Using AWS Data Pipelines for CSV loading eliminates the stress of putting together a
complex ETL system. It offers template activities to perform DML(data manipulation) tasks
efficiently.
To load the CSV file, we must copy the CSV data from the host source and paste that into
Redshift via RedshiftCopyActivity.
14. What is Redshift Spectrum? What data formats does Redshift Spectrum support?
Redshift Spectrum was released by AWS (Amazon Web Services) as a companion to Amazon
Redshift. It lets you run SQL queries directly against data stored in Amazon Simple Storage
Service (Amazon S3), i.e. against a data lake. Redshift Spectrum can process gigabytes to
exabytes of structured and semi-structured data in Amazon S3, and no ETL or loading is
required in this process; the Redshift query planner produces and optimizes the query plan,
and Spectrum handles the scan of the data in S3. Redshift Spectrum supports various
structured and semi-structured data formats, including AVRO, TEXTFILE, RCFILE,
PARQUET, SEQUENCEFILE, RegexSerDe, JSON, Grok, Ion, and ORC. Amazon suggests
using columnar data formats like Apache Parquet to improve performance and reduce cost.
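A minimal sketch of a typical Spectrum setup, assuming the AWS Glue Data Catalog and placeholder names for the schema, database, IAM role, and S3 location:
create external schema spectrum_demo
from data catalog
database 'spectrum_db'
iam_role 'arn:aws:iam::123456789012:role/my-spectrum-role'
create external database if not exists;

create external table spectrum_demo.sales (
sale_id   bigint,
sale_date date,
amount    decimal(12,2))
stored as parquet
location 's3://my-bucket/sales/';

select sale_date, sum(amount)
from spectrum_demo.sales
group by sale_date;             -- runs against the S3 data, nothing is loaded into Redshift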
16. What are the key differences between SQL Server and Amazon Redshift?
Here are some key differences between SQL Server and Amazon Redshift:
SQL Server is a traditional, on-premises or cloud-based database management system with a
relational data model and a powerful database engine. It is fast and flexible, but can be
expensive to scale and may not integrate well with other AWS services.
Amazon Redshift is a fully managed, cloud-based data warehousing service with a relational
data model and a PostgreSQL-based database engine. It is designed for fast query
performance and low cost, and has strong integration with other AWS services such as
Amazon S3 and Amazon EMR. However, it may not have all of the features and capabilities of a
general-purpose relational database such as SQL Server.
18. What is a data warehouse and how does AWS Redshift help?
A data warehouse is a central repository where data generated by the organization's systems
and other sources is collected and processed for analysis.
A high-level data warehouse has three-tier architecture:
1. In the bottom tier, we have the tools which cleanse and collect the data.
2. In the middle level, we have tools to transform the data using the Online Analytical
Processing Server.
3. At the top level, we have different tools where data analysis and data mining are carried out
at the front end.
As data grows continuously in an organization, the company constantly has to upgrade its
expensive storage servers. AWS Redshift addresses this with a cloud-based data warehouse
offered by Amazon where businesses can store and analyze their data.
19. Is there any support for time zones in Redshift while storing data?
Time zones are not stored by Redshift.
All timestamp data is stored without time zone information and is assumed to be UTC.
When you insert a value into a TIMESTAMPTZ column, for example, the time zone offset is
applied to the timestamp to convert it to UTC, and the converted timestamp is saved. The
original time zone information is not kept.
If you wish to keep track of the time zone, you need an extra column to store it (see the
sketch below).
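A small sketch of that pattern, with assumed table and column names:
create table events (
event_id    bigint,
occurred_at timestamptz,        -- converted to UTC on insert
source_tz   varchar(40));       -- keep the original time zone separately

insert into events values
(1, '2024-03-01 10:00:00+05:30', 'Asia/Kolkata');

select event_id, occurred_at, source_tz
from events;                    -- occurred_at comes back as the UTC-normalized value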
24. What are clusters in Redshift? How do I create and delete a cluster in AWS
redshift?
Computing resources in an Amazon Redshift data warehouse are called nodes, which are
grouped together into a cluster.
Each cluster runs the Amazon Redshift engine and contains at least one database.
To create a cluster, follow these steps: –
Open the Amazon Redshift console.
Select the AWS Region you want to use from the navigation bar.
Choose Clusters in the navigation panel, then choose Create cluster on the Clusters page.
Finally, provide a cluster identifier, choose the node type and number of nodes, set the admin
user name and password, and choose Create cluster.
To delete a cluster in AWS, follow these steps: –
Open the Amazon Redshift console.
In the navigation panel, select the cluster that you want to remove.
On the Configuration tab of the cluster details page, choose Cluster and then choose the
Delete option.
In the Delete Cluster dialog box, complete one of the following final steps:
Choose YES to create a final snapshot before deleting the cluster, give the snapshot a name,
and then choose Delete.
Or choose NO to delete the cluster without taking a final snapshot and then choose Delete.
26. How do you query Amazon Redshift to show your table data?
Below is the command to list tables in the public schema :
SELECT DISTINCT tablename
FROM pg_table_def
WHERE schemaname = 'public'
ORDER BY tablename;
Below is the command to describe the columns of a table called employee :
SELECT *
FROM pg_table_def
WHERE tablename = 'employee' AND schemaname = 'public';
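Because pg_table_def only shows tables in schemas that are on the search_path, an alternative is the standard information_schema view (the schema name is assumed to be public):
select table_name
from information_schema.tables
where table_schema = 'public'
order by table_name;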
27. Why should I use Amazon Redshift over an on-premises data warehouse?
On-premises data warehouses require a considerable amount of time and resources to
manage, especially for large datasets. In addition, the financial costs of constructing,
maintaining, and scaling self-managed on-site data warehouses are very high.
As your data expands, you must continuously decide what data to load into your data
warehouse and what data to archive elsewhere in order to control costs, keep ETL complexity
low, and deliver good performance. Amazon Redshift not only greatly decreases the expense and
operating overhead of a data warehouse, but with Redshift Spectrum it also makes it easy to
analyze vast volumes of data in its native format without forcing you to load it.