AWS Well-Architected Framework
Introduction
The AWS Well-Architected Framework helps you understand the pros and cons of decisions you
make while building systems on AWS. Using the Framework helps you learn architectural best
practices for designing and operating secure, reliable, efficient, cost-effective, and sustainable
workloads in the AWS Cloud. It provides a way for you to consistently measure your architectures
against best practices and identify areas for improvement. The process for reviewing an
architecture is a constructive conversation about architectural decisions, and is not an audit
mechanism. We believe that having well-architected systems greatly increases the likelihood of
business success.
AWS Solutions Architects have years of experience architecting solutions across a wide variety
of business verticals and use cases. We have helped design and review thousands of customers’
architectures on AWS. From this experience, we have identified best practices and core strategies
for architecting systems in the cloud.
The AWS Well-Architected Framework documents a set of foundational questions that help you to
understand if a specific architecture aligns well with cloud best practices. The framework provides
a consistent approach to evaluating systems against the qualities you expect from modern cloud-
based systems, and the remediation that would be required to achieve those qualities. As AWS
continues to evolve, and we continue to learn more from working with our customers, we will
continue to refine the definition of well-architected.
This framework is intended for those in technology roles, such as chief technology officers
(CTOs), architects, developers, and operations team members. It describes AWS best practices and
strategies to use when designing and operating a cloud workload, and provides links to further
implementation details and architectural patterns. For more information, see the AWS Well-
Architected homepage.
Definitions
• A component is the code, configuration, and AWS Resources that together deliver against a
requirement. A component is often the unit of technical ownership, and is decoupled from other
components.
• The term workload is used to identify a set of components that together deliver business value.
A workload is usually the level of detail that business and technology leaders communicate
about.
example, verifying that teams are meeting internal standards. We mitigate these risks in two ways.
First, we have practices (ways of doing things, process, standards, and accepted norms) that focus
on allowing each team to have that capability, and we put in place experts who verify that teams
raise the bar on the standards they need to meet. Second, we implement mechanisms that carry
out automated checks to verify standards are being met.
“Good intentions never work, you need good mechanisms to make anything happen” —
Jeff Bezos.
This means replacing a human's best efforts with mechanisms (often automated) that check for
compliance with rules or process. This distributed approach is supported by the Amazon leadership
principles, and establishes a culture across all roles that works back from the customer. Working
backward is a fundamental part of our innovation process. We start with the customer and what
they want, and let that define and guide our efforts. Customer-obsessed teams build products in
response to a customer need.
For architecture, this means that we expect every team to have the capability to create
architectures and to follow best practices. To help new teams gain these capabilities or existing
teams to raise their bar, we provide access to a virtual community of principal engineers who
can review their designs and help them understand what AWS best practices are. The principal
engineering community works to make best practices visible and accessible. One way they do this,
for example, is through lunchtime talks that focus on applying best practices to real examples.
These talks are recorded and can be used as part of onboarding materials for new team members.
AWS best practices emerge from our experience running thousands of systems at internet scale.
We prefer to use data to define best practice, but we also use subject matter experts, like principal
engineers, to set them. As principal engineers see new best practices emerge, they work as a
community to verify that teams follow them. In time, these best practices are formalized into our
internal review processes, and also into mechanisms that enforce compliance. The Well-Architected
Framework is the customer-facing implementation of our internal review process, where we
have codified our principal engineering thinking across field roles, like Solutions Architecture and
internal engineering teams. The Well-Architected Framework is a scalable mechanism that lets you
take advantage of these learnings.
improvements can be made and can help develop organizational experience in dealing with
events.
Design principles
The following are design principles for operational excellence in the cloud:
• Organize teams around business outcomes: The ability of a team to achieve business outcomes
comes from leadership vision, effective operations, and a business-aligned operating model.
Leadership should be fully invested and committed to a CloudOps transformation with a suitable
cloud operating model that incentivizes teams to operate in the most efficient way and meet
business outcomes. The right operating model uses people, process, and technology capabilities
to scale, optimize for productivity, and differentiate through agility, responsiveness, and
adaptation. The organization's long-term vision is translated into goals that are communicated
across the enterprise to stakeholders and consumers of your cloud services. Goals and
operational KPIs are aligned at all levels. This practice sustains the long-term value derived from
implementing the following design principles.
• Implement observability for actionable insights: Gain a comprehensive understanding
of workload behavior, performance, reliability, cost, and health. Establish key performance
indicators (KPIs) and leverage observability telemetry to make informed decisions and take
prompt action when business outcomes are at risk. Proactively improve performance, reliability,
and cost based on actionable observability data.
• Safely automate where possible: In the cloud, you can apply the same engineering discipline
that you use for application code to your entire environment. You can define your entire
workload and its operations (applications, infrastructure, configuration, and procedures) as code,
and update it. You can then automate your workload’s operations by initiating them in response
to events. In the cloud, you can employ automation safety by configuring guardrails, including
rate control, error thresholds, and approvals. Through effective automation, you can achieve
consistent responses to events, limit human error, and reduce operator toil.
• Make frequent, small, reversible changes: Design workloads that are scalable and loosely
coupled to permit components to be updated regularly. Automated deployment techniques
together with smaller, incremental changes reduce the blast radius and allow for faster reversal
when failures occur. This increases confidence to deliver beneficial changes to your workload
while maintaining quality and adapting quickly to changes in market conditions.
• Refine operations procedures frequently: As you evolve your workloads, evolve your operations
appropriately. As you use operations procedures, look for opportunities to improve them. Hold
regular reviews and validate that all procedures are effective and that teams are familiar with
them. Where gaps are identified, update procedures accordingly. Communicate procedural
Best practices
Note
All operational excellence questions have the OPS prefix as a shorthand for the pillar.
Topics
• Organization
• Prepare
• Operate
• Evolve
Organization
Your teams must have a shared understanding of your entire workload, their role in it, and shared
business goals to set the priorities that will achieve business success. Well-defined priorities will
maximize the benefits of your efforts. Evaluate internal and external customer needs involving
key stakeholders, including business, development, and operations teams, to determine where to
focus efforts. Evaluating customer needs will verify that you have a thorough understanding of
the support that is required to achieve business outcomes. Verify that you are aware of guidelines
or obligations defined by your organizational governance and external factors, such as regulatory
compliance requirements and industry standards that may mandate or emphasize specific focus.
Validate that you have mechanisms to identify changes to internal governance and external
compliance requirements. If no requirements are identified, validate that you have applied due
diligence to this determination. Review your priorities regularly so that they can be updated as
needs change.
Evaluate threats to the business (for example, business risk and liabilities, and information security
threats) and maintain this information in a risk registry. Evaluate the impact of risks, and tradeoffs
between competing interests or alternative approaches. For example, accelerating speed to market
for new features may be emphasized over cost optimization, or you may choose a relational
database for non-relational data to simplify the effort to migrate a system without refactoring.
Manage benefits and risks to make informed decisions when determining where to focus efforts.
Some risks or choices may be acceptable for a time, it may be possible to mitigate associated risks,
assumptions, and reduce the risk of confirmation bias. Grow inclusion, diversity, and accessibility
within your teams to gain beneficial perspectives.
If there are external regulatory or compliance requirements that apply to your organization,
you should use the resources provided by AWS Cloud Compliance to help educate your teams
so that they can determine the impact on your priorities. The Well-Architected Framework
emphasizes learning, measuring, and improving. It provides a consistent approach for you to
evaluate architectures, and implement designs that will scale over time. AWS provides the
AWS Well-Architected Tool to help you review your approach before development, the state
of your workloads before production, and the state of your workloads in production. You can
compare workloads to the latest AWS architectural best practices, monitor their overall status,
and gain insight into potential risks. AWS Trusted Advisor is a tool that provides access to a core
set of checks that recommend optimizations that may help shape your priorities. Business and
Enterprise Support customers receive access to additional checks focusing on security, reliability,
performance, cost-optimization, and sustainability that can further help shape their priorities.
AWS can help you educate your teams about AWS and its services to increase their understanding
of how their choices can have an impact on your workload. Use the resources provided by AWS
Support (AWS Knowledge Center, AWS Discussion Forums, and AWS Support Center) and AWS
Documentation to educate your teams. Reach out to AWS Support through AWS Support Center
for help with your AWS questions. AWS also shares best practices and patterns that we have
learned through the operation of AWS in The Amazon Builders' Library. A wide variety of other
useful information is available through the AWS Blog and The Official AWS Podcast. AWS Training
and Certification provides some training through self-paced digital courses on AWS fundamentals.
You can also register for instructor-led training to further support the development of your teams’
AWS skills.
Use tools or services that permit you to centrally govern your environments across accounts,
such as AWS Organizations, to help manage your operating models. Services like AWS Control
Tower expand this management capability by allowing you to define blueprints (supporting your
operating models) for the setup of accounts, apply ongoing governance using AWS Organizations,
and automate provisioning of new accounts. Managed Services providers such as AWS Managed
Services, AWS Managed Services Partners, or Managed Services Providers in the AWS Partner
Network, provide expertise implementing cloud environments, and support your security and
compliance requirements and business goals. Adding Managed Services to your operating model
can save you time and resources, and lets you keep your internal teams lean and focused on
strategic outcomes that will differentiate your business, rather than developing new skills and
capabilities.
Prepare
To prepare for operational excellence, you have to understand your workloads and their expected
behaviors. You will then be able to design them to provide insight to their status and build the
procedures to support them.
Design your workload so that it provides the information necessary for you to understand its
internal state (for example, metrics, logs, events, and traces) across all components in support of
observability and investigating issues. Observability goes beyond simple monitoring, providing
a comprehensive understanding of a system's internal workings based on its external outputs.
Rooted in metrics, logs, and traces, observability offers profound insights into system behavior and
dynamics. With effective observability, teams can discern patterns, anomalies, and trends, allowing
them to proactively address potential issues and maintain optimal system health. Identifying key
performance indicators (KPIs) is pivotal to ensure alignment between monitoring activities and
business objectives. This alignment ensures that teams are making data-driven decisions using
metrics that genuinely matter, optimizing both system performance and business outcomes.
Furthermore, observability empowers businesses to be proactive rather than reactive. Teams can
understand the cause-and-effect relationships within their systems, predicting and preventing
issues rather than just reacting to them. As workloads evolve, it's essential to revisit and refine the
observability strategy, ensuring it remains relevant and effective.
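As one way to tie telemetry to KPIs, the sketch below publishes a hypothetical business KPI as a custom Amazon CloudWatch metric with boto3. The namespace, metric name, and dimension values are illustrative assumptions, not prescribed by the Framework.

import boto3

# Minimal sketch: publish a hypothetical business KPI as a custom CloudWatch metric.
# The namespace, metric name, and dimension are illustrative assumptions.
cloudwatch = boto3.client("cloudwatch")

def publish_orders_processed_kpi(order_count: int, environment: str = "production") -> None:
    # Emit the KPI so dashboards and alarms can track it alongside system metrics.
    cloudwatch.put_metric_data(
        Namespace="ExampleWorkload/Business",  # hypothetical namespace
        MetricData=[
            {
                "MetricName": "OrdersProcessed",
                "Dimensions": [{"Name": "Environment", "Value": environment}],
                "Value": float(order_count),
                "Unit": "Count",
            }
        ],
    )

if __name__ == "__main__":
    publish_orders_processed_kpi(42)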
Adopt approaches that improve the flow of changes into production and that achieve refactoring,
fast feedback on quality, and bug fixing. These accelerate beneficial changes entering production,
limit issues deployed, and activate rapid identification and remediation of issues introduced
through deployment activities or discovered in your environments.
Adopt approaches that provide fast feedback on quality and achieve rapid recovery from changes
that do not have desired outcomes. Using these practices mitigates the impact of issues introduced
through the deployment of changes. Plan for unsuccessful changes so that you are able to respond
faster if necessary and test and validate the changes you make. Be aware of planned activities
in your environments so that you can manage the risk of changes impacting planned activities.
Emphasize frequent, small, reversible changes to limit the scope of change. This results in faster
troubleshooting and remediation with the option to roll back a change. It also means you are able
to get the benefit of valuable changes more frequently.
Evaluate the operational readiness of your workload, processes, procedures, and personnel to
understand the operational risks related to your workload. Use a consistent process (including
manual or automated checklists) to know when you are ready to go live with your workload or
OPS 7: How do you know that you are ready to support a workload?
Evaluate the operational readiness of your workload, processes and procedures, and personnel
to understand the operational risks related to your workload.
Operate
Observability allows you to focus on meaningful data and understand your workload's interactions
and output. By concentrating on essential insights and eliminating unnecessary data, you maintain
a straightforward approach to understanding workload performance. It's essential not only
to collect data but also to interpret it correctly. Define clear baselines, set appropriate alert
thresholds, and actively monitor for any deviations. A shift in a key metric, especially when
correlated with other data, can pinpoint specific problem areas. With observability, you're better
equipped to foresee and address potential challenges, ensuring that your workload operates
smoothly and meets business needs.
Prepare and validate procedures for responding to events to minimize their disruption to your
workload.
All of the metrics you collect should be aligned to a business need and the outcomes they support.
Develop scripted responses to well-understood events and automate their performance in
response to recognizing the event.
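The following is a minimal sketch of what such a scripted response might look like as an AWS Lambda-style handler that reboots an affected instance. The event shape (an instance ID under event['detail']['instance-id']) and the choice of remediation are assumptions for illustration only.

import boto3

ec2 = boto3.client("ec2")

def lambda_handler(event, context):
    # Assumed event shape: an EventBridge event carrying the affected instance ID.
    instance_id = event["detail"]["instance-id"]

    # Scripted response to a well-understood event: reboot the instance.
    # A real runbook would add guardrails such as rate limits and approvals.
    ec2.reboot_instances(InstanceIds=[instance_id])

    return {"remediated_instance": instance_id}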
Evolve
Learn, share, and continuously improve to sustain operational excellence. Dedicate work cycles
to making nearly continuous incremental improvements. Perform post-incident analysis of all
customer impacting events. Identify the contributing factors and preventative action to limit or
prevent recurrence. Communicate contributing factors with affected communities as appropriate.
Regularly evaluate and prioritize opportunities for improvement (for example, feature requests,
issue remediation, and compliance requirements), including both the workload and operations
procedures.
Include feedback loops within your procedures to rapidly identify areas for improvement and
capture learnings from running operations.
Share lessons learned across teams to share the benefits of those lessons. Analyze trends within
lessons learned and perform cross-team retrospective analysis of operations metrics to identify
opportunities and methods for improvement. Implement changes intended to bring about
improvement and evaluate the results to determine success.
On AWS, you can export your log data to Amazon S3 or send logs directly to Amazon S3 for long-
term storage. Using AWS Glue, you can discover and prepare your log data in Amazon S3 for
analytics, and store associated metadata in the AWS Glue Data Catalog. Amazon Athena, through
its native integration with AWS Glue, can then be used to analyze your log data, querying it using
standard SQL. Using a business intelligence tool like Amazon QuickSight, you can visualize, explore,
and analyze your data, discovering trends and events of interest that may drive improvement.
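A minimal sketch of the Athena step, assuming a Glue database and table over the exported logs already exist; the database, table, query, and output location below are illustrative:

import time
import boto3

athena = boto3.client("athena")

# Hypothetical names: a Glue database/table over logs exported to Amazon S3.
DATABASE = "example_logs_db"
OUTPUT_LOCATION = "s3://example-athena-results/well-architected/"

def run_error_trend_query() -> list:
    # Standard SQL over log data catalogued in AWS Glue.
    query = """
        SELECT date_trunc('hour', from_iso8601_timestamp(timestamp)) AS hour,
               count(*) AS error_count
        FROM application_logs
        WHERE level = 'ERROR'
        GROUP BY 1
        ORDER BY 1
    """
    execution = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
    )
    query_id = execution["QueryExecutionId"]

    # Poll until the query finishes, then fetch the result rows.
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]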
• Definition
• Best practices
• Resources
Design principles
In the cloud, there are a number of principles that can help you strengthen your workload security:
• Implement a strong identity foundation: Implement the principle of least privilege and
enforce separation of duties with appropriate authorization for each interaction with your AWS
resources. Centralize identity management, and aim to eliminate reliance on long-term static
credentials.
• Maintain traceability: Monitor, alert, and audit actions and changes to your environment in
real time. Integrate log and metric collection with systems to automatically investigate and take
action.
• Apply security at all layers: Apply a defense in depth approach with multiple security controls.
Apply to all layers (for example, edge of network, VPC, load balancing, every instance and
compute service, operating system, application, and code).
• Automate security best practices: Automated software-based security mechanisms improve
your ability to securely scale more rapidly and cost-effectively. Create secure architectures,
including the implementation of controls that are defined and managed as code in version-
controlled templates.
• Protect data in transit and at rest: Classify your data into sensitivity levels and use mechanisms,
such as encryption, tokenization, and access control where appropriate.
• Keep people away from data: Use mechanisms and tools to reduce or eliminate the need for
direct access or manual processing of data. This reduces the risk of mishandling or modification
and human error when handling sensitive data.
• Prepare for security events: Prepare for an incident by having incident management and
investigation policy and processes that align to your organizational requirements. Run incident
response simulations and use tools with automation to increase your speed for detection,
investigation, and recovery.
Definition
There are seven best practice areas for security in the cloud:
Security
The following question focuses on these considerations for security. (For a list of security questions
and best practices, see the Appendix.)
To operate your workload securely, you must apply overarching best practices to every area of
security. Take requirements and processes that you have defined in operational excellence at an
organizational and workload level, and apply them to all areas.
Staying up to date with recommendations from AWS, industry sources, and threat intelligence helps you evolve your threat model and control objectives. Automating security processes, testing, and validation allows you to scale your security operations.
In AWS, segregating different workloads by account, based on their function and compliance or
data sensitivity requirements, is a recommended approach.
Identity and access management are key parts of an information security program, ensuring that
only authorized and authenticated users and components are able to access your resources, and
only in a manner that you intend. For example, you should define principals (that is, accounts,
users, roles, and services that can perform actions in your account), build out policies aligned with
these principals, and implement strong credential management. These privilege-management
elements form the core of authentication and authorization.
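As a minimal sketch of building a policy aligned with a principal, the following boto3 call creates a customer managed IAM policy that grants read-only access to a single S3 prefix. The bucket, prefix, and policy name are hypothetical.

import json
import boto3

iam = boto3.client("iam")

# Least-privilege sketch: read-only access to one hypothetical S3 prefix.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadReportsPrefixOnly",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-workload-bucket/reports/*",
        }
    ],
}

iam.create_policy(
    PolicyName="ExampleReportsReadOnly",  # hypothetical policy name
    PolicyDocument=json.dumps(policy_document),
    Description="Illustrative least-privilege policy scoped to a single prefix.",
)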
In AWS, privilege management is primarily supported by the AWS Identity and Access Management
(IAM) service, which allows you to control user and programmatic access to AWS services and
resources. You should apply granular policies, which assign permissions to a user, group, role, or
resource. You also have the ability to require strong password practices, such as complexity level,
avoiding re-use, and enforcing multi-factor authentication (MFA). You can use federation with your
existing directory service. For workloads that require systems to have access to AWS, IAM allows for
secure access through roles, instance profiles, identity federation, and temporary credentials.
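The sketch below shows one way temporary credentials can be obtained with AWS STS and used for short-lived, scoped access. The role ARN and session settings are illustrative assumptions.

import boto3

sts = boto3.client("sts")

# Assume a hypothetical role to obtain short-lived credentials instead of
# embedding long-term static keys in the workload.
response = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/ExampleWorkloadReadOnly",  # illustrative ARN
    RoleSessionName="well-architected-example",
    DurationSeconds=900,  # short session; credentials expire automatically
)
credentials = response["Credentials"]

# Use the temporary credentials for a scoped client session.
s3 = boto3.client(
    "s3",
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)
print(s3.list_buckets()["Buckets"])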
Capture and analyze events from logs and metrics to gain visibility. Take action on security
events and potential threats to help secure your workload.
Log management is important to a Well-Architected workload for reasons ranging from security
or forensics to regulatory or legal requirements. It is critical that you analyze logs and respond to
them so that you can identify potential security incidents. AWS provides functionality that makes
log management easier to implement by giving you the ability to define a data-retention lifecycle
or define where data will be preserved, archived, or eventually deleted. This makes predictable and
reliable data handling simpler and more cost effective.
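A minimal sketch of such a data-retention lifecycle, assuming a hypothetical central log bucket and prefix: logs transition to archival storage after 90 days and are deleted after a year.

import boto3

s3 = boto3.client("s3")

# Illustrative retention lifecycle for a hypothetical log bucket/prefix:
# archive to Amazon S3 Glacier after 90 days, delete after 365 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-central-logs",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire-app-logs",
                "Filter": {"Prefix": "app-logs/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)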
Infrastructure protection
In AWS, you can implement stateful and stateless packet inspection, either by using AWS-native
technologies or by using partner products and services available through the AWS Marketplace.
You should use Amazon Virtual Private Cloud (Amazon VPC) to create a private, secured, and
scalable environment in which you can define your topology—including gateways, routing tables,
and public and private subnets.
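A minimal boto3 sketch of defining that kind of topology, creating a VPC with one public and one private subnet; the CIDR ranges and Availability Zone are illustrative assumptions.

import boto3

ec2 = boto3.client("ec2")

# Illustrative topology: one VPC, a public subnet with an internet route,
# and a private subnet with no internet route. CIDRs and AZ are assumptions.
vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]

public_subnet = ec2.create_subnet(
    VpcId=vpc_id, CidrBlock="10.0.1.0/24", AvailabilityZone="us-east-1a"
)["Subnet"]["SubnetId"]
private_subnet = ec2.create_subnet(
    VpcId=vpc_id, CidrBlock="10.0.2.0/24", AvailabilityZone="us-east-1a"
)["Subnet"]["SubnetId"]

# Internet gateway and a route table that only the public subnet uses.
igw_id = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)

public_rt = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
ec2.create_route(RouteTableId=public_rt, DestinationCidrBlock="0.0.0.0/0", GatewayId=igw_id)
ec2.associate_route_table(RouteTableId=public_rt, SubnetId=public_subnet)

print(f"VPC {vpc_id}: public {public_subnet}, private {private_subnet}")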
Any workload that has some form of network connectivity, whether it’s the internet or a private
network, requires multiple layers of defense to help protect from external and internal network-
based threats.
• AWS never initiates the movement of data between Regions. Content placed in a Region will
remain in that Region unless you explicitly use a feature or leverage a service that provides that
functionality.
Classification provides a way to categorize data based on criticality and sensitivity, in order to help you determine appropriate protection and retention controls.
Protect your data at rest by implementing multiple controls, to reduce the risk of unauthorized
access or mishandling.
Protect your data in transit by implementing multiple controls to reduce the risk of unauthorized access or loss.
AWS provides multiple means for encrypting data at rest and in transit. We build features into our
services that make it easier to encrypt your data. For example, we have implemented server-side
encryption (SSE) for Amazon S3 to make it easier for you to store your data in an encrypted form.
You can also arrange for the entire HTTPS encryption and decryption process (generally known as
SSL termination) to be handled by Elastic Load Balancing (ELB).
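As a minimal sketch, the following call turns on default server-side encryption for a hypothetical bucket so new objects are encrypted at rest with an AWS KMS key; the bucket name and key alias are assumptions.

import boto3

s3 = boto3.client("s3")

# Default encryption at rest for a hypothetical bucket, using an assumed KMS key alias.
s3.put_bucket_encryption(
    Bucket="example-sensitive-data",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "alias/example-data-key",  # illustrative alias
                },
                "BucketKeyEnabled": True,  # reduce KMS request volume
            }
        ]
    },
)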
Incident response
Even with extremely mature preventive and detective controls, your organization should still
put processes in place to respond to and mitigate the potential impact of security incidents. The
architecture of your workload strongly affects the ability of your teams to operate effectively
during an incident, to isolate or contain systems, and to restore operations to a known good state.
Putting in place the tools and access ahead of a security incident, then routinely practicing incident
The cost and complexity to resolve defects is typically lower the earlier you are in the SDLC. The
easiest way to resolve issues is to not have them in the first place, which is why starting with
a threat model helps you focus on the right outcomes from the design phase. As your AppSec
program matures, you can increase the amount of testing that is performed using automation,
improve the fidelity of feedback to builders, and reduce the time needed for security reviews. All of
these actions improve the quality of the software you build, and increase the speed of delivering
features into production.
These implementation guidelines focus on four areas: organization and culture, security of the
pipeline, security in the pipeline, and dependency management. Each area provides a set of
principles that you can implement and provides an end-to-end view of how you design, develop,
build, deploy, and operate workloads.
In AWS, there are a number of approaches you can use when addressing your application security
program. Some of these approaches rely on technology while others focus on the people and
organizational aspects of your application security program.
SEC 11: How do you incorporate and validate the security properties of applications
throughout the design, development, and deployment lifecycle?
Training people, testing using automation, understanding dependencies, and validating the
security properties of tools and applications help to reduce the likelihood of security issues in
production workloads.
Resources
Refer to the following resources to learn more about our best practices for Security.
Documentation
around or repair the failure. With more sophisticated automation, it’s possible to anticipate and
remediate failures before they occur.
• Test recovery procedures: In an on-premises environment, testing is often conducted to prove
that the workload works in a particular scenario. Testing is not typically used to validate recovery
strategies. In the cloud, you can test how your workload fails, and you can validate your recovery
procedures. You can use automation to simulate different failures or to recreate scenarios that
led to failures before. This approach exposes failure pathways that you can test and fix before a
real failure scenario occurs, thus reducing risk.
• Scale horizontally to increase aggregate workload availability: Replace one large resource
with multiple small resources to reduce the impact of a single failure on the overall workload.
Distribute requests across multiple, smaller resources to verify that they don’t share a common
point of failure.
• Manage change through automation: Changes to your infrastructure should be made using
automation. The changes that must be managed include changes to the automation, which then
can be tracked and reviewed.
Definition
There are four best practice areas for reliability in the cloud:
• Foundations
• Workload architecture
• Change management
• Failure management
To achieve reliability, you must start with the foundations — an environment where Service Quotas
and network topology accommodate the workload. The workload architecture of the distributed
Workloads often exist in multiple environments. These include multiple cloud environments
(both publicly accessible and private) and possibly your existing data center infrastructure. Plans
must include network considerations such as intra- and inter-system connectivity, public IP
address management, private IP address management, and domain name resolution.
Workload architecture
A reliable workload starts with upfront design decisions for both software and infrastructure. Your
architecture choices will impact your workload behavior across all of the Well-Architected pillars.
For reliability, there are specific patterns you must follow.
With AWS, workload developers have their choice of languages and technologies to use. AWS SDKs
take the complexity out of coding by providing language-specific APIs for AWS services. These
SDKs, plus the choice of languages, permits developers to implement the reliability best practices
listed here. Developers can also read about and learn from how Amazon builds and operates
software in The Amazon Builders' Library.
Build highly scalable and reliable workloads using a service-oriented architecture (SOA) or
a microservices architecture. Service-oriented architecture (SOA) is the practice of making
software components reusable via service interfaces. Microservices architecture goes further to
make components smaller and simpler.
Controlled changes are necessary to deploy new functionality, and to verify that the workloads
and the operating environment are running known software and can be patched or replaced in
a predictable manner. If these changes are uncontrolled, it is difficult to predict their effect or to address issues that arise because of them.
When you architect a workload to automatically add and remove resources in response to changes
in demand, this not only increases reliability but also validates that business success doesn't
become a burden. With monitoring in place, your team will be automatically alerted when KPIs
deviate from expected norms. Automatic logging of changes to your environment permits you
to audit and quickly identify actions that might have impacted reliability. Controls on change
management certify that you can enforce the rules that deliver the reliability you need.
Failure management
In any system of reasonable complexity, it is expected that failures will occur. Reliability requires
that your workload be aware of failures as they occur and take action to avoid impact on
availability. Workloads must be able to both withstand failures and automatically repair issues.
With AWS, you can take advantage of automation to react to monitoring data. For example, when a
particular metric crosses a threshold, you can initiate an automated action to remedy the problem.
Also, rather than trying to diagnose and fix a failed resource that is part of your production
environment, you can replace it with a new one and carry out the analysis on the failed resource
out of band. Since the cloud allows you to stand up temporary versions of a whole system at low
cost, you can use automated testing to verify full recovery processes.
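A minimal sketch of initiating an automated action when a metric crosses a threshold: the alarm below notifies an assumed Amazon SNS topic, which could front a remediation runbook, when average CPU on an instance stays above 80% for ten minutes. The topic ARN, instance ID, and threshold are illustrative.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Illustrative alarm: when average CPU stays above 80% for two 5-minute periods,
# notify an assumed SNS topic that fronts an automated remediation runbook.
cloudwatch.put_metric_alarm(
    AlarmName="example-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # illustrative ID
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:example-remediation-topic"],
)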
Back up data, applications, and configuration to meet your requirements for recovery time
objectives (RTO) and recovery point objectives (RPO).
customers, even in the face of sustained problems. Your recovery processes should be as well
exercised as your normal production processes.
Resources
Refer to the following resources to learn more about our best practices for Reliability.
Documentation
• AWS Documentation
Whitepaper
Performance efficiency
The performance efficiency pillar includes the ability to use cloud resources efficiently to meet
performance requirements, and to maintain that efficiency as demand changes and technologies
evolve.
The performance efficiency pillar provides an overview of design principles, best practices, and
questions. You can find prescriptive guidance on implementation in the Performance Efficiency
Pillar whitepaper.
Topics
• Design principles
• Definition
• Best practices
• Resources
Reviewing your choices on a regular basis validates that you are taking advantage of the
continually evolving AWS Cloud. Monitoring verifies that you are aware of any deviance from
expected performance. Make trade-offs in your architecture to improve performance, such as using
compression or caching, or relaxing consistency requirements.
Best practices
Topics
• Architecture selection
• Compute and hardware
• Data management
• Networking and content delivery
• Process and culture
Architecture selection
The optimal solution for a particular workload varies, and solutions often combine multiple
approaches. Well-Architected workloads use multiple solutions and allow different features to
improve performance.
AWS resources are available in many types and configurations, which makes it easier to find an
approach that closely matches your needs. You can also find options that are not easily achievable
with on-premises infrastructure. For example, a managed service such as Amazon DynamoDB
provides a fully managed NoSQL database with single-digit millisecond latency at any scale.
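A minimal usage sketch for such a managed service, writing and reading an item from a hypothetical DynamoDB table; the table name and key schema are assumptions.

import boto3

# Hypothetical table with a partition key named "customer_id".
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("ExampleCustomers")

# Write an item, then read it back by key; DynamoDB handles scaling and replication.
table.put_item(Item={"customer_id": "c-1001", "plan": "standard", "active": True})
item = table.get_item(Key={"customer_id": "c-1001"}).get("Item")
print(item)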
The following question focuses on these considerations for performance efficiency. (For a list of
performance efficiency questions and best practices, see the Appendix.)
PERF 1: How do you select appropriate cloud resources and architecture patterns for your
workload?
Often, multiple approaches are required for more effective performance across a workload.
Well-Architected systems use multiple solutions and features to improve performance.
of access (online, offline, archival), frequency of update (WORM, dynamic), and availability and
durability constraints. Well-Architected workloads use purpose-built data stores which allow
different features to improve performance.
• Object storage provides a scalable, durable platform to make data accessible from any internet
location for user-generated content, active archive, serverless computing, Big Data storage or
backup and recovery. Amazon Simple Storage Service (Amazon S3) is an object storage service
that offers industry-leading scalability, data availability, security, and performance. Amazon S3
is designed for 99.999999999% (11 9's) of durability, and stores data for millions of applications
for companies all around the world.
• Block storage provides highly available, consistent, low-latency block storage for each virtual
host and is analogous to direct-attached storage (DAS) or a Storage Area Network (SAN). Amazon
Elastic Block Store (Amazon EBS) is designed for workloads that require persistent storage
accessible by EC2 instances that helps you tune applications with the right storage capacity,
performance and cost.
• File storage provides access to a shared file system across multiple systems. File storage
solutions like Amazon Elastic File System (Amazon EFS) are ideal for use cases such as large
content repositories, development environments, media stores, or user home directories.
Amazon FSx makes it efficient and cost effective to launch and run popular file systems so
you can leverage the rich feature sets and fast performance of widely used open source and
commercially-licensed file systems.
PERF 3: How do you store, manage, and access data in your workload?
The most efficient storage solution for a system varies based on the kind of access operation
(block, file, or object), patterns of access (random or sequential), required throughput, frequency
of access (online, offline, archival), frequency of update (WORM, dynamic), and availability and
durability constraints. Well-architected systems use multiple storage solutions and turn on
different features to improve performance and use resources efficiently.
key metrics are capturing time-to-first-byte or rendering. Other generally applicable metrics
include thread count, garbage collection rate, and wait states. Business metrics, such as the
aggregate cumulative cost per request, can alert you to ways to drive down costs. Carefully
consider how you plan to interpret metrics. For example, you could choose the maximum or 99th
percentile instead of the average.
• Load generation: You should create a series of test scripts that replicate synthetic or prerecorded
user journeys. These scripts should be idempotent and not coupled, and you might need to
include pre-warming scripts to yield valid results. As much as possible, your test scripts should
replicate the behavior of usage in production. You can use software or software-as-a-service
(SaaS) solutions to generate the load. Consider using AWS Marketplace solutions and Spot
Instances — they can be cost-effective ways to generate the load.
• Performance visibility: Key metrics should be visible to your team, especially metrics against
each build version. This allows you to see any significant positive or negative trend over time.
You should also display metrics on the number of errors or exceptions to make sure you are
testing a working system.
• Visualization: Use visualization techniques that make it clear where performance issues, hot
spots, wait states, or low utilization is occurring. Overlay performance metrics over architecture
diagrams — call graphs or code can help identify issues quickly.
• Regular review process: A poorly performing architecture is usually the result of a nonexistent or broken performance review process. If your architecture is performing poorly, implementing a
performance review process allows you to drive iterative improvement.
The cost optimization pillar provides an overview of design principles, best practices, and
questions. You can find prescriptive guidance on implementation in the Cost Optimization Pillar
whitepaper.
Topics
• Design principles
• Definition
• Best practices
• Resources
Design principles
There are five design principles for cost optimization in the cloud:
• Implement Cloud Financial Management: To achieve financial success and accelerate business
value realization in the cloud, invest in Cloud Financial Management and Cost Optimization.
Your organization should dedicate time and resources to build capability in this new domain
of technology and usage management. Similar to your Security or Operational Excellence
capability, you need to build capability through knowledge building, programs, resources, and
processes to become a cost-efficient organization.
• Adopt a consumption model: Pay only for the computing resources that you require and
increase or decrease usage depending on business requirements, not by using elaborate
forecasting. For example, development and test environments are typically only used for eight
hours a day during the work week. You can stop these resources when they are not in use for a
potential cost savings of 75% (40 hours versus 168 hours).
• Measure overall efficiency: Measure the business output of the workload and the costs
associated with delivering it. Use this measure to know the gains you make from increasing
output and reducing costs.
• Stop spending money on undifferentiated heavy lifting: AWS does the heavy lifting of data
center operations like racking, stacking, and powering servers. It also removes the operational
burden of managing operating systems and applications with managed services. This permits
you to focus on your customers and business projects rather than on IT infrastructure.
• Analyze and attribute expenditure: The cloud makes it simple to accurately identify the
usage and cost of systems, which then permits transparent attribution of IT costs to individual
With the adoption of cloud, technology teams innovate faster due to shortened approval,
procurement, and infrastructure deployment cycles. A new approach to financial management
in the cloud is required to realize business value and financial success. This approach is Cloud
Financial Management, and builds capability across your organization by implementing
organization-wide knowledge building, programs, resources, and processes.
Many organizations are composed of many different units with different priorities. The ability to
align your organization to an agreed set of financial objectives, and provide your organization the
mechanisms to meet them, will create a more efficient organization. A capable organization will
innovate and build faster, be more agile and adjust to any internal or external factors.
In AWS you can use Cost Explorer, and optionally Amazon Athena and Amazon QuickSight
with the Cost and Usage Report (CUR), to provide cost and usage awareness throughout your
organization. AWS Budgets provides proactive notifications for cost and usage. The AWS blogs
provide information on new services and features to verify you keep up to date with new service
releases.
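A minimal sketch of building that cost awareness programmatically with the Cost Explorer API, grouping monthly unblended cost by service; the date range is an illustrative assumption.

import boto3

ce = boto3.client("ce")

# Illustrative date range; group monthly unblended cost by service for awareness reporting.
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-04-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

for period in response["ResultsByTime"]:
    print(period["TimePeriod"]["Start"])
    for group in period["Groups"]:
        amount = group["Metrics"]["UnblendedCost"]["Amount"]
        print(f"  {group['Keys'][0]}: ${float(amount):.2f}")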
The following question focuses on these considerations for cost optimization. (For a list of cost
optimization questions and best practices, see the Appendix.)
Implementing Cloud Financial Management helps organizations realize business value and
financial success as they optimize their cost and usage and scale on AWS.
When building a cost optimization function, use members from across your organization and supplement the team with experts in CFM and cost optimization. Existing team members will understand how the organization
currently functions and how to rapidly implement improvements. Also consider including people
with supplementary or specialist skill sets, such as analytics and project management.
When implementing cost awareness in your organization, improve or build on existing programs
and processes. It is much faster to add to what exists than to build new processes and programs.
This will result in achieving outcomes much faster.
Implement change control and resource management from project inception to end-of-life. This
facilitates shutting down unused resources to reduce waste.
You can use cost allocation tags to categorize and track your AWS usage and costs. When you apply
tags to your AWS resources (such as EC2 instances or S3 buckets), AWS generates a cost and usage
report with your usage and your tags. You can apply tags that represent organization categories
(such as cost centers, workload names, or owners) to organize your costs across multiple services.
Verify that you use the right level of detail and granularity in cost and usage reporting and
monitoring. For high level insights and trends, use daily granularity with AWS Cost Explorer. For
deeper analysis and inspection use hourly granularity in AWS Cost Explorer, or Amazon Athena and
Amazon QuickSight with the Cost and Usage Report (CUR) at an hourly granularity.
Combining tagged resources with entity lifecycle tracking (employees, projects) makes it
possible to identify orphaned resources or projects that are no longer generating value to the
organization and should be decommissioned. You can set up billing alerts to notify you of
predicted overspending.
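One hedged sketch of such an alert, using AWS Budgets to notify a hypothetical email address when forecasted monthly cost exceeds 80% of an assumed 1,000 USD budget:

import boto3

budgets = boto3.client("budgets")
account_id = boto3.client("sts").get_caller_identity()["Account"]

# Illustrative budget: alert when forecasted spend exceeds 80% of 1,000 USD per month.
budgets.create_budget(
    AccountId=account_id,
    Budget={
        "BudgetName": "example-monthly-cost-budget",
        "BudgetLimit": {"Amount": "1000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "FORECASTED",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "finops@example.com"}  # illustrative
            ],
        }
    ],
)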
Cost-effective resources
Using the appropriate instances and resources for your workload is key to cost savings. For
example, a reporting process might take five hours to run on a smaller server but one hour to run
on a larger server that is twice as expensive. Both servers give you the same outcome, but the
smaller server incurs more cost over time.
A well-architected workload uses the most cost-effective resources, which can have a significant
and positive economic impact. You also have the opportunity to use managed services to reduce
costs. For example, rather than maintaining servers to deliver email, you can use a service that
charges on a per-message basis.
AWS offers a variety of flexible and cost-effective pricing options to acquire instances from Amazon
EC2 and other services in a way that more effectively fits your needs. On-Demand Instances
permit you to pay for compute capacity by the hour, with no minimum commitments required.
Savings Plans and Reserved Instances offer savings of up to 75% off On-Demand pricing. With Spot
Instances, you can leverage unused Amazon EC2 capacity and offer savings of up to 90% off On-
By factoring in cost during service selection, and using tools such as Cost Explorer and AWS Trusted
Advisor to regularly review your AWS usage, you can actively monitor your utilization and adjust
your deployments accordingly.
When you move to the cloud, you pay only for what you need. You can supply resources to match the workload demand at the time they’re needed, which decreases the need for costly and wasteful overprovisioning. You can also modify the demand, using a throttle, buffer, or queue to smooth the demand and serve it with fewer resources, resulting in a lower cost, or process it at a later time with a batch service.
In AWS, you can automatically provision resources to match the workload demand. Auto Scaling
using demand or time-based approaches permits you to add and remove resources as needed. If
you can anticipate changes in demand, you can save more money and validate that your resources
match your workload needs. You can use Amazon API Gateway to implement throttling, or Amazon SQS to implement a queue in your workload. These will both permit you to modify the demand on your workload components.
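A minimal sketch of queue-based demand smoothing with Amazon SQS, assuming a hypothetical queue URL: producers enqueue work as it arrives, and a consumer drains it at a rate the downstream resources can sustain.

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-work-queue"  # illustrative

def process(body: str) -> None:
    # Placeholder for the real work performed per message.
    print(f"processing: {body}")

def enqueue(job_payload: str) -> None:
    # Producers absorb demand spikes by buffering work in the queue.
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=job_payload)

def drain_once(batch_size: int = 10) -> int:
    # The consumer processes at its own pace, smoothing demand on downstream resources.
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=batch_size, WaitTimeSeconds=20
    )
    messages = response.get("Messages", [])
    for message in messages:
        process(message["Body"])
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
    return len(messages)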
For a workload that has balanced spend and performance, verify that everything you pay for
is used and avoid significantly underutilizing instances. A skewed utilization metric in either
direction has an adverse impact on your organization, in either operational costs (degraded
performance due to over-utilization), or wasted AWS expenditures (due to over-provisioning).
When designing to modify demand and supply resources, actively think about the patterns of
usage, the time it takes to provision new resources, and the predictability of the demand pattern.
When managing demand, verify you have a correctly sized queue or buffer, and that you are
responding to workload demand in the required amount of time.
As AWS releases new services and features, it's a best practice to review your existing architectural
decisions to verify they continue to be the most cost effective. As your requirements change, be
aggressive in decommissioning resources, entire services, and systems that you no longer require.
Sustainability
The Sustainability pillar focuses on environmental impacts, especially energy consumption
and efficiency, since they are important levers for architects to inform direct action to reduce
resource usage. You can find prescriptive guidance on implementation in the Sustainability Pillar
whitepaper.
Topics
• Design principles
• Definition
• Best practices
• Resources
Design principles
There are six design principles for sustainability in the cloud:
• Understand your impact: Measure the impact of your cloud workload and model the future
impact of your workload. Include all sources of impact, including impacts resulting from
customer use of your products, and impacts resulting from their eventual decommissioning and
retirement. Compare the productive output with the total impact of your cloud workloads by
reviewing the resources and emissions required per unit of work. Use this data to establish key
performance indicators (KPIs), evaluate ways to improve productivity while reducing impact, and
estimate the impact of proposed changes over time.
• Establish sustainability goals: For each cloud workload, establish long-term sustainability
goals such as reducing the compute and storage resources required per transaction. Model the
return on investment of sustainability improvements for existing workloads, and give owners the
resources they must invest in sustainability goals. Plan for growth, and architect your workloads
so that growth results in reduced impact intensity measured against an appropriate unit, such
as per user or per transaction. Goals help you support the wider sustainability goals of your
business or organization, identify regressions, and prioritize areas of potential improvement.
• Maximize utilization: Right-size workloads and implement efficient design to verify high
utilization and maximize the energy efficiency of the underlying hardware. Two hosts running
at 30% utilization are less efficient than one host running at 60% due to baseline power
consumption per host. At the same time, reduce or minimize idle resources, processing, and
storage to reduce the total energy required to power your workload.
Best practices
Topics
• Region selection
• Alignment to demand
• Software and architecture
• Data management
• Hardware and services
• Process and culture
Region selection
The choice of Region for your workload significantly affects its KPIs, including performance, cost,
and carbon footprint. To improve these KPIs, you should choose Regions for your workloads based
on both business requirements and sustainability goals.
The following question focuses on these considerations for sustainability. (For a list of
sustainability questions and best practices, see the Appendix.)
The choice of Region for your workload significantly affects its KPIs, including performance, cost, and carbon footprint. To improve these KPIs, you should choose Regions for your workloads based on both business requirements and sustainability goals.
Alignment to demand
The way users and applications consume your workloads and other resources can help you identify
improvements to meet sustainability goals. Scale infrastructure to continually match demand and
verify that you use only the minimum resources required to support your users. Align service levels
to customer needs. Position resources to limit the network required for users and applications to
consume them. Remove unused assets. Provide your team members with devices that support their
needs and minimize their sustainability impact.
lack of use because of changes in user behavior over time. Revise patterns and architecture to
consolidate under-utilized components to increase overall utilization. Retire components that are
no longer required. Understand the performance of your workload components, and optimize the
components that consume the most resources. Be aware of the devices that your customers use to
access your services, and implement patterns to minimize the need for device upgrades.
SUS 3: How do you take advantage of software and architecture patterns to support your
sustainability goals?
Implement patterns for performing load smoothing and maintaining consistent high utilization of deployed resources to minimize the resources consumed. Components might become idle from lack of use because of changes in user behavior over time. Revise patterns and architecture to consolidate under-utilized components to increase overall utilization. Retire components that are no longer required. Understand the performance of your workload components, and optimize the components that consume the most resources. Be aware of the devices that your customers use to access your services, and implement patterns to minimize the need for device upgrades.
Optimize software and architecture for asynchronous and scheduled jobs: Use efficient software
designs and architectures to minimize the average resources required per unit of work. Implement
mechanisms that result in even utilization of components to reduce resources that are idle between
tasks and minimize the impact of load spikes.
Remove or refactor workload components with low or no use: Monitor workload activity to identify
changes in utilization of individual components over time. Remove components that are unused
and no longer required, and refactor components with little utilization, to limit wasted resources.
Optimize areas of code that consume the most time or resources: Monitor workload activity to
identify application components that consume the most resources. Optimize the code that runs
within these components to minimize resource usage while maximizing performance.
Optimize impact on customer devices and equipment: Understand the devices and equipment
that your customers use to consume your services, their expected lifecycle, and the financial
and sustainability impact of replacing those components. Implement software patterns and
architectures to minimize the need for customers to replace devices and upgrade equipment. For
example, implement new features using code that is backward compatible with earlier hardware
Remove unneeded or redundant data: Duplicate data only when necessary to minimize total
storage consumed. Use backup technologies that deduplicate data at the file and block level. Limit
the use of Redundant Array of Independent Drives (RAID) configurations except where required to
meet SLAs.
Use shared file systems or object storage to access common data: Adopt shared storage and
single sources of truth to avoid data duplication and reduce the total storage requirements
of your workload. Fetch data from shared storage only as needed. Detach unused volumes to release resources.

Minimize data movement across networks: Use shared storage and access data
from Regional data stores to minimize the total networking resources required to support data
movement for your workload.
Back up data only when difficult to recreate: To minimize storage consumption, only back up data
that has business value or is required to satisfy compliance requirements. Examine backup policies
and exclude ephemeral storage that doesn’t provide value in a recovery scenario.
Look for opportunities to reduce workload sustainability impacts by making changes to your
hardware management practices. Minimize the amount of hardware needed to provision and
deploy, and select the most efficient hardware and services for your individual workload.
SUS 5: How do you select and use cloud hardware and services in your architecture to
support your sustainability goals?
Look for opportunities to reduce workload sustainability impacts by making changes to your
hardware management practices. Minimize the amount of hardware needed to provision and
deploy, and select the most efficient hardware and services for your individual workload.
Use the minimum amount of hardware to meet your needs: Using the capabilities of the cloud, you
can make frequent changes to your workload implementations. Update deployed components as
your needs change.
Use instance types with the least impact: Continually monitor the release of new instance types
and take advantage of energy efficiency improvements, including those instance types designed to
support specific workloads such as machine learning training and inference, and video transcoding.
Use managed device farms for testing: Managed device farms spread the sustainability impact
of hardware manufacturing and resource usage across multiple tenants. Managed device farms
offer diverse device types so you can support earlier, less popular hardware, and avoid customer
sustainability impact from unnecessary device upgrades.
Resources
Refer to the following resources to learn more about our best practices for sustainability.
Whitepaper
• Sustainability Pillar
Video
After you have done a review, you should have a list of issues that you can prioritize based on your
business context. You will also want to take into account the impact of those issues on the day-to-
day work of your team. If you address these issues early, you could free up time to work on creating
business value rather than solving recurring problems. As you address issues, you can update your
review to see how the architecture is improving.
While the value of a review is clear after you have done one, you may find that a new team might
be resistant at first. Here are some objections that can be handled through educating the team on
the benefits of a review:
• “We are too busy!” (Often said when the team is getting ready for a significant launch.)
• If you are getting ready for a big launch, you will want it to go smoothly. The review will
permit you to understand any problems you might have missed.
• We recommend that you carry out reviews early in the product lifecycle to uncover risks and
develop a mitigation plan aligned with the feature delivery roadmap.
• “We don’t have time to do anything with the results!” (Often said when there is an immovable
event, such as the Super Bowl, that they are targeting.)
• These events can’t be moved. Do you really want to go into it without knowing the risks in
your architecture? Even if you don’t address all of these issues you can still have playbooks for
handling them if they materialize.
• “We don’t want others to know the secrets of our solution implementation!”
• If you point the team at the questions in the Well-Architected Framework, they will see that
none of the questions reveal any commercial or technical proprietary information.
As you carry out multiple reviews with teams in your organization, you might identify thematic
issues. For example, you might see that a group of teams has clusters of issues in a particular
pillar or topic. You will want to look at all your reviews in a holistic manner, and identify any
mechanisms, training, or principal engineering talks that could help address those thematic issues.
Contributors
The following individuals and organizations contributed to this document:
Document revisions
To be notified about updates to this whitepaper, subscribe to the RSS feed.
Updates for new Framework: Best practices updated with prescriptive guidance and new best practices added. New questions added to the Security and Cost Optimization pillars. (April 10, 2023)
Note
To subscribe to RSS updates, you must have an RSS plugin enabled for the browser that
you are using.
Framework versions:
• 2023-10-03 (current)
• 2023-04-10
• 2022-03-31
Everyone should understand their part in enabling business success. Have shared goals in order to
set priorities for resources. This will maximize the benefits of your efforts.
Best practices
• OPS01-BP01 Evaluate customer needs
• OPS01-BP02 Evaluate internal customer needs
• OPS01-BP03 Evaluate governance requirements
• OPS01-BP04 Evaluate compliance requirements
• OPS01-BP05 Evaluate threat landscape
• OPS01-BP06 Evaluate tradeoffs while managing benefits and risks
Involve key stakeholders, including business, development, and operations teams, to determine
where to focus efforts on external customer needs. This verifies that you have a thorough
understanding of the operations support that is required to achieve your desired business
outcomes.
Desired outcome:
Common anti-patterns:
• You have decided not to have customer support outside of core business hours, but you haven't
reviewed historical support request data. You do not know whether this will have an impact on
your customers.
• You are developing a new feature but have not engaged your customers to find out if it is desired and, if so, in what form, and you have not experimented to validate the need and the method of delivery.
• You have decided to change IP address allocations for your product teams, without consulting
them, to make managing your network easier. You do not know the impact this will have on your
product teams.
• You are implementing a new development tool but have not engaged your internal customers to
find out if it is needed or if it is compatible with their existing practices.
• You are implementing a new monitoring system but have not contacted your internal customers
to find out if they have monitoring or reporting needs that should be considered.
Benefits of establishing this best practice: Evaluating and understanding internal customer needs
informs how you prioritize your efforts to deliver business value.
Implementation guidance
• Understand business needs: Business success is created by shared goals and understanding
across stakeholders including business, development, and operations teams.
• Review business goals, needs, and priorities of internal customers: Engage key stakeholders,
including business, development, and operations teams, to discuss goals, needs, and priorities
of internal customers. This ensures that you have a thorough understanding of the operational
support that is required to achieve business and customer outcomes.
Resources
Governance is the set of policies, rules, or frameworks that a company uses to achieve its business
goals. Governance requirements are generated from within your organization. They can affect the
types of technologies you choose or influence the way you operate your workload. Incorporate
instances. If teams need system access, they are required to use AWS Systems Manager Session
Manager. The cloud operations team regularly updates governance requirements as new services
become available.
Implementation steps
1. Identify the stakeholders for your workload, including any centralized teams.
2. Work with stakeholders to identify governance requirements.
3. Once you’ve generated a list, prioritize the improvement items, and begin implementing them
into your workload.
a. Use services like AWS Config to create governance-as-code and validate that governance
requirements are followed.
b. If you use AWS Organizations, you can use service control policies (SCPs) to implement governance requirements (a sketch follows these steps).
4. Provide documentation that validates the implementation.
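As a sketch of step 3b, a service control policy can be created and attached with the AWS Organizations API. The policy content and the organizational unit ID below are placeholders, and the CloudTrail guardrail is only an example of a governance requirement, not a recommendation.

import json
import boto3

org = boto3.client("organizations")

# Hypothetical guardrail: prevent member accounts from disabling audit logging.
scp = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyStoppingCloudTrail",
            "Effect": "Deny",
            "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
            "Resource": "*",
        }
    ],
}

policy = org.create_policy(
    Name="deny-cloudtrail-tampering",
    Description="Governance requirement: audit logging must stay enabled",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(scp),
)

# Attach the policy to an organizational unit (the OU ID is a placeholder).
org.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId="ou-examplerootid111-exampleouid111",
)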
Level of effort for the implementation plan: Medium. Implementing missing governance
requirements may result in rework of your workload.
Resources
• OPS01-BP04 Evaluate compliance requirements - Compliance is like governance but comes from
outside an organization.
Related documents:
Related videos:
• AWS Management and Governance: Configuration, Compliance, and Audit - AWS Online Tech
Talks
• Your software developers and architects are unaware of the compliance framework that your
organization must adhere to.
• The yearly System and Organization Controls (SOC) 2 Type II audit is happening soon, and you are unable to verify that controls are in place.
• Evaluating and understanding the compliance requirements that apply to your workload will
inform how you prioritize your efforts to deliver business value.
• You choose the right locations and technologies that are congruent with your compliance
framework.
• Designing your workload for auditability helps you to prove you are adhering to your compliance
framework.
Implementation guidance
Implementing this best practice means that you incorporate compliance requirements into your
architecture design process. Your team members are aware of the required compliance framework.
You validate compliance in line with the framework.
Customer example
AnyCompany Retail stores credit card information for customers. Developers on the card storage
team understand that they need to comply with the PCI-DSS framework. They’ve taken steps
to verify that credit card information is stored and accessed securely in line with the PCI-DSS
framework. Every year they work with their security team to validate compliance.
Implementation steps
1. Work with your security and governance teams to determine what industry, regulatory, or
internal compliance frameworks that your workload must adhere to. Incorporate the compliance
frameworks into your workload.
a. Validate continual compliance of AWS resources with services like AWS Config and AWS Security Hub.
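To illustrate step 1a, failing compliance findings can be retrieved from AWS Security Hub for review. This minimal boto3 sketch assumes Security Hub and at least one security standard are already enabled in the account.

import boto3

securityhub = boto3.client("securityhub")

# Pull currently failing, active compliance findings so the team can review them
# against the applicable framework.
paginator = securityhub.get_paginator("get_findings")
pages = paginator.paginate(
    Filters={
        "ComplianceStatus": [{"Value": "FAILED", "Comparison": "EQUALS"}],
        "RecordState": [{"Value": "ACTIVE", "Comparison": "EQUALS"}],
    }
)

for page in pages:
    for finding in page["Findings"]:
        print(finding["Title"], finding.get("Compliance", {}).get("Status"))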
• AWS re:Invent 2020: Achieve compliance as code using AWS Config
• AWS re:Invent 2021 - Cloud compliance, assurance, and auditing
• AWS Summit ATL 2022 - Implementing compliance, assurance, and auditing on AWS (COP202)
Related examples:
Related services:
• AWS Artifact
• AWS Audit Manager
• AWS Config
• AWS Security Hub
Evaluate threats to the business (for example, competition, business risk and liabilities, operational
risks, and information security threats) and maintain current information in a risk registry. Include
the impact of risks when determining where to focus efforts.
AWS customers are eligible for a guided Well-Architected Review of their mission-critical workloads
to measure their architectures against AWS best practices. Enterprise Support customers are
eligible for an Operations Review, designed to help them to identify gaps in their approach to
operating in the cloud.
The cross-team engagement of these reviews helps to establish common understanding of your
workloads and how team roles contribute to success. The needs identified through the review can
help shape your priorities.
• Maintain a threat model: Establish and maintain a threat model identifying potential threats,
planned and in place mitigations, and their priority. Review the probability of threats manifesting
as incidents, the cost to recover from those incidents and the expected harm caused, and the cost
to prevent those incidents. Revise priorities as the contents of the threat model change.
Resources
Related documents:
Related videos:
Competing interests from multiple parties can make it challenging to prioritize efforts, build
capabilities, and deliver outcomes aligned with business strategies. For example, you may be asked
to accelerate speed-to-market for new features over optimizing IT infrastructure costs. This can
put two interested parties in conflict with one another. In these situations, decisions need to be
brought to a higher authority to resolve conflict. Data is required to remove emotional attachment
from the decision-making process.
The same challenge may occur at a tactical level. For example, the choice between using relational
or non-relational database technologies can have a significant impact on the operation of an
application. It's critical to understand the predictable results of various choices.
AWS can help you educate your teams about AWS and its services to increase their understanding
of how their choices can have an impact on your workload. Use the resources provided by AWS
Support (AWS Knowledge Center, AWS Discussion Forums, and AWS Support Center) and AWS
Documentation to educate your teams. For further questions, reach out to AWS Support.
Implementation guidance
Managing benefits and risks should be defined by a governing body that drives the requirements
for key decision-making. You want decisions to be made and prioritized based on how they
benefit the organization, with an understanding of the risks involved. Accurate information is
critical for making organizational decisions. This should be based on solid measurements and defined by common industry practices of cost-benefit analysis. To make these types of decisions,
strike a balance between centralized and decentralized authority. There is always a tradeoff, and
it's important to understand how each choice impacts defined strategies and desired business
outcomes.
Implementation steps
goals. Understanding responsibility, ownership, how decisions are made, and who has authority to
make decisions will help focus efforts and maximize the benefits from your teams.
Best practices
• OPS02-BP01 Resources have identified owners
• OPS02-BP02 Processes and procedures have identified owners
• OPS02-BP03 Operations activities have identified owners responsible for their performance
• OPS02-BP04 Mechanisms exist to manage responsibilities and ownership
• OPS02-BP05 Mechanisms exist to request additions, changes, and exceptions
• OPS02-BP06 Responsibilities between teams are predefined or negotiated
Resources for your workload must have identified owners for change control, troubleshooting,
and other functions. Owners are assigned for workloads, accounts, infrastructure, platforms, and
applications. Ownership is recorded using tools like a central register or metadata attached to
resources. The business value of components informs the processes and procedures applied to
them.
Desired outcome:
Common anti-patterns:
• The alternate contacts for your AWS accounts are not populated.
• Resources lack tags that identify what teams own them.
• You have an ITSM queue without an email mapping.
• Two teams have overlapping ownership of a critical piece of infrastructure.
a. You can use AWS Config rules to enforce that resources have the required ownership tags (a sketch follows these steps).
b. For in-depth guidance on how to build a tagging strategy for your organization, see the AWS Tagging Best Practices whitepaper.
4. Use Amazon Q Business, a conversational assistant that uses generative AI to enhance workforce
productivity, answer questions, and complete tasks based on information in your enterprise
systems.
a. Connect Amazon Q Business to your company's data source. Amazon Q Business offers
prebuilt connectors to over 40 supported data sources, including Amazon Simple Storage
Service (Amazon S3), Microsoft SharePoint, Salesforce, and Atlassian Confluence. For more
information, see Amazon Q Business connectors.
5. For other resources, platforms, and infrastructure, create documentation that identifies
ownership. This should be accessible to all team members.
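The rule referenced in step 3a might look like the following, which registers the AWS Config managed rule REQUIRED_TAGS through boto3. The rule name and the Owner tag key are assumptions for illustration.

import json
import boto3

config = boto3.client("config")

# Managed rule that flags resources missing an "Owner" tag (the tag key is an assumption).
config.put_config_rule(
    ConfigRule={
        "ConfigRuleName": "resources-must-have-owner-tag",
        "Description": "Flag resources that do not carry an Owner tag",
        "Source": {"Owner": "AWS", "SourceIdentifier": "REQUIRED_TAGS"},
        "InputParameters": json.dumps({"tag1Key": "Owner"}),
    }
)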
Level of effort for the implementation plan: Low. Leverage account contact information and tags
to assign ownership of AWS resources. For other resources you can use something as simple as a
table in a wiki to record ownership and contact information, or use an ITSM tool to map ownership.
Resources
Related documents:
Implementation guidance
• Processes and procedures have identified owners who are responsible for their definition.
• Identify the operations activities conducted in support of your workloads. Document these
activities in a discoverable location.
• Uniquely identify the individual or team responsible for the specification of an activity. They
are responsible to verify that it can be successfully performed by an adequately skilled team
member with the correct permissions, access, and tools. If there are issues with performing
that activity, the team members performing it are responsible for providing the detailed
feedback necessary for the activity to be improved.
• Capture ownership in the metadata of the activity artifact, for example in procedures automated through AWS Systems Manager documents or AWS Lambda functions. Capture resource ownership using tags or resource groups, specifying ownership and contact information. Use AWS Organizations to create tagging policies and capture ownership and contact information.
• Over time, these procedures should be evolved to be runnable as code, reducing the need for human intervention.
• For example, consider AWS Lambda functions, AWS CloudFormation templates, or AWS Systems Manager Automation documents (a sketch follows this list).
• Perform version control in appropriate repositories.
• Include suitable resource tagging so owners and documentation can readily be identified.
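As a sketch of evolving a procedure into code, the following registers a minimal AWS Systems Manager Automation runbook with boto3 and records ownership as a tag on the document. The runbook content and names are illustrative only.

import json
import boto3

ssm = boto3.client("ssm")

# Minimal Automation runbook that stops a given EC2 instance; ownership is captured
# as a tag on the document itself.
runbook = {
    "schemaVersion": "0.3",
    "description": "Stop an instance (illustrative procedure-as-code example)",
    "parameters": {"InstanceId": {"type": "String"}},
    "mainSteps": [
        {
            "name": "StopInstance",
            "action": "aws:changeInstanceState",
            "inputs": {"InstanceIds": ["{{ InstanceId }}"], "DesiredState": "stopped"},
        }
    ],
}

ssm.create_document(
    Name="ops-stop-instance",
    DocumentType="Automation",
    DocumentFormat="JSON",
    Content=json.dumps(runbook),
    Tags=[{"Key": "Owner", "Value": "cloud-operations"}],
)

Keeping the runbook definition itself under version control, alongside this registration step, satisfies the version-control and tagging points above.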
Customer example
Resources
Related documents:
Related workshops:
Related videos:
• You understand who is responsible to perform an activity, who to notify when action is needed,
and who performs the action, validates the result, and provides feedback to the owner of the
activity.
• Processes and procedures boost your efforts to operate your workloads.
• New team members become effective more quickly.
• You reduce the time it takes to mitigate incidents.
• Different teams use the same processes and procedures to perform tasks in a consistent manner.
• Teams can scale with repeatable processes.
• Standardized processes and procedures help mitigate the impact of transferring workload responsibilities between teams.
Implementation guidance
To begin to define responsibilities, start with existing documentation, like responsibility matrices,
processes and procedures, roles and responsibilities, and tools and automation. Review and host
discussions on the responsibilities for documented processes. Review with teams to identify misalignments between documented responsibilities and actual processes. Discuss the services offered with each team's internal customers to identify expectation gaps between teams.
Analyze and address the discrepancies. Identify opportunities for improvement, and look for
frequently requested, resource-intensive activities, which are typically strong candidates for
improvement. Explore best practices, patterns, and prescriptive guidance to simplify and
standardize improvements. Record improvement opportunities, and track the improvements to
completion.
Over time, these procedures should be evolved to be run as code, reducing the need for
human intervention. For example, procedures can be initiated as AWS Lambda functions, AWS
CloudFormation templates, or AWS Systems Manager Automation documents. Verify that these
procedures are version-controlled in appropriate repositories, and include suitable resource
tagging so that teams can readily identify owners and documentation. Document the responsibility
for carrying out the activities, and then monitor the automations for successful initiation and
operation, as well as performance of the desired outcomes.
Customer example
Resources
Related documents:
Related videos:
• Roles, responsibilities, and escalation paths are not discoverable, and they are not readily
available when required (for example, in response to an incident).
• When you understand who has responsibility or ownership, you can contact the proper team or
team member to make a request or transition a task.
• To reduce the risk of inaction and unaddressed needs, you have identified a person who has the
authority to assign responsibility or ownership.
• When you clearly define the scope of a responsibility, your team members gain autonomy and
ownership.
• Your responsibilities inform the decisions you make, the actions you take, and your handoff
activities to their proper owners.
• It's easy to identify abandoned responsibilities because you have a clear understanding of what
falls outside of your team's responsibility, which helps you escalate for clarification.
• Teams avoid confusion and tension, and they can more adequately manage their workloads and
resources.
Implementation guidance
Identify team members' roles and responsibilities, and verify that they understand the expectations
of their role. Make this information discoverable so that members of your organization can identify
who they need to contact for specific needs, whether it's a team or individual. As organizations
seek to capitalize on the opportunities to migrate and modernize on AWS, roles and responsibilities
might also change. Keep your teams and their members aware of their responsibilities, and train
them appropriately to carry out their tasks during this change.
Determine the role or team that should receive escalations to identify responsibility and ownership.
This team can engage with various stakeholders to come to a decision. However, they should own
the management of the decision making process.
Provide accessible mechanisms for members of your organization to discover and identify
ownership and responsibility. These mechanisms teach them who to contact for specific needs.
Customer example
Resources
Related documents:
Implementation guidance
To implement this best practice, you need to be able to request changes to processes, procedures,
and resources. The change management process can be lightweight. Document the change
management process.
Customer example
AnyCompany Retail uses a responsibility assignment (RACI) matrix to identify who owns changes
for processes, procedures, and resources. They have a documented change management process
that’s lightweight and easy to follow. Using the RACI matrix and the process, anyone can submit
change requests.
Implementation steps
1. Identify the processes, procedures, and resources for your workload and the owners for each.
Document them in your knowledge management system.
a. If you have not implemented OPS02-BP01 Resources have identified owners, OPS02-BP02
Processes and procedures have identified owners, or OPS02-BP03 Operations activities have
identified owners responsible for their performance, start with those first.
2. Work with stakeholders in your organization to develop a change management process.
The process should cover additions, changes, and exceptions for resources, processes, and
procedures.
a. You can use AWS Systems Manager Change Manager as a change management platform for
workload resources.
3. Document the change management process in your knowledge management system.
Level of effort for the implementation plan: Medium. Developing a change management process
requires alignment with multiple stakeholders across your organization.
Resources
• OPS02-BP01 Resources have identified owners - Resources need identified owners before you
build a change management process.
• The operations team needs assistance from the development team, but there is no agreed-upon response time. The request is stuck in the backlog.
Implementation guidance
Implementing this best practice means that there is no ambiguity about how teams work with
each other. Formal agreements codify how teams work together or support each other. Inter-team
communication channels are documented.
Customer example
AnyCompany Retail’s SRE team has a service level agreement with their development team.
Whenever the development team makes a request in their ticketing system, they can expect
a response within fifteen minutes. If there is a site outage, the SRE team takes the lead in the
investigation with support from the development team.
Implementation steps
1. Working with stakeholders across your organization, develop agreements between teams based
on processes and procedures.
a. If a process or procedure is shared between two teams, develop a runbook on how the teams
will work together.
b. If there are dependencies between teams, agree to a response SLA for requests.
Level of effort for the implementation plan: Medium. If there are no existing agreements
between teams, it can take effort to come to agreement with stakeholders across your
organization.
Common anti-patterns:
• There is a mandate for workload owners to migrate workloads to AWS without a clear sponsor
and plan for cloud operations. This results in teams not consciously collaborating to improve
and mature their operational capabilities. A lack of operational best practice standards overwhelms teams with operator toil, on-call burden, and technical debt, which constrains innovation.
• A new organization-wide goal has been set to adopt an emerging technology without providing a leadership sponsor or strategy. Teams interpret goals differently, which causes confusion
on where to focus efforts, why they matter, and how to measure impact. Consequently, the
organization loses momentum in adopting the technology.
Benefits of establishing this best practice: When executive sponsorship clearly communicates and
shares vision, direction, and goals, team members know what is expected of them. Individuals and
teams begin to intensely focus effort in the same direction to accomplish defined objectives when
leaders are actively engaged. As a result, the organization maximizes its ability to succeed. When
you evaluate success, you can better identify barriers to success so that they can be addressed
through intervention by the executive sponsor.
Implementation guidance
• At every phase of the cloud journey (migration, adoption, or optimization), success requires
active involvement at the highest level of leadership with a designated executive sponsor. The
executive sponsor aligns the team's mindset, skillsets, and ways of working to the defined
strategy.
• Explain the why: Bring clarity and explain the reasoning behind the vision and strategy.
• Set expectations: Define and publish goals for your organizations, including how progress and
success are measured.
c. Communicate the vision consistently to all teams and individuals responsible for parts of the
strategy.
4. Develop communication planning matrices that specify what message needs to be delivered to
specified leaders, managers, and individual contributors. Specify the person or team that should
deliver this message.
c. Accept feedback on the effectiveness of communications, and adjust the communications and
plan accordingly.
5. Actively engage each initiative from a leadership perspective to verify that all impacted teams
understand the outcomes they are accountable to achieve.
6. At every status meeting, executive sponsors should look for blockers, inspect established
metrics, anecdotes, or feedback from the teams, and measure progress towards objectives.
Resources
Related documents:
is no process to track such improvements. The organization continues to be plagued with failed
deployments impacting customers and causing further negative sentiment.
• In order to stay compliant, your infosec team oversees a long-established process to rotate
shared SSH keys regularly on behalf of operators connecting to their Amazon EC2 Linux
instances. It takes several days for the infosec teams to complete rotating keys, and you are
blocked from connecting to those instances. No one inside or outside of infosec suggests using
other options on AWS to achieve the same result.
Benefits of establishing this best practice: By decentralizing authority and empowering your teams to make key decisions, you can address issues more quickly and with increasing success rates. In addition, teams begin to realize a sense of ownership, and failures are acceptable. Experimentation becomes a cultural mainstay. Managers and directors do not feel as though every aspect of their work is micromanaged.
Implementation guidance
Escalation should be done early and often so that risks can be identified and prevented from
causing incidents. Leadership does not reprimand individuals for escalating an issue.
Desired outcome: Individuals throughout the organization are comfortable to escalate problems
to their immediate and higher levels of leadership. Leadership has deliberately and consciously
established expectations that their teams should feel safe to escalate any issue. A mechanism
exists to escalate issues at each level within the organization. When employees escalate to their
manager, they jointly decide the level of impact and whether the issue should be escalated. In
order to initiate an escalation, employees are required to include a recommended work plan to
address the issue. If direct management does not take timely action, employees are encouraged to
take issues to the highest level of leadership if they feel strongly that the risks to the organization
warrant the escalation.
Common anti-patterns:
• Executive leaders do not ask enough probing questions during your cloud transformation
program status meeting to find where issues and blockers are occurring. Only good news is
presented as status. The CIO has made it clear that she only likes to hear good news, as any
challenges brought up make the CEO think that the program is failing.
• You are a cloud operations engineer and you notice that the new knowledge management
system is not being widely adopted by application teams. The company invested one year and
several million dollars to implement this new knowledge management system, but people
are still authoring their runbooks locally and sharing them on an organizational cloud share,
making it difficult to find knowledge pertinent to supported workloads. You try to bring this
to leadership's attention, because consistent use of this system can enhance operational
efficiency. When you bring this to the director who lead the implementation of the knowledge
management system, she reprimands you because it calls the investment into question.
• The infosec team responsible for hardening compute resources has decided to put a process
in place that requires performing the scans necessary to ensure that EC2 instances are fully
secured before the compute team releases the resource for use. This has created a time delay of
an additional week for resources to be deployed, which breaks their SLA. The compute team is
afraid to escalate this to the VP over cloud because this makes the VP of information security
look bad.
b. Protect employees who escalate. Have a policy that protects team members from retribution if they escalate around a non-responsive decision maker or stakeholder. Have mechanisms in
place to identify if this is occurring and respond appropriately.
6. Leadership should periodically reemphasize the policies, standards, mechanisms, and the desire
for open escalation and continuous feedback loops without retribution.
Resources
Related documents:
• How do you foster a culture of continuous improvement and learning from Andon and escalation
systems?
• AWS DevOps Guidance | Establish clear escalation paths and encourage constructive
disagreement
Related videos:
• Toyota Production System: Stopping Production, a Button, and an Andon Electric Board
Related examples:
informed of this strategic change and thus, they are not ready with enough skilled capacity to
support a greater number of workloads lifted and shifted into AWS.
• Your organization is well-informed on new or changed strategies, and they act accordingly with
strong motivation to help each other achieve the overall objectives and metrics set by leadership.
• Mechanisms exist and are used to provide timely notice to team members of known risks and
planned events.
• New ways of working (including changes to people or the organization, processes, or
technology), along with required skills, are more effectively adopted by the organization, and
your organization realizes business benefits more quickly.
• Team members have the necessary context of the communications being received, and they can
be more effective in their jobs.
Implementation guidance
To implement this best practice, you must work with stakeholders across your organization
to agree to communication standards. Publicize those standards to your organization. For any
significant IT transitions, an established planning team can more successfully manage the impact
of change to its people than an organization that ignores this practice. Managing change can be more challenging in larger organizations, because it's critical to establish strong buy-in for a new strategy across all individual contributors. In the absence of such a transition planning team,
leadership holds 100% of the responsibility for effective communications. When establishing
a transition planning team, assign team members to work with all organizational leadership to
define and manage effective communications at every level.
Customer example
AnyCompany Retail signed up for AWS Enterprise Support and depends on other third-
party providers for its cloud operations. The company uses chat and chatops as their main
communication medium for operational activities. Alerts and other information populate specific
channels. When someone must act, they clearly state the desired outcome, and in many cases, they
receive a runbook or playbook to use. They schedule major changes to production systems with a
change calendar.
a. You can use AWS Systems Manager Documents to build playbooks and runbooks for alerts.
16.Mechanisms are in place to provide notification of risks or planned events in a clear and
actionable way with enough notice to allow appropriate responses. Use email lists or chat
channels to send notifications ahead of planned events.
a. AWS Chatbot can be used to send alerts and respond to events within your organizations
messaging platform.
17.Provide an accessible source of information where planned events can be discovered. Provide
notifications of planned events from the same system.
a. AWS Systems Manager Change Calendar can be used to create change windows when
changes can occur. This provides team members notice when they can make changes safely.
a. You can subscribe to AWS Security Bulletins to receive notifications of vulnerabilities on AWS.
19.Seek diverse opinions and perspectives: Encourage contributions from everyone. Give
communication opportunities to under-represented groups. Rotate roles and responsibilities in
meetings.
a. Expand roles and responsibilities: Provide opportunities for team members to take on roles
that they might not otherwise. They can gain experience and perspective from the role and
from interactions with new team members with whom they might not otherwise interact.
They can also bring their experience and perspective to the new role and team members
they interact with. As perspective increases, identify emergent business opportunities or new
opportunities for improvement. Rotate common tasks between members within a team that
others typically perform to understand the demands and impact of performing them.
b. Provide a safe and welcoming environment: Establish policy and controls that protect the
mental and physical safety of team members within your organization. Team members should
be able to interact without fear of reprisal. When team members feel safe and welcome, they
are more likely to be engaged and productive. The more diverse your organization, the better
your understanding can be of the people you support, including your customers. When your
team members are comfortable, feel free to speak, and are confident they are heard, they
are more likely to share valuable insights (for example, marketing opportunities, accessibility
needs, unserved market segments, and unacknowledged risks in your environment).
c. Encourage team members to participate fully: Provide the resources necessary for your
employees to participate fully in all work related activities. Team members that face daily
Experimentation is a catalyst for turning new ideas into products and features. It accelerates
learning and keeps team members interested and engaged. Team members are encouraged
to experiment often to drive innovation. Even when an undesired result occurs, there is value
in knowing what not to do. Team members are not punished for successful experiments with
undesired results.
Desired outcome:
Common anti-patterns:
• You want to run an A/B test but there is no mechanism to run the experiment. You deploy a UI
change without the ability to test it. It results in a negative customer experience.
• Your company only has a stage and production environment. There is no sandbox environment
to experiment with new features or products so you must experiment within the production
environment.
Implementation guidance
Customer example
Related videos:
• AWS On Air San Fran Summit 2022 ft. AWS AppConfig Feature Flags integration with Jira
• AWS re:Invent 2022 - A deployment is not a release: Control your launches w/feature flags
(BOA305-R)
• Set Up a Multi-Account AWS Environment that Uses Best Practices for AWS Organizations
Related examples:
Related services:
• AWS AppConfig
OPS03-BP06 Team members are encouraged to maintain and grow their skill sets
Teams must grow their skill sets to adopt new technologies, and to support changes in demand
and responsibilities in support of your workloads. Growth of skills in new technologies is frequently
a source of team member satisfaction and supports innovation. Support your team members'
pursuit and maintenance of industry certifications that validate and acknowledge their growing
skills. Cross train to promote knowledge transfer and reduce the risk of significant impact when
you lose skilled and experienced team members with institutional knowledge. Provide dedicated
structured time for learning.
Implementation guidance
To adopt new technologies, fuel innovation, and keep pace with changes in demand and
responsibilities to support your workloads, continually invest in the professional growth of your
teams.
Implementation steps
1. Use structured cloud advocacy programs: AWS Skills Guild provides consultative training to increase cloud skill confidence and ignite a culture of continuous learning.
2. Provide resources for education: Provide dedicated, structured time and access to training
materials and lab resources, and support participation in conferences and professional
organizations that provide opportunities for learning from both educators and peers. Provide
your junior team members with access to senior team members as mentors, or allow the junior
team members to shadow their seniors' work and be exposed to their methods and skills.
Encourage learning about content not directly related to work in order to have a broader
perspective.
3. Encourage use of expert technical resources: Leverage resources such as AWS re:Post to get
access to curated knowledge and a vibrant community.
4. Build and maintain an up-to-date knowledge repository: Use knowledge sharing platforms
such as wikis and runbooks. Create your own reusable expert knowledge source with AWS
re:Post Private to streamline collaboration, improve productivity, and accelerate employee
onboarding.
5. Team education and cross-team engagement: Plan for the continuing education needs of your
team members. Provide opportunities for team members to join other teams (temporarily or
permanently) to share skills and best practices benefiting your entire organization.
6. Support pursuit and maintenance of industry certifications: Support your team members in
the acquisition and maintenance of industry certifications that validate what they have learned
and acknowledge their accomplishments.
Resources
• You have appropriately staffed your team to gain the skillsets needed for them to operate
workloads in AWS in accordance with your migration plan. As your team has scaled itself up
during the course of your migration project, they have gained proficiency in the core AWS
technologies that the business plans to use when migrating or modernizing their applications.
• You have carefully aligned your staffing plan to make efficient use of resources by leveraging
automation and workflow. A smaller team can now manage more infrastructure on behalf of the
application development teams.
• With shifting operational priorities, any resource staffing constraints are proactively identified to
protect the success of business initiatives.
• Operational metrics that report operational toil (such as on-call fatigue or excessive paging) are
reviewed to verify that staff are not overwhelmed.
Common anti-patterns:
• Your staff have not ramped up on AWS skills as you close in on your multi-year cloud migration
plan, which risks support of the workloads and lowers employee morale.
• Your entire IT organization is shifting into agile ways of working. The business is prioritizing
the product portfolio and setting metrics for what features need to be developed first. Your
agile process does not require teams to assign story points to their work plans. As a result, it is impossible to know the level of capacity required for the next set of work, or whether you have the right skills assigned to the work.
• You are having an AWS partner migrate your workloads, and you don't have a support transition
plan for your teams once the partner completes the migration project. Your teams struggle to
efficiently and effectively support the workloads.
Benefits of establishing this best practice: You have appropriately-skilled team members
available in your organization to support the workloads. Resource allocation can adapt to shifting
priorities without impacting performance. The result is teams being proficient at supporting
workloads while maximizing time to focus on innovating for customers, which in turn raises
employee satisfaction.
Implementation guidance
Resource planning for your cloud migration should occur at an organizational level that aligns to
your migration plan, as well as the desired operating model being implemented to support your
Prepare
Questions
• OPS 4. How do you implement observability in your workload?
• OPS 5. How do you reduce defects, ease remediation, and improve flow into production?
• OPS 6. How do you mitigate deployment risks?
• OPS 7. How do you know that you are ready to support a workload?
Implement observability in your workload so that you can understand its state and make data-
driven decisions based on business requirements.
Best practices
• OPS04-BP01 Identify key performance indicators
• OPS04-BP02 Implement application telemetry
• OPS04-BP03 Implement user experience telemetry
• OPS04-BP04 Implement dependency telemetry
• OPS04-BP05 Implement distributed tracing
Implementing observability in your workload starts with understanding its state and making
data-driven decisions based on business requirements. One of the most effective ways to ensure
alignment between monitoring activities and business objectives is by defining and monitoring key
performance indicators (KPIs).
Desired outcome: Efficient observability practices that are tightly aligned with business objectives,
ensuring that monitoring efforts are always in service of tangible business outcomes.
Common anti-patterns:
Resources
Related documents:
Related videos:
Related examples:
Application telemetry serves as the foundation for observability of your workload. It's crucial
to emit telemetry that offers actionable insights into the state of your application and the
achievement of both technical and business outcomes. From troubleshooting to measuring the
impact of a new feature or ensuring alignment with business key performance indicators (KPIs),
application telemetry informs the way you build, operate, and evolve your workload.
Metrics, logs, and traces form the three primary pillars of observability. These serve as diagnostic
tools that describe the state of your application. Over time, they assist in creating baselines and
identifying anomalies. However, to ensure alignment between monitoring activities and business
objectives, it's pivotal to define and monitor KPIs. Business KPIs often make it easier to identify
issues compared to technical metrics alone.
resources, enhancing your understanding of how your workload operates. In tandem, AWS X-Ray
lets you trace, analyze, and debug your applications, giving you a deep understanding of your
workload's behavior. With features like service maps, latency distributions, and trace timelines,
AWS X-Ray provides insights into your workload's performance and the bottlenecks affecting it.
Implementation steps
1. Identify what data to collect: Ascertain the essential metrics, logs, and traces that would offer
substantial insights into your workload's health, performance, and behavior.
2. Deploy the CloudWatch agent: The CloudWatch agent is instrumental in procuring system
and application metrics and logs from your workload and its underlying infrastructure. The
CloudWatch agent can also be used to collect OpenTelemetry or X-Ray traces and send them to
X-Ray.
3. Implement anomaly detection for logs and metrics: Use CloudWatch Logs anomaly detection
and CloudWatch Metrics anomaly detection to automatically identify unusual activities in
your application's operations. These tools use machine learning algorithms to detect and alert
on anomalies, which enhances your monitoring capabilities and speeds up response time to
potential disruptions or security threats. Set up these features to proactively manage application
health and security.
4. Secure sensitive log data: Use Amazon CloudWatch Logs data protection to mask sensitive
information within your logs. This feature helps maintain privacy and compliance through
automatic detection and masking of sensitive data before it is accessed. Implement data
masking to securely handle and protect sensitive details such as personally identifiable
information (PII).
5. Define and monitor business KPIs: Establish custom metrics that align with your business outcomes (a sketch follows these steps).
6. Instrument your application with AWS X-Ray: In addition to deploying the CloudWatch agent,
it's crucial to instrument your application to emit trace data. This process can provide further
insights into your workload's behavior and performance.
7. Standardize data collection across your application: Standardize data collection practices
across your entire application. Uniformity aids in correlating and analyzing data, providing a
comprehensive view of your application's behavior.
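The custom metric referenced in step 5 can be as simple as publishing a business KPI to Amazon CloudWatch with boto3. The namespace, metric name, and dimension below are assumptions for illustration.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish a business KPI datapoint; an alarm or dashboard can then track it alongside
# technical metrics.
cloudwatch.put_metric_data(
    Namespace="AnyCompanyRetail/Business",
    MetricData=[
        {
            "MetricName": "OrdersPlaced",
            "Dimensions": [{"Name": "Channel", "Value": "web"}],
            "Value": 1.0,
            "Unit": "Count",
        }
    ],
)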
Gaining deep insights into customer experiences and interactions with your application is crucial.
Real user monitoring (RUM) and synthetic transactions serve as powerful tools for this purpose.
RUM provides data about real user interactions granting an unfiltered perspective of user
satisfaction, while synthetic transactions simulate user interactions, helping in detecting potential
issues even before they impact real users.
Desired outcome: A holistic view of the customer experience, proactive detection of issues, and
optimization of user interactions to deliver seamless digital experiences.
Common anti-patterns:
Benefits of establishing this best practice:
• Proactive issue detection: Identify and address potential issues before they impact real users.
• Optimized user experience: Continuous feedback from RUM aids in refining and enhancing the
overall user experience.
Resources
Related documents:
Related videos:
• Optimize applications through end user insights with Amazon CloudWatch RUM
• AWS on Air ft. Real-User Monitoring for Amazon CloudWatch
Related examples:
Dependency telemetry is essential for monitoring the health and performance of the external
services and components your workload relies on. It provides valuable insights into reachability,
timeouts, and other critical events related to dependencies such as DNS, databases, or third-
party APIs. When you instrument your application to emit metrics, logs, and traces about these
dependencies, you gain a clearer understanding of potential bottlenecks, performance issues, or
failures that might impact your workload.
Desired outcome: Ensure that the dependencies your workload relies on are performing as
expected, allowing you to proactively address issues and ensure optimal workload performance.
external databases, third-party APIs, network connectivity routes to other environments, and
DNS services. The first step towards effective dependency telemetry is being comprehensive in
understanding what those dependencies are.
2. Develop a monitoring strategy: Once you have a clear picture of your external dependencies,
architect a monitoring strategy tailored to them. This involves understanding the criticality of
each dependency, its expected behavior, and any associated service-level agreements or targets
(SLA or SLTs). Set up proactive alerts to notify you of status changes or performance deviations.
3. Use network monitoring: Use Internet Monitor and Network Monitor, which provide
comprehensive insights into global internet and network conditions. These tools help you
understand and respond to outages, disruptions, or performance degradations that affect your
external dependencies.
4. Stay informed with AWS Health Dashboard: It provides alerts and remediation guidance when
AWS is experiencing events that may impact your services.
a. Monitor AWS Health events with Amazon EventBridge rules (a sketch follows these steps), or integrate programmatically with the AWS Health API to automate actions when you receive AWS Health events. These can be
general actions, such as sending all planned lifecycle event messages to a chat interface, or
specific actions, such as the initiation of a workflow in an IT service management tool.
b. If you use AWS Organizations, aggregate AWS Health events across accounts.
5. Instrument your application with AWS X-Ray: AWS X-Ray provides insights into how
applications and their underlying dependencies are performing. By tracing requests from start
to end, you can identify bottlenecks or failures in the external services or components your
application relies on.
6. Use Amazon DevOps Guru: This machine learning-driven service identifies operational issues,
predicts when critical issues might occur, and recommends specific actions to take. It's invaluable
for gaining insights into dependencies and ensuring they're not the source of operational
problems.
7. Monitor regularly: Continually monitor metrics and logs related to external dependencies. Set
up alerts for unexpected behavior or degraded performance.
8. Validate after changes: Whenever there's an update or change in any of the external
dependencies, validate their performance and check their alignment with your application's
requirements.
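The EventBridge rule mentioned in step 4a can be created with a few API calls. In this boto3 sketch, the rule name and the SNS topic ARN are placeholders; the topic is assumed to already exist, to allow EventBridge to publish to it, and to feed your chat or ticketing integration.

import json
import boto3

events = boto3.client("events")

# Route all AWS Health events in this account and Region to an SNS topic.
events.put_rule(
    Name="aws-health-to-chat",
    EventPattern=json.dumps({"source": ["aws.health"]}),
    State="ENABLED",
)
events.put_targets(
    Rule="aws-health-to-chat",
    Targets=[
        {
            "Id": "notify-ops",
            "Arn": "arn:aws:sns:us-east-1:123456789012:ops-notifications",  # placeholder
        }
    ],
)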
Desired outcome: Achieve a holistic view of requests flowing through your distributed system,
allowing for precise debugging, optimized performance, and improved user experiences.
Common anti-patterns:
• Inconsistent instrumentation: Not all services in a distributed system are instrumented for
tracing.
• Ignoring latency: Only focusing on errors and not considering the latency or gradual
performance degradations.
Benefits of establishing this best practice:
• Comprehensive system overview: Visualizing the entire path of requests, from entry to exit.
• Enhanced debugging: Quickly identifying where failures or performance issues occur.
• Improved user experience: Monitoring and optimizing based on actual user data, ensuring the
system meets real-world demands.
Implementation guidance
Begin by identifying all of the elements of your workload that require instrumentation. Once all
components are accounted for, leverage tools such as AWS X-Ray and OpenTelemetry to gather
trace data for analysis with tools like X-Ray and Amazon CloudWatch ServiceLens Map. Engage
in regular reviews with developers, and supplement these discussions with tools like Amazon
DevOps Guru, X-Ray Analytics and X-Ray Insights to help uncover deeper findings. Establish alerts
from trace data to notify when outcomes, as defined in the workload monitoring plan, are at risk.
Implementation steps
1. Adopt AWS X-Ray: Integrate X-Ray into your application to gain insights into its behavior,
understand its performance, and pinpoint bottlenecks. Utilize X-Ray Insights for automatic trace
analysis.
2. Instrument your services: Verify that every service, from an AWS Lambda function to an EC2
instance, sends trace data. The more services you instrument, the clearer the end-to-end view.
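To illustrate step 2, a Python service can emit trace data with the AWS X-Ray SDK. This sketch assumes the X-Ray daemon (or the CloudWatch agent with traces enabled) is running and reachable; the service and segment names are illustrative.

import boto3
from aws_xray_sdk.core import xray_recorder, patch_all

patch_all()  # auto-instrument supported libraries such as boto3 and requests

xray_recorder.configure(service="checkout-service")  # service name is an assumption

# Outside of Lambda or a patched web framework, open a segment explicitly.
segment = xray_recorder.begin_segment("list-buckets")
try:
    boto3.client("s3").list_buckets()  # this downstream call is recorded as a subsegment
finally:
    xray_recorder.end_segment()

In Lambda or behind a framework middleware, the segment is created for you and only the patching and configuration lines are needed.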
Related examples:
OPS 5. How do you reduce defects, ease remediation, and improve flow into
production?
Adopt approaches that improve the flow of changes into production and that allow for refactoring, fast feedback on quality, and bug fixing. These accelerate beneficial changes entering production, limit
issues deployed, and achieve rapid identification and remediation of issues introduced through
deployment activities.
Best practices
• OPS05-BP01 Use version control
• OPS05-BP02 Test and validate changes
• OPS05-BP03 Use configuration management systems
• OPS05-BP04 Use build and deployment management systems
• OPS05-BP05 Perform patch management
• OPS05-BP06 Share design standards
• OPS05-BP07 Implement practices to improve code quality
• OPS05-BP08 Use multiple environments
• OPS05-BP09 Make frequent, small, reversible changes
• OPS05-BP10 Fully automate integration and deployment
Many AWS services offer version control capabilities. Use a revision or source control system
such as AWS CodeCommit to manage code and other artifacts, such as version-controlled AWS
CloudFormation templates of your infrastructure.
Desired outcome: Your teams collaborate on code. When merged, the code is consistent and no
changes are lost. Errors are easily reverted through correct versioning.
Common anti-patterns:
Every change deployed must be tested to avoid errors in production. This best practice is focused
on testing changes from version control to artifact build. Besides application code changes, testing
should include infrastructure, configuration, security controls, and operations procedures. Testing
takes many forms, from unit tests to software component analysis (SCA). Moving tests further to the left in the software integration and delivery process results in higher certainty of artifact quality.
Your organization must develop testing standards for all software artifacts. Automated tests
reduce toil and avoid manual test errors. Manual tests may be necessary in some cases. Developers
must have access to automated test results to create feedback loops that improve software quality.
Desired outcome: Your software changes are tested before they are delivered. Developers have
access to test results and validations. Your organization has a testing standard that applies to all
software changes.
Common anti-patterns:
• You deploy a new software change without any tests. It fails to run in production, which leads to
an outage.
• New security groups are deployed with AWS CloudFormation without being tested in a pre-
production environment. The security groups make your app unreachable for your customers.
• A method is modified but there are no unit tests. The software fails when it is deployed to
production.
Benefits of establishing this best practice: The change fail rate of software deployments is reduced. Software quality is improved. Developers have increased awareness of the viability of their code. Security policies can be rolled out with confidence to support your organization's compliance. Infrastructure changes such as automatic scaling policy updates are tested in advance to meet traffic needs.
Implementation guidance
Testing is done on all changes, from application code to infrastructure, as part of your continuous
integration practice. Test results are published so that developers have fast feedback. Your
organization has a testing standard that all changes must pass.
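As an illustration, a team's testing standard might require unit tests that run on every commit in the CI pipeline. The function and tests below are a hypothetical pytest sketch; a full standard would also cover integration tests, SCA, and infrastructure tests.

```python
# Hypothetical example of a unit test suite that runs in CI on every change.
# The discount logic and thresholds are illustrative only.
import pytest

def apply_discount(price: float, percent: float) -> float:
    if not 0 <= percent <= 100:
        raise ValueError('percent must be between 0 and 100')
    return round(price * (1 - percent / 100), 2)

def test_apply_discount_happy_path():
    assert apply_discount(100.0, 15) == 85.0

def test_apply_discount_rejects_invalid_percent():
    with pytest.raises(ValueError):
        apply_discount(100.0, 150)
```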
Related documents:
Related videos:
Desired outcome: You configure, validate, and deploy as part of your continuous integration,
continuous delivery (CI/CD) pipeline. You monitor to validate configurations are correct. This
minimizes any impact to end users and customers.
Common anti-patterns:
• You manually update the web server configuration across your fleet and a number of servers
become unresponsive due to update errors.
• You manually update your application server fleet over the course of many hours. The
inconsistency in configuration during the change causes unexpected behaviors.
• Someone has updated your security groups and your web servers are no longer accessible. Without knowledge of what was changed, you spend significant time investigating the issue, extending your time to recovery.
• You push a pre-production configuration into production through CI/CD without validation. You
expose users and customers to incorrect data and services.
Benefits of establishing this best practice: Adopting configuration management systems reduces the level of effort to make and track changes, and the frequency of errors caused by manual procedures. Configuration management systems provide assurances with regard to governance, compliance, and regulatory requirements.
Implementation guidance
Configuration management systems are used to track and implement changes to application and
environment configurations. Configuration management systems are also used to reduce errors
caused by manual processes, make configuration changes repeatable and auditable, and reduce the
level of effort.
On AWS, you can use AWS Config to continually monitor your AWS resource configurations
across accounts and Regions. It helps you to track their configuration history, understand how a
configuration change would affect other resources, and audit them against expected or desired
configurations using AWS Config Rules and AWS Config Conformance Packs.
For dynamic configurations in your applications running on Amazon EC2 instances, AWS Lambda,
containers, mobile applications, or IoT devices, you can use AWS AppConfig to configure, validate,
deploy, and monitor them across your environments.
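As a sketch, an application running on Amazon EC2, Lambda, or in a container might poll AWS AppConfig for its dynamic configuration rather than hard-coding it. The application, environment, and profile identifiers below are placeholders.

```python
# Minimal sketch: retrieve a dynamic configuration from AWS AppConfig.
# Application, environment, and profile identifiers are placeholders.
import json
import boto3

appconfig = boto3.client('appconfigdata')

# Start a configuration session, then poll for the latest configuration.
session = appconfig.start_configuration_session(
    ApplicationIdentifier='example-app',
    EnvironmentIdentifier='production',
    ConfigurationProfileIdentifier='feature-flags',
)
token = session['InitialConfigurationToken']

response = appconfig.get_latest_configuration(ConfigurationToken=token)
config = response['Configuration'].read()  # empty if unchanged since the last poll
if config:
    flags = json.loads(config)
    print('Loaded configuration:', flags)
```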
Related videos:
• AWS re:Invent 2022 - Proactive governance and compliance for AWS workloads
Use build and deployment management systems. These systems reduce errors caused by manual
processes and reduce the level of effort to deploy changes.
In AWS, you can build continuous integration/continuous deployment (CI/CD) pipelines using
services such as AWS Developer Tools (for example, AWS CodeCommit, AWS CodeBuild, AWS
CodePipeline, AWS CodeDeploy, and AWS CodeStar).
Desired outcome: Your build and deployment management systems support your organization's continuous integration and continuous delivery (CI/CD) system, which provides capabilities for automating safe rollouts with the correct configurations.
Common anti-patterns:
• After compiling your code on your development system, you copy the executable onto your production systems and it fails to start. The local log files indicate that it has failed due to missing dependencies.
• You successfully build your application with new features in your development environment and
provide the code to quality assurance (QA). It fails QA because it is missing static assets.
• On Friday, after much effort, you successfully built your application manually in your
development environment including your newly coded features. On Monday, you are unable to
repeat the steps that allowed you to successfully build your application.
• You perform the tests you have created for your new release. Then you spend the next week
setting up a test environment and performing all the existing integration tests followed by
the performance tests. The new code has an unacceptable performance impact and must be
redeveloped and then retested.
Benefits of establishing this best practice: By providing mechanisms to manage build and deployment activities, you reduce the level of effort to perform repetitive tasks and free your team
Resources
Related documents:
Related videos:
• AWS re:Invent 2022 - AWS Well-Architected best practices for DevOps on AWS
Perform patch management to gain features, address issues, and remain compliant with governance. Automate patch management to reduce errors caused by manual processes, to scale, and to reduce the level of effort to patch.
Patch and vulnerability management are part of your benefit and risk management activities. It is
preferable to have immutable infrastructures and deploy workloads in verified known good states.
Where that is not viable, patching in place is the remaining option.
Amazon EC2 Image Builder provides pipelines to update machine images. As part of patch management, consider updating Amazon Machine Images (AMIs) with an AMI image pipeline or container images with a Docker image pipeline, while AWS Lambda provides patterns for custom runtimes and additional libraries to remove vulnerabilities.
You should manage updates to Amazon Machine Images for Linux or Windows Server images using
Amazon EC2 Image Builder. You can use Amazon Elastic Container Registry (Amazon ECR) with
Implementation guidance
Patch systems to remediate issues, to gain desired features or capabilities, and to remain compliant
with governance policy and vendor support requirements. In immutable systems, deploy with the
appropriate patch set to achieve the desired result. Automate the patch management mechanism
to reduce the elapsed time to patch, to avoid errors caused by manual processes, and lower the
level of effort to patch.
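For example, AWS Systems Manager Patch Manager can apply a patch baseline across a fleet. The sketch below targets instances by a hypothetical PatchGroup tag and runs the AWS-RunPatchBaseline document; in practice you would usually schedule this through a maintenance window.

```python
# Minimal sketch: trigger a patch run with Systems Manager Run Command.
# Assumes instances are managed by SSM; the PatchGroup=web tag is illustrative.
import boto3

ssm = boto3.client('ssm')

response = ssm.send_command(
    Targets=[{'Key': 'tag:PatchGroup', 'Values': ['web']}],
    DocumentName='AWS-RunPatchBaseline',
    Parameters={'Operation': ['Install']},  # use 'Scan' to report without installing
    MaxConcurrency='25%',   # patch a quarter of the fleet at a time
    MaxErrors='5%',         # stop if the error rate exceeds this threshold
)
print('Command ID:', response['Command']['CommandId'])
```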
Implementation steps
2. Choose a recipe:
5. Review settings.
Common anti-patterns:
• Two development teams have each created a user authentication service. Your users must maintain a separate set of credentials for each part of the system they want to access.
• Each team manages their own infrastructure. A new compliance requirement forces a change to
your infrastructure and each team implements it in a different way.
Benefits of establishing this best practice: Using shared standards supports the adoption of best
practices and maximizes the benefits of development efforts. Documenting and updating design
standards keeps your organization up-to-date with best practices and security and compliance
requirements.
Implementation guidance
Share existing best practices, design standards, checklists, operating procedures, guidance, and
governance requirements across teams. Have procedures to request changes, additions, and
exceptions to design standards to support improvement and innovation. Make sure teams are aware of
published content. Have a mechanism to keep design standards up-to-date as new best practices
emerge.
Customer example
AnyCompany Retail has a cross-functional architecture team that creates software architecture
patterns. This team builds the architecture with compliance and governance built in. Teams that
adopt these shared standards get the benefits of having compliance and governance built in. They
can quickly build on top of the design standard. The architecture team meets quarterly to evaluate
architecture patterns and update them if necessary.
Implementation steps
1. Identify a cross-functional team that owns developing and updating design standards. This team
should work with stakeholders across your organization to develop design standards, operating
procedures, checklists, guidance, and governance requirements. Document the design standards
and share them within your organization.
a. AWS Service Catalog can be used to create portfolios representing design standards using
infrastructure as code. You can share portfolios across accounts.
2. Have a mechanism in place to keep design standards up-to-date as new best practices are
identified.
Related examples:
Related services:
Implement practices to improve code quality and minimize defects. Some examples include test-
driven development, code reviews, standards adoption, and pair programming. Incorporate these
practices into your continuous integration and delivery process.
Desired outcome: Your organization uses best practices like code reviews or pair programming to
improve code quality. Developers and operators adopt code quality best practices as part of the
software development lifecycle.
Common anti-patterns:
• You commit code to the main branch of your application without a code review. The change
automatically deploys to production and causes an outage.
• A new application is developed without any unit, end-to-end, or integration tests. There is no
way to test the application before deployment.
• Your teams make manual changes in production to address defects. Changes do not go through
testing or code reviews and are not captured or logged through continuous integration and
delivery processes.
Benefits of establishing this best practice: By adopting practices to improve code quality, you can
help minimize issues introduced to production. Code quality facilitates the use of best practices like
pair programming, code reviews, and implementation of AI productivity tools.
Related documents:
Related videos:
Implementation guidance
Use multiple environments, and provide developers with sandbox environments that have minimized controls to aid in experimentation. Provide individual development environments to help people work in parallel, increasing development agility. Implement progressively more rigorous controls in the environments approaching production, which gives developers freedom to innovate in earlier environments with reduced risk. Use infrastructure as code and configuration management systems to deploy environments that are configured consistently with the controls present in production, so that systems operate as expected when deployed. When environments are not in use, turn them off to avoid costs associated with idle resources (for example, development systems on evenings and weekends). Deploy production-equivalent environments when load testing to obtain valid results.
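As a sketch of turning off idle environments, the script below stops development instances identified by a hypothetical Environment=dev tag; it could run on an evening schedule, for example from an EventBridge-scheduled Lambda function.

```python
# Minimal sketch: stop tagged development instances outside working hours.
# The Environment=dev tag is an illustrative convention, not an AWS default.
import boto3

ec2 = boto3.client('ec2')

reservations = ec2.describe_instances(
    Filters=[
        {'Name': 'tag:Environment', 'Values': ['dev']},
        {'Name': 'instance-state-name', 'Values': ['running']},
    ]
)['Reservations']

instance_ids = [
    instance['InstanceId']
    for reservation in reservations
    for instance in reservation['Instances']
]

if instance_ids:
    ec2.stop_instances(InstanceIds=instance_ids)
    print('Stopping:', instance_ids)
```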
Resources
Related documents:
Frequent, small, and reversible changes reduce the scope and impact of a change. Use them in conjunction with change management systems, configuration management systems, and build and delivery systems. This results in more effective troubleshooting and faster remediation, with the option to roll back changes.
Common anti-patterns:
• You deploy a new version of your application quarterly, with a change window during which a core service is turned off.
• You frequently make changes to your database schema without tracking changes in your
management systems.
• You perform manual in-place updates, overwriting existing installations and configurations, and
have no clear roll-back plan.
Processes are repeatable and are standardized across teams. Developers are free to focus on
development and code pushes, increasing productivity.
Common anti-patterns:
• On Friday, you finish authoring the new code for your feature branch. On Monday, after running
your code quality test scripts and each of your unit tests scripts, you check in your code for the
next scheduled release.
• You are assigned to code a fix for a critical issue impacting a large number of customers in
production. After testing the fix, you commit your code and email change management to
request approval to deploy it to production.
• As a developer, you log into the AWS Management Console to create a new development
environment using non-standard methods and systems.
Benefits of establishing this best practice: By implementing automated build and deployment management systems, you reduce errors caused by manual processes and reduce the effort to deploy changes, helping your team members focus on delivering business value. You increase the speed of delivery as you promote through to production.
Implementation guidance
You use build and deployment management systems to track and implement change, to reduce
errors caused by manual processes, and reduce the level of effort. Fully automate the integration
and deployment pipeline from code check-in through build, testing, deployment, and validation.
This reduces lead time, encourages increased frequency of change, reduces the level of effort,
increases the speed to market, results in increased productivity, and increases the security of your
code as you promote through to production.
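The sketch below shows one way to start a release and inspect stage status programmatically with AWS CodePipeline. The pipeline name is a placeholder, and in a fully automated setup executions are normally started by a source change rather than by hand.

```python
# Minimal sketch: start a CodePipeline execution and report stage status.
# 'example-pipeline' is a placeholder name.
import boto3

codepipeline = boto3.client('codepipeline')

execution = codepipeline.start_pipeline_execution(name='example-pipeline')
print('Started execution:', execution['pipelineExecutionId'])

state = codepipeline.get_pipeline_state(name='example-pipeline')
for stage in state['stageStates']:
    status = stage.get('latestExecution', {}).get('status', 'UNKNOWN')
    print(f"{stage['stageName']}: {status}")
```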
Resources
Related documents:
• You performed a deployment and your application has become unstable, but there appear to be active users on the system. You have to decide whether to roll back the change and impact the active users, or wait to roll back the change knowing the users may be impacted regardless.
• After making a routine change, your new environments are accessible, but one of your subnets has become unreachable. You have to decide whether to roll back everything or try to fix the inaccessible subnet. While you are making that determination, the subnet remains unreachable.
• Your systems are not architected in a way that allows them to be updated with smaller releases.
As a result, you have difficulty in reversing those bulk changes during a failed deployment.
• You do not use infrastructure as code (IaC) and you made manual updates to your infrastructure
that resulted in an undesired configuration. You are unable to effectively track and revert the
manual changes.
• Because you have not measured increased frequency of your deployments, your team is not
incentivized to reduce the size of their changes and improve their rollback plans for each change,
leading to more risk and increased failure rates.
• You do not measure the total duration of an outage caused by unsuccessful changes. Your team
is unable to prioritize and improve its deployment process and recovery plan effectiveness.
Benefits of establishing this best practice: Having a plan to recover from unsuccessful changes
minimizes the mean time to recover (MTTR) and reduces your business impact.
Implementation guidance
A consistent, documented policy and practice adopted by release teams allows an organization
to plan what should happen if unsuccessful changes occur. The policy should allow for fixing
forward in specific circumstances. In either situation, a fix forward or rollback plan should be well
documented and tested before deployment to live production so that the time it takes to revert a
change is minimized.
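One hedged illustration of a documented rollback plan: keep the last known good infrastructure template versioned in source control or Amazon S3, and redeploy it if success criteria are not met. The stack name and template URL below are placeholders, and your plan might equally rely on a deployment service's built-in rollback.

```python
# Minimal sketch: roll back a stack by redeploying the last known good template.
# Stack name and template URL are placeholders for versioned artifacts you control.
import boto3

cloudformation = boto3.client('cloudformation')

cloudformation.update_stack(
    StackName='example-workload',
    TemplateURL='https://s3.amazonaws.com/example-bucket/templates/workload-v41.yaml',
    Capabilities=['CAPABILITY_NAMED_IAM'],
)

# Wait for the rollback deployment to finish before declaring recovery complete.
waiter = cloudformation.get_waiter('stack_update_complete')
waiter.wait(StackName='example-workload')
print('Rollback deployment complete')
```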
Implementation steps
1. Document the policies that require teams to have effective plans to reverse changes within a
specified period.
a. Policies should specify when a fix-forward situation is allowed.
b. Require a documented rollback plan to be accessible by all involved.
Test release procedures in pre-production by using the same deployment configuration, security
controls, steps, and procedures as in production. Validate that all deployed steps are completed
as expected, such as inspecting files, configurations, and services. Further test all changes with
functional, integration, and load tests, along with any monitoring such as health checks. By doing
these tests, you can identify deployment issues early with an opportunity to plan and mitigate
them prior to production.
You can create temporary parallel environments for testing every change. Automate the deployment of the test environments using infrastructure as code (IaC) to help reduce the amount of work involved and ensure stability, consistency, and faster feature delivery.
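For example, a temporary test environment can be created from the same IaC template used in production, exercised, and then deleted. The stack and template names below are placeholders; AWS CloudFormation, the AWS CDK, or similar tools all support this pattern.

```python
# Minimal sketch: create a temporary test stack, run tests, then tear it down.
# Stack name and template URL are placeholders.
import boto3

cloudformation = boto3.client('cloudformation')
stack_name = 'example-workload-test-pr-123'

cloudformation.create_stack(
    StackName=stack_name,
    TemplateURL='https://s3.amazonaws.com/example-bucket/templates/workload.yaml',
    Capabilities=['CAPABILITY_NAMED_IAM'],
)
cloudformation.get_waiter('stack_create_complete').wait(StackName=stack_name)

try:
    # Placeholder for functional, integration, and load tests against the stack.
    print('Run tests against', stack_name)
finally:
    cloudformation.delete_stack(StackName=stack_name)
    cloudformation.get_waiter('stack_delete_complete').wait(StackName=stack_name)
```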
Desired outcome: Your organization adopts a test-driven development culture that includes
testing deployments. This ensures teams are focused on delivering business value rather than
managing releases. Teams are engaged early upon identification of deployment risks to determine
the appropriate course of mitigation.
Common anti-patterns:
• During production releases, untested deployments cause frequent issues that require
troubleshooting and escalation.
• Your release contains infrastructure as code (IaC) that updates existing resources. You are unsure
if the IaC runs successfully or causes impact to the resources.
• You deploy a new feature to your application. It doesn't work as intended and there is no
visibility until it gets reported by impacted users.
• You update your certificates. You accidentally install the certificates to the wrong components,
which goes undetected and impacts website visitors because a secure connection to the website
can't be established.
Resources
Related documents:
Related videos:
Related examples:
Safe production roll-outs control the flow of beneficial changes with an aim to minimize any
perceived impact for customers from those changes. The safety controls provide inspection
mechanisms to validate desired outcomes and limit the scope of impact from any defects
introduced by the changes or from deployment failures. Safe roll-outs may include strategies such
as feature-flags, one-box, rolling (canary releases), immutable, traffic splitting, and blue/green
deployments.
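As an illustrative sketch, a blue/green roll-out with AWS CodeDeploy can be started from a pipeline or script. The application, deployment group, and revision location below are placeholders; the deployment group itself defines the blue/green and traffic-shifting behavior.

```python
# Minimal sketch: start a CodeDeploy deployment of a revision stored in Amazon S3.
# Application, deployment group, bucket, and key names are placeholders.
import boto3

codedeploy = boto3.client('codedeploy')

deployment = codedeploy.create_deployment(
    applicationName='example-app',
    deploymentGroupName='example-blue-green-group',
    revision={
        'revisionType': 'S3',
        's3Location': {
            'bucket': 'example-artifacts',
            'key': 'releases/example-app-1.4.2.zip',
            'bundleType': 'zip',
        },
    },
    description='Safe roll-out of release 1.4.2',
)
print('Deployment ID:', deployment['deploymentId'])
```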
Implementation steps
1. Use an approval workflow to initiate the sequence of production roll-out steps upon promotion to production.
2. Use an automated deployment system such as AWS CodeDeploy. AWS CodeDeploy deployment
options include in-place deployments for EC2/On-Premises and blue/green deployments for
EC2/On-Premises, AWS Lambda, and Amazon ECS (see the preceding workflow diagram).
a. Where applicable, integrate AWS CodeDeploy with other AWS services or integrate AWS CodeDeploy with partner products and services.
3. Use blue/green deployments for databases such as Amazon Aurora and Amazon RDS.
4. Monitor deployments using Amazon CloudWatch, AWS CloudTrail, and Amazon Simple
Notification Service (Amazon SNS) event notifications.
To increase the speed, reliability, and confidence of your deployment process, have a strategy
for automated testing and rollback capabilities in pre-production and production environments.
Automate testing when deploying to production to simulate human and system interactions
that verify the changes being deployed. Automate rollback to quickly revert to a previous known good state. The rollback should be initiated automatically on pre-defined conditions, such as when the desired outcome of your change is not achieved or when the automated test fails.
Automating these two activities improves your success rate for your deployments, minimizes
recovery time, and reduces the potential impact to the business.
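As a sketch of wiring rollback criteria to monitoring, an AWS CodeDeploy deployment group can be configured to stop and roll back automatically when a CloudWatch alarm fires or the deployment fails. The application, deployment group, and alarm names below are placeholders.

```python
# Minimal sketch: configure automatic rollback on failure or on an alarm breach.
# Application, deployment group, and alarm names are placeholders.
import boto3

codedeploy = boto3.client('codedeploy')

codedeploy.update_deployment_group(
    applicationName='example-app',
    currentDeploymentGroupName='example-blue-green-group',
    alarmConfiguration={
        'enabled': True,
        'alarms': [{'name': 'example-app-5xx-rate'}],
    },
    autoRollbackConfiguration={
        'enabled': True,
        'events': ['DEPLOYMENT_FAILURE', 'DEPLOYMENT_STOP_ON_ALARM'],
    },
)
print('Automatic rollback configured')
```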
Desired outcome: Your automated tests and rollback strategies are integrated into your
continuous integration, continuous delivery (CI/CD) pipeline. Your monitoring is able to validate
against your success criteria and initiate automatic rollback upon failure. This minimizes any
impact to end users and customers. For example, when all testing outcomes have been satisfied,
you promote your code into the production environment where automated regression testing is
initiated, leveraging the same test cases. If regression test results do not match expectations, then
automated rollback is initiated in the pipeline workflow.
Common anti-patterns:
• Your systems are not architected in a way that allows them to be updated with smaller releases.
As a result, you have difficulty in reversing those bulk changes during a failed deployment.
• Your deployment process consists of a series of manual steps. After you deploy changes to your
workload, you start post-deployment testing. After testing, you realize that your workload is
inoperable and customers are disconnected. You then begin rolling back to the previous version.
All of these manual steps delay overall system recovery and cause a prolonged impact to your
customers.
• You spent time developing automated test cases for functionality that is not frequently used in
your application, minimizing the return on investment in your automated testing capability.
• Your release comprises application, infrastructure, patch, and configuration updates that are independent from one another. However, you have a single CI/CD pipeline that delivers all changes at once. A failure in one component forces you to revert all changes, making your rollback complex and inefficient.
• Your team completes the coding work in sprint one and begins sprint two work, but your plan
did not include testing until sprint three. As a result, automated tests revealed defects from
3. Decide which test cases you wish to automate and which should be performed manually. These can be defined based on the business-value priority of the feature being tested. Align all team members to this plan and verify accountability for performing manual tests.
a. Apply automated testing capabilities to specific test cases that make sense for automation,
such as repeatable or frequently run cases, those that require repetitive tasks, or those that
are required across multiple configurations.
b. Define test automation scripts as well as the success criteria in the automation tool so
continued workflow automation can be initiated when specific cases fail.
4. Prioritize test automation to drive consistent results with thorough test case development where
complexity and human interaction have a higher risk of failure.
5. Integrate your automated testing and rollback tools into your CI/CD pipeline.
a. Develop clear success criteria for your changes.
b. Monitor and observe to detect these criteria and automatically reverse changes when specific
rollback criteria are met.
6. Perform different types of automated production testing, such as:
a. A/B testing to show results in comparison to the current version between two user testing groups.
b. Canary testing that allows you to roll out your change to a subset of users before releasing it
to all.
c. Feature-flag testing which allows a single feature of the new version at a time to be flagged
on and off from outside the application so that each new feature can be validated one at a
time.
7. Monitor the operational aspects of the application, transactions, and interactions with other
applications and components. Develop reports to show success of changes by workload so that
you can identify what parts of the automation and workflow can be further optimized.
a. Develop test result reports that help you make quick decisions on whether or not rollback
procedures should be invoked.
b. Implement a strategy that allows for automated rollback based upon pre-defined failure
conditions that result from one or more of your test methods.
8. Develop your automated test cases to allow for reusability across future repeatable changes.
Have a mechanism to validate that you have the appropriate number of trained personnel to
support the workload. They must be trained on the platform and services that make up your
workload. Provide them with the knowledge necessary to operate the workload. You must have
enough trained personnel to support the normal operation of the workload and troubleshoot any
incidents that occur. Have enough personnel so that you can rotate during on-call and vacations to
avoid burnout.
Desired outcome:
• There are enough trained personnel to support the workload at times when the workload is
available.
• You provide training for your personnel on the software and services that make up your
workload.
Common anti-patterns:
• Deploying a workload without team members trained to operate the platform and services in
use.
• Not having enough personnel to support on-call rotations or personnel taking time off.
Implementation guidance
Validate that there are sufficient trained personnel to support the workload. Verify that you have
enough team members to cover normal operational activities, including on-call rotations.
Customer example
An operational readiness review (ORR) is a review and inspection process that uses a checklist of requirements. An ORR is a self-service experience that teams use to certify their workloads. ORRs include best practices from lessons learned from our years of building software.
Run ORRs before a workload launches to general availability and then throughout the software
development lifecycle. Running the ORR before launch increases your ability to operate the
workload safely. Periodically re-run your ORR on the workload to catch any drift from best
practices. You can have ORR checklists for new service launches and ORRs for periodic reviews. This helps you stay up to date on new best practices that arise and incorporate lessons learned from post-incident analysis. As your use of the cloud matures, you can build ORR requirements into
your architecture as defaults.
Desired outcome: You have an ORR checklist with best practices for your organization. ORRs are
conducted before workloads launch. ORRs are run periodically over the course of the workload
lifecycle.
Common anti-patterns:
4. Identify one workload to conduct the ORR on. A pre-launch workload or an internal workload is
ideal.
5. Run through the ORR checklist and take note of any discoveries made. A discovery might be acceptable if a mitigation is in place. For any discovery that lacks a mitigation, add it to your backlog of items and implement it before launch.
6. Continue to add best practices and requirements to your ORR checklist over time.
AWS Support customers with Enterprise Support can request the Operational Readiness Review
Workshop from their Technical Account Manager. The workshop is an interactive working
backwards session to develop your own ORR checklist.
Level of effort for the implementation plan: High. Adopting an ORR practice in your organization
requires executive sponsorship and stakeholder buy-in. Build and update the checklist with inputs
from across your organization.
Resources
As your organization matures, begin automating runbooks. Start with runbooks that are short and
frequently used. Use scripting languages to automate steps or make steps easier to perform. As
you automate the first few runbooks, you'll dedicate time to automating more complex runbooks.
Over time, most of your runbooks should be automated in some way.
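For example, a frequently used runbook step such as restarting an impaired instance can be delegated to an AWS Systems Manager Automation runbook. The instance ID below is a placeholder, and AWS-RestartEC2Instance is one of the AWS-managed runbooks.

```python
# Minimal sketch: run an automated runbook step with Systems Manager Automation.
# The instance ID is a placeholder.
import boto3

ssm = boto3.client('ssm')

execution = ssm.start_automation_execution(
    DocumentName='AWS-RestartEC2Instance',       # AWS-managed automation runbook
    Parameters={'InstanceId': ['i-0123456789abcdef0']},
)
execution_id = execution['AutomationExecutionId']

status = ssm.get_automation_execution(AutomationExecutionId=execution_id)
print('Automation status:', status['AutomationExecution']['AutomationExecutionStatus'])
```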
Desired outcome: Your team has a collection of step-by-step guides for performing workload
tasks. The runbooks contain the desired outcome, necessary tools and permissions, and
instructions for error handling. They are stored in a central location (version control system) and
updated frequently. For example, your runbooks provide capabilities for your teams to monitor,
communicate, and respond to AWS Health events for critical accounts during application alarms,
operational issues, and planned lifecycle events.
Common anti-patterns:
Implementation guidance
Runbooks can take several forms depending on the maturity level of your organization. At a
minimum, they should consist of a step-by-step text document. The desired outcome should
be clearly indicated. Clearly document necessary special permissions or tools. Provide detailed
guidance on error handling and escalations in case something goes wrong. List the runbook
owner and publish it in a central location. Once your runbook is documented, validate it by having
someone else on your team run it. As procedures evolve, update your runbooks in accordance with
your change management process.
5. Give the runbook to a team member. Have them use the runbook to validate the steps. If
something is missing or needs clarity, update the runbook.
6. Publish the runbook to your internal documentation store. Once published, tell your team and
other stakeholders.
7. Over time, you'll build a library of runbooks. As that library grows, start working to automate
runbooks.
Level of effort for the implementation plan: Low. The minimum standard for a runbook is a step-
by-step text guide. Automating runbooks can increase the implementation effort.
Resources
Related documents:
Related videos:
• AWS re:Invent 2019: DIY guide to runbooks, incident reports, and incident response
• How to automate IT Operations on AWS | Amazon Web Services
• Integrate Scripts into AWS Systems Manager
Related examples:
Desired outcome: Your organization has playbooks for common incidents. The playbooks are
stored in a central location and available to your team members. Playbooks are updated frequently.
For any known root causes, companion runbooks are built.
Common anti-patterns:
• New team members learn how to investigate issues through trial and error.
• Best practices for investigating issues are not shared across teams.
Benefits of establishing this best practice:
• Different team members can use the same playbook to identify a root cause in a consistent manner.
• Known root causes can have runbooks developed for them, speeding up recovery time.
Implementation guidance
How you build and use playbooks depends on the maturity of your organization. If you are new
to the cloud, build playbooks in text form in a central document repository. As your organization
matures, playbooks can become semi-automated with scripting languages like Python. These
scripts can be run inside a Jupyter notebook to speed up discovery. Advanced organizations have
fully automated playbooks for common issues that are auto-remediated with runbooks.
Start building your playbooks by listing common incidents that happen to your workload. To start, choose playbooks for incidents that are low risk and where the root cause has been narrowed down to a few issues. After you have playbooks for simpler scenarios, move on to the higher-risk scenarios or scenarios where the root cause is not well known.
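As a sketch of a semi-automated playbook step, the snippet below (which could live in a Jupyter notebook) pulls recent change events from AWS CloudTrail to answer the common investigative question of what changed before the incident. The lookback window and event name filter are illustrative.

```python
# Minimal sketch: a playbook step that lists recent security group changes.
# The lookback window and event name are illustrative choices.
from datetime import datetime, timedelta, timezone
import boto3

cloudtrail = boto3.client('cloudtrail')
now = datetime.now(timezone.utc)

events = cloudtrail.lookup_events(
    LookupAttributes=[{'AttributeKey': 'EventName',
                       'AttributeValue': 'AuthorizeSecurityGroupIngress'}],
    StartTime=now - timedelta(hours=2),
    EndTime=now,
)['Events']

for event in events:
    print(event['EventTime'], event.get('Username', 'unknown'), event['EventName'])
```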
2. Identify a common issue that requires investigation. This should be a scenario where the root
cause is limited to a few issues and resolution is low risk.
3. Using the Markdown template, fill in the Playbook Name section and the fields under Playbook
Info.
4. Fill in the troubleshooting steps. Be as clear as possible on what actions to perform or what
areas you should investigate.
5. Give a team member the playbook and have them go through it to validate it. If there's anything
missing or something isn't clear, update the playbook.
6. Publish your playbook in your document repository and inform your team and any stakeholders.
7. This playbook library will grow as you add more playbooks. Once you have several playbooks,
start automating them using tools like AWS Systems Manager Automations to keep automation
and playbooks in sync.
Level of effort for the implementation plan: Low. Your playbooks should be text documents
stored in a central location. More mature organizations will move towards automating playbooks.
Resources
Related documents:
Related videos:
• Making changes to your production environment that are out of compliance with governance
requirements.
• Deploying a new version of your workload without establishing a baseline for resource
utilization.
Implementation guidance
Use pre-mortems to develop processes for unsuccessful changes. Document your processes for
unsuccessful changes. Ensure that all changes comply with governance. Evaluate the benefits and
risks to deploying changes to your workload.
Customer example
AnyCompany Retail regularly conducts pre-mortems to validate their processes for unsuccessful
changes. They document their processes in a shared Wiki and update it frequently. All changes
comply with governance requirements.
Implementation steps
1. Make informed decisions when deploying changes to your workload. Establish and review
criteria for a successful deployment. Develop scenarios or criteria that would initiate a rollback
of a change. Weigh the benefits of deploying changes against the risks of an unsuccessful
change.
3. Use pre-mortems to plan for unsuccessful changes and document mitigation strategies. Run a
table-top exercise to model an unsuccessful change and validate roll-back procedures.
Level of effort for the implementation plan: Moderate. Implementing a practice of pre-mortems requires coordination and effort from stakeholders across your organization.
• A developer that was the primary point of contact for a software vendor left the company.
You are not able to reach the vendor support directly. You must spend time researching and
navigating generic contact systems, increasing the time required to respond when needed.
• A production outage occurs with a software vendor. There is no documentation on how to file a
support case.
Benefits of establishing this best practice:
• With the appropriate support level, you are able to get a response in the time frame necessary to meet service-level needs.
• As a supported customer you can escalate if there are production issues.
• Software and services vendors can assist in troubleshooting during an incident.
Implementation guidance
Enable support plans for any software and services vendors that your production workload relies
on. Set up appropriate support plans to meet service-level needs. For AWS customers, this means
activating AWS Business Support or greater on any accounts where you have production workloads.
Meet with support vendors on a regular cadence to get updates about support offerings, processes,
and contacts. Document how to request support from software and services vendors, including
how to escalate if there is an outage. Implement mechanisms to keep support contacts up to date.
Customer example
At AnyCompany Retail, all commercial software and services dependencies have support plans.
For example, they have AWS Enterprise Support activated on all accounts with production
workloads. Any developer can raise a support case when there is an issue. There is a wiki page with
information on how to request support, whom to notify, and best practices for expediting a case.
Implementation steps
1. Work with stakeholders in your organization to identify software and services vendors that your
workload relies on. Document these dependencies.
2. Determine service-level needs for your workload. Select a support plan that aligns with them.
3. For commercial software and services, establish a support plan with the vendors.
Ensure optimal workload health by leveraging observability. Utilize relevant metrics, logs, and
traces to gain a comprehensive view of your workload's performance and address issues efficiently.
Best practices
• OPS08-BP01 Analyze workload metrics
• OPS08-BP02 Analyze workload logs
• OPS08-BP03 Analyze workload traces
• OPS08-BP04 Create actionable alerts
• OPS08-BP05 Create dashboards
After implementing application telemetry, regularly analyze the collected metrics. While latency,
requests, errors, and capacity (or quotas) provide insights into system performance, it's vital to
prioritize the review of business outcome metrics. This ensures you're making data-driven decisions
aligned with your business objectives.
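For example, metrics can be pulled programmatically for regular analysis alongside business outcome metrics. The sketch below retrieves a p99 latency series for a hypothetical Application Load Balancer; the dimension value is a placeholder.

```python
# Minimal sketch: retrieve p99 latency for analysis with CloudWatch GetMetricData.
# The load balancer dimension value is a placeholder.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client('cloudwatch')
now = datetime.now(timezone.utc)

result = cloudwatch.get_metric_data(
    MetricDataQueries=[{
        'Id': 'p99_latency',
        'MetricStat': {
            'Metric': {
                'Namespace': 'AWS/ApplicationELB',
                'MetricName': 'TargetResponseTime',
                'Dimensions': [{'Name': 'LoadBalancer',
                                'Value': 'app/example-alb/0123456789abcdef'}],
            },
            'Period': 300,
            'Stat': 'p99',
        },
    }],
    StartTime=now - timedelta(hours=3),
    EndTime=now,
)
series = result['MetricDataResults'][0]
print(list(zip(series['Timestamps'], series['Values'])))
```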
Desired outcome: Accurate insights into workload performance that drive data-informed decisions,
ensuring alignment with business objectives.
Common anti-patterns:
Resources
Related documents:
Related videos:
Related examples:
Regularly analyzing workload logs is essential for gaining a deeper understanding of the
operational aspects of your application. By efficiently sifting through, visualizing, and interpreting
log data, you can continually optimize application performance and security.
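As a sketch, CloudWatch Logs Insights can be queried programmatically to surface the most recent application errors. The log group name and query string below are illustrative.

```python
# Minimal sketch: run a CloudWatch Logs Insights query for recent errors.
# Log group name and query string are illustrative.
import time
from datetime import datetime, timedelta, timezone
import boto3

logs = boto3.client('logs')
now = datetime.now(timezone.utc)

query_id = logs.start_query(
    logGroupName='/example/application',
    startTime=int((now - timedelta(hours=1)).timestamp()),
    endTime=int(now.timestamp()),
    queryString='fields @timestamp, @message | filter @message like /ERROR/ '
                '| sort @timestamp desc | limit 20',
)['queryId']

response = logs.get_query_results(queryId=query_id)
while response['status'] in ('Scheduled', 'Running'):
    time.sleep(1)
    response = logs.get_query_results(queryId=query_id)

for row in response['results']:
    print({field['field']: field['value'] for field in row})
```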
Desired outcome: Rich insights into application behavior and operations derived from thorough
log analysis, ensuring proactive issue detection and mitigation.
Common anti-patterns:
4. Monitor logs in real-time with Live Tail: Use Amazon CloudWatch Logs Live Tail to view log
data in real-time. You can actively monitor your application's operational activities as they occur,
which provides immediate visibility into system performance and potential issues.
5. Leverage Contributor Insights: Use CloudWatch Contributor Insights to identify top talkers in
high cardinality dimensions like IP addresses or user-agents.
6. Implement CloudWatch Logs metric filters: Configure CloudWatch Logs metric filters to
convert log data into actionable metrics. This allows you to set alarms or further analyze
patterns.
7. Implement CloudWatch cross-account observability: Monitor and troubleshoot applications
that span multiple accounts within a Region.
8. Regular review and refinement: Periodically review your log analysis strategies to capture all
relevant information and continually optimize application performance.
Resources
Related documents:
Related videos:
Related examples:
Implementation steps
The following steps offer a structured approach to effectively implementing trace data analysis
using AWS services:
1. Integrate AWS X-Ray: Ensure X-Ray is integrated with your applications to capture trace data.
2. Analyze X-Ray metrics: Delve into metrics derived from X-Ray traces, such as latency, request
rates, fault rates, and response time distributions, using the service map to monitor application
health.
3. Use ServiceLens: Leverage the ServiceLens map for enhanced observability of your services and
applications. This allows for integrated viewing of traces, metrics, logs, alarms, and other health
information.
4. Enable X-Ray Insights:
a. Turn on X-Ray Insights for automated anomaly detection in traces.
b. Examine insights to pinpoint patterns and ascertain root causes, such as increased fault rates
or latencies.
c. Consult the insights timeline for a chronological analysis of detected issues.
5. Use X-Ray Analytics: X-Ray Analytics allows you to thoroughly explore trace data, pinpoint
patterns, and extract insights.
6. Use groups in X-Ray: Create groups in X-Ray to filter traces based on criteria such as high
latency, allowing for more targeted analysis.
7. Incorporate Amazon DevOps Guru: Engage Amazon DevOps Guru to benefit from machine
learning models pinpointing operational anomalies in traces.
8. Use CloudWatch Synthetics: Use CloudWatch Synthetics to create canaries for continually
monitoring your endpoints and workflows. These canaries can integrate with X-Ray to provide
trace data for in-depth analysis of the applications being tested.
9. Use Real User Monitoring (RUM): With AWS X-Ray and CloudWatch RUM, you can analyze and
debug the request path starting from end users of your application through downstream AWS
managed services. This helps you identify latency trends and errors that impact your end users.
10.Correlate with logs: Correlate trace data with related logs within the X-Ray trace view for
a granular perspective on application behavior. This allows you to view log events directly
associated with traced transactions.
11.Implement CloudWatch cross-account observability: Monitor and troubleshoot applications
that span multiple accounts within a Region.
Common anti-patterns:
Implementation guidance
To create an effective alerting mechanism, it's vital to use metrics, logs, and trace data that flag
when outcomes based on KPIs are at risk or anomalies are detected.
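For example, a CloudWatch alarm can be tied to an anomaly detection band rather than a static threshold, as described in the steps that follow, so alerts fire only when a KPI-related metric moves outside its expected range. The metric, dimension, and SNS topic below are placeholders.

```python
# Minimal sketch: alarm on an anomaly detection band for a latency metric.
# Metric dimensions and the SNS topic ARN are placeholders.
import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='example-latency-anomaly',
    ComparisonOperator='GreaterThanUpperThreshold',
    EvaluationPeriods=3,
    ThresholdMetricId='band',
    Metrics=[
        {
            'Id': 'latency',
            'MetricStat': {
                'Metric': {
                    'Namespace': 'AWS/ApplicationELB',
                    'MetricName': 'TargetResponseTime',
                    'Dimensions': [{'Name': 'LoadBalancer',
                                    'Value': 'app/example-alb/0123456789abcdef'}],
                },
                'Period': 300,
                'Stat': 'p99',
            },
            'ReturnData': True,
        },
        {
            'Id': 'band',
            'Expression': 'ANOMALY_DETECTION_BAND(latency, 2)',
            'ReturnData': True,
        },
    ],
    AlarmActions=['arn:aws:sns:us-east-1:111122223333:example-ops-alerts'],
)
```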
Implementation steps
1. Determine key performance indicators (KPIs): Identify your application's KPIs. Alerts should be
tied to these KPIs to reflect the business impact accurately.
2. Implement anomaly detection:
• Use Amazon CloudWatch anomaly detection: Set up Amazon CloudWatch anomaly detection
to automatically detect unusual patterns, which helps you only generate alerts for genuine
anomalies.
• Use AWS X-Ray Insights:
a. Set up X-Ray Insights to detect anomalies in trace data.
b. Configure notifications for X-Ray Insights to be alerted on detected issues.
• Integrate with Amazon DevOps Guru:
a. Leverage Amazon DevOps Guru for its machine learning capabilities in detecting
operational anomalies with existing data.
b. Navigate to the notification settings in DevOps Guru to set up anomaly alerts.
Related videos:
Related examples:
• Alarms, incident management, and remediation in the cloud with Amazon CloudWatch
• Tutorial: Creating an Amazon EventBridge rule that sends notifications to AWS Chatbot
• One Observability Workshop
Dashboards are the human-centric view into the telemetry data of your workloads. While they
provide a vital visual interface, they should not replace alerting mechanisms, but complement
them. When crafted with care, not only can they offer rapid insights into system health and
performance, but they can also present stakeholders with real-time information on business
outcomes and the impact of issues.
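As a sketch, dashboards can be created and updated as code so they stay versioned and consistent across workloads. The single widget below charts a placeholder business metric; real dashboards typically combine business and system metrics.

```python
# Minimal sketch: create or update a CloudWatch dashboard from code.
# Dashboard name, region, and the charted metric are placeholders.
import json
import boto3

cloudwatch = boto3.client('cloudwatch')

dashboard_body = {
    'widgets': [{
        'type': 'metric',
        'x': 0, 'y': 0, 'width': 12, 'height': 6,
        'properties': {
            'title': 'Orders completed (business outcome)',
            'metrics': [['ExampleApp', 'OrdersCompleted']],
            'stat': 'Sum',
            'period': 300,
            'region': 'us-east-1',
        },
    }]
}

cloudwatch.put_dashboard(
    DashboardName='example-workload-overview',
    DashboardBody=json.dumps(dashboard_body),
)
```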
Desired outcome:
Clear, actionable insights into system and business health using visual representations.
Common anti-patterns:
significance of the represented metrics, and can also contain links to other dashboards and
troubleshooting tools.
3. Create dashboard variables: Incorporate dashboard variables where appropriate to allow for
dynamic and flexible dashboard views.
4. Create metrics widgets: Add metric widgets to visualize various metrics your application emits,
tailoring these widgets to effectively represent system health and business outcomes.
5. Logs Insights queries: Utilize CloudWatch Logs Insights to derive actionable metrics from your logs and display these insights on your dashboard.
6. Set up alarms: Integrate CloudWatch Alarms into your dashboard for a quick view of any metrics
breaching their thresholds.
7. Use Contributor Insights: Incorporate CloudWatch Contributor Insights to analyze high-
cardinality fields and get a clearer understanding of your resource's top contributors.
8. Design custom widgets: For specific needs not met by standard widgets, consider creating
custom widgets. These can pull from various data sources or represent data in unique ways.
9. Use AWS Health Dashboard: Use AWS Health Dashboard to get deeper insights into your
account health, events, and upcoming changes that might affect your services and resources.
You can also get a centralized view for health events in your AWS Organizations or build your
own custom dashboards (for more detail, see Related examples).
10.Iterate and refine: As your application evolves, regularly revisit your dashboard to ensure its
relevance.
Resources
Related documents:
• Time spent working issues with or without a standardized operating procedure (SOP)
• Amount of time spent recovering from a failed code push
• Call volume
Common anti-patterns:
• Deployment deadlines are missed because developers are pulled away to perform
troubleshooting tasks. Development teams argue for more personnel, but cannot quantify how
many they need because the time taken away cannot be measured.
• A Tier 1 desk was set up to handle user calls. Over time, more workloads were added, but no
headcount was allocated to the Tier 1 desk. Customer satisfaction suffers as call times increase
and issues go longer without resolution, but management sees no indicators of such, preventing
any action.
• A problematic workload has been handed off to a separate operations team for upkeep. Unlike
other workloads, this new one was not supplied with proper documentation and runbooks. As
such, teams spend longer troubleshooting and addressing failures. However, there are no metrics
documenting this, which makes accountability difficult.
Benefits of establishing this best practice: Where workload monitoring shows the state of your applications and services, monitoring operations teams gives workload owners insight into changes among the consumers of those workloads, such as shifting business needs. Measure the effectiveness of these teams and evaluate them against business goals by creating metrics that reflect the state of operations. Metrics can highlight support issues or identify when drifts occur away from a service level target.
Implementation guidance
Schedule time with business leaders and stakeholders to determine the overall goals of the service. Determine what the tasks of various operations teams should be and what challenges they might face. Using these, brainstorm key performance indicators (KPIs) that might reflect these operations goals. These might be customer satisfaction, time from feature conception to deployment, average issue resolution time, and others.
Working from KPIs, identify the metrics and sources of data that might reflect these goals best. Customer satisfaction may be a combination of various metrics such as call wait or response
• A workload goes down, leaving a service unavailable. Call volumes spike as users request to
know what's going on. Managers add to the volume requesting to know who's working an issue.
Various operations teams duplicate efforts in trying to investigate.
• A desire for a new capability leads to several personnel being reassigned to an engineering
effort. No backfill is provided, and issue resolution times spike. This information is not captured,
and only after several weeks and dissatisfied user feedback does leadership become aware of the
issue.
Benefits of establishing this best practice: During operational events where the business is impacted, much time and energy can be wasted querying information from various teams attempting to understand the situation. By establishing widely-disseminated status pages and dashboards, stakeholders can quickly obtain information such as whether or not an issue was detected, who has the lead on the issue, or when a return to normal operations may be expected. This frees team members from spending too much time communicating status to others, giving them more time to address issues.
In addition, dashboards and reports can provide insights to decision-makers and stakeholders to
see how operations teams are able to respond to business needs and how their resources are being
allocated. This is crucial for determining if adequate resources are in place to support the business.
Implementation guidance
Build dashboards that show the current key metrics for your ops teams, and make them readily
accessible both to operations leaders and management.
Build status pages that can be updated quickly to show when an incident or event is unfolding,
who has ownership and who is coordinating the response. Share any steps or workarounds that
users should consider on this page, and disseminate the location widely. Encourage users to check
this location first when confronted with an unknown issue.
Collect and provide reports that show the health of operations over time, and distribute this to
leaders and decision makers to illustrate the work of operations along with challenges and needs.
Share between teams the metrics and reports that best reflect goals and KPIs, and highlight where they have been influential in driving change. Dedicate time to these activities to elevate the importance of operations inside of and between teams.
Benefits of establishing this best practice: In some organizations, it can become a challenge
to allocate the same time and attention that is afforded to service delivery and new products or
offerings. When this occurs, the line of business can suffer as the level of service expected slowly
deteriorates. This is because operations does not change and evolve with the growing business,
and can soon be left behind. Without regular review into the insights operations collects, the risk
to the business may become visible only when it's too late. By allocating time to review metrics
and procedures both among the operations staff and with leadership, the crucial role operations
plays remains visible, and risks can be identified long before they reach critical levels. Operations
teams get better insight into impending business changes and initiatives, allowing for proactive
efforts to be undertaken. Leadership visibility into operations metrics showcases the role that these
teams play in customer satisfaction, both internal and external, and let them better weigh choices
for priorities, or ensure that operations has the time and resources to change and evolve with new
business and workload initiatives.
Implementation guidance
Dedicate time to review operations metrics between stakeholders and operations teams and review report data. Place these reports in the context of the organization's goals and objectives to determine if they're being met. Identify sources of ambiguity where goals are not clear, or where there may be conflicts between what is asked for and what is given.
Identify where time, people, and tools can aid in operations outcomes. Determine which KPIs this
would impact and what targets for success should be. Revisit regularly to ensure operations is
resourced sufficiently to support the line of business.
Resources
Related documents:
• Amazon Athena
• Amazon CloudWatch metrics and dimensions reference
• Amazon QuickSight
• AWS Glue
• AWS Glue Data Catalog
• Collect metrics and logs from Amazon EC2 instances and on-premises servers with the Amazon
CloudWatch Agent
Implementation guidance
Implementing this best practice means you are tracking workload events. You have processes to
handle incidents and problems. The processes are documented, shared, and updated frequently.
Problems are identified, prioritized, and fixed.
Implementation steps
Events
1. Monitor events:
• Implement and use workload observability.
• Monitor actions taken by a user, role, or an AWS service, which are recorded as events in AWS CloudTrail.
• Respond to operational changes in your applications in real time with Amazon EventBridge.
• Continually assess, monitor, and record resource configuration changes with AWS Config.
2. Create processes:
• Develop a process to assess which events are significant and require monitoring. This involves
setting thresholds and parameters for normal and abnormal activities.
1. Identify problems:
• Use data from previous incidents to identify recurring patterns that may indicate deeper
systemic issues.
• Leverage tools like AWS CloudTrail and Amazon CloudWatch to analyze trends and uncover
underlying problems.
• Engage cross-functional teams, including operations, development, and business units, to gain
diverse perspectives on the root causes.
• Incorporate root cause analysis (RCA) techniques to investigate and understand the underlying
causes of incidents.
• Update operational policies, procedures, and infrastructure based on findings to prevent
recurrence.
3. Continue to improve:
• Regularly review and revise problem management processes and tools to align with evolving
business and technology landscapes.
• Share insights and best practices across the organization to build a more resilient and efficient
operational environment.
• Enterprise Support customers can access specialized programs like AWS Countdown for
support during critical events.
Resources
• Amazon EventBridge
Establishing a clear and defined process for each alert in your system is essential for effective and
efficient incident management. This practice ensures that every alert leads to a specific, actionable
response, improving the reliability and responsiveness of your operations.
Desired outcome: Every alert initiates a specific, well-defined response plan. Where possible,
responses are automated, with clear ownership and a defined escalation path. Alerts are linked
to an up-to-date knowledge base so that any operator can respond consistently and effectively.
Responses are quick and uniform across the board, enhancing operational efficiency and reliability.
Common anti-patterns:
• Alerts have no predefined response process, leading to makeshift and delayed resolutions.
• Alerts are inconsistently handled due to lack of clear ownership and responsibility.
Implementation guidance
Having a process per alert involves establishing a clear response plan for each alert, automating
responses where possible, and continually refining these processes based on operational feedback
and evolving requirements.
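For example, related alarms can be grouped into a composite alarm (as described in the steps that follow) so that a single, well-defined response process is attached to one meaningful alert. The alarm names, runbook reference, and SNS topic below are placeholders.

```python
# Minimal sketch: group related alarms into one composite alarm with a single
# response action. Alarm names and the SNS topic ARN are placeholders.
import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_composite_alarm(
    AlarmName='example-checkout-degraded',
    AlarmRule='ALARM("example-checkout-5xx") OR ALARM("example-checkout-latency")',
    AlarmDescription='Checkout is degraded; follow runbook RB-042 (placeholder).',
    AlarmActions=['arn:aws:sns:us-east-1:111122223333:example-incident-topic'],
)
```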
1. Use composite alarms: Create composite alarms in CloudWatch to group related alarms,
reducing noise and allowing for more meaningful responses.
2. Integrate Amazon CloudWatch alarms with Incident Manager: Configure CloudWatch alarms to automatically create incidents in AWS Systems Manager Incident Manager.
Responding promptly to operational events is critical, but not all events are equal. When you prioritize based on business impact, you prioritize addressing events with the potential for significant consequences, such as risks to safety, financial loss, regulatory violations, or damage to reputation.
Desired outcome: Responses to operational events are prioritized based on potential impact to
business operations and objectives. This makes the responses efficient and effective.
Common anti-patterns:
• Every event is treated with the same level of urgency, leading to confusion and delays in
addressing critical issues.
• You fail to distinguish between high and low impact events, leading to misallocation of
resources.
• Events are prioritized based on the order they are reported, rather than their impact on business
outcomes.
Benefits of establishing this best practice:
• Ensures critical business functions receive attention first, minimizing potential damage.
Implementation guidance
When faced with multiple operational events, a structured approach to prioritization based on
impact and urgency is essential. This approach helps you make informed decisions, direct efforts
where they're needed most, and mitigate the risk to business continuity.
• Make the matrix accessible and understood by all team members responsible for operational
event responses.
• The following example matrix displays incident severity according to urgency and impact:
4. Train and communicate: Train response teams on the prioritization matrix and the importance
of following it during an event. Communicate the prioritization process to all stakeholders to set
clear expectations.
5. Integrate with incident response:
• Incorporate the prioritization matrix into your incident response plans and tools.
• Automate the classification and prioritization of events where possible to speed up response
times.
• Enterprise Support customers can leverage AWS Incident Detection and Response, which
provides 24x7 proactive monitoring and incident management for production workloads.
6. Review and adapt: Regularly review the effectiveness of the prioritization process and make
adjustments based on feedback and changes in the business environment.
Resources
Related documents:
Implementation guidance
Creating a comprehensive communication plan for service-impacting events involves multiple
facets, from choosing the right channels to crafting the message and tone. The plan should be
adaptable and scalable, and it should cater to different outage scenarios.
Implementation steps
• Designate a communications manager responsible for coordinating all external and internal
communications.
• Include the support manager to provide consistent communication through support tickets.
2. Identify communication channels: Select channels like workplace chat, email, SMS, social
media, in-app notifications, and status pages. These channels should be resilient and able to
operate independently during service impacting events.
• Develop templates for various service impairment scenarios, emphasizing simplicity and
essential details. Include information about the service impairment, expected resolution time,
and impact.
• Use Amazon Pinpoint to alert customers using push notifications, in-app notifications, emails,
text messages, voice messages, and messages over custom channels.
• Use Amazon Simple Notification Service (Amazon SNS) to alert subscribers programmatically or
through email, mobile push notifications, and text messages (see the sketch after these steps).
• Communicate status through dashboards by sharing an Amazon CloudWatch dashboard
publicly.
• Post on social media platforms for public updates and community engagement.
4. Coordinate internal communication: Implement internal protocols using tools like AWS
Chatbot for team coordination and communication. Use CloudWatch dashboards to
communicate status.
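As a minimal sketch of the Amazon SNS option above, the following Python (boto3) example publishes a status update to a topic that stakeholders subscribe to; the topic ARN and message text are placeholders.

import boto3

sns = boto3.client("sns")

# Publish a status update to an SNS topic that customers or internal teams
# subscribe to (topic ARN and message text are placeholders).
sns.publish(
    TopicArn="arn:aws:sns:us-east-1:111122223333:service-status-updates",
    Subject="Service impairment update",
    Message=(
        "We are investigating elevated error rates in the checkout service. "
        "Next update in 30 minutes."
    ),
)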
Use dashboards as a strategic tool to convey real-time operational status and key metrics
to different audiences, including internal technical teams, leadership, and customers. These
dashboards offer a centralized, visual representation of system health and business performance,
enhancing transparency and decision-making efficiency.
Desired outcome:
• Your dashboards provide a comprehensive view of the system and business metrics relevant to
different stakeholders.
• Stakeholders can proactively access operational information, reducing the need for frequent
status requests.
Common anti-patterns:
• Engineers joining an incident management call require status updates to get up to speed.
• Relying on manual reporting for management, which leads to delays and potential inaccuracies.
• Operations teams are frequently interrupted for status updates during incidents.
Benefits of establishing this best practice:
• Reduces operational inefficiencies by minimizing manual reporting and frequent status inquiries.
• Increases transparency and trust through real-time visibility into system performance and
business metrics.
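One way to publish such an operational dashboard programmatically is shown in the following Python (boto3) sketch; the dashboard name, metric, and load balancer dimension are illustrative assumptions, not values from this document.

import json
import boto3

cloudwatch = boto3.client("cloudwatch")

# Define a small dashboard that surfaces a key operational metric for
# stakeholders (metric, region, and names are illustrative).
dashboard_body = {
    "widgets": [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Checkout errors (5xx)",
                "metrics": [["AWS/ApplicationELB", "HTTPCode_Target_5XX_Count",
                             "LoadBalancer", "app/checkout/0123456789abcdef"]],
                "stat": "Sum",
                "period": 300,
                "region": "us-east-1",
            },
        }
    ]
}

cloudwatch.put_dashboard(
    DashboardName="checkout-operational-status",
    DashboardBody=json.dumps(dashboard_body),
)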
Resources
Related documents:
Related examples:
Automating event responses is key for fast, consistent, and error-free operational handling. Create
streamlined processes and use tools to automatically manage and respond to events, minimizing
manual interventions and enhancing operational effectiveness.
Desired outcome:
Common anti-patterns:
Resources
Related documents:
Related videos:
Related examples:
• You give improvement opportunities equal priority to features in your software development
process.
Common anti-patterns:
• You have not conducted an architecture review on your workload since it was deployed several
years ago.
• You give improvement opportunities a lower priority than new features, so they stay in the
backlog.
• There is no standard for implementing modifications to best practices for the organization.
Implementation guidance
Frequently conduct an architectural review of your workload. Use internal and external best
practices, evaluate your workload, and identify improvement opportunities. Prioritize improvement
opportunities and incorporate them into your software development cadence.
Implementation steps
Desired outcome:
• You have established incident management processes that include post-incident analysis.
• You have observability plans in place to collect data on events.
• With this data, you understand and collect metrics that support your post-incident analysis
process.
• You learn from incidents to improve future outcomes.
Common anti-patterns:
• You administer an application server. Approximately every 23 hours and 55 minutes all
your active sessions are terminated. You have tried to identify what is going wrong on your
application server. You suspect it could instead be a network issue but are unable to get
cooperation from the network team as they are too busy to support you. You lack a predefined
process to follow to get support and collect the information necessary to determine what is
going on.
• You have had data loss within your workload. This is the first time it has happened and the cause
is not obvious. You decide it is not important because you can recreate the data. Data loss starts
occurring with greater frequency, impacting your customers. This also places additional operational
burden on you as you restore the missing data.
Benefits of establishing this best practice:
• You have a predefined process to determine the components, conditions, actions, and events
that contributed to an incident, which helps you identify opportunities for improvement.
• You use data from post-incident analysis to make improvements.
Implementation guidance
Use a process to determine contributing factors. Review all customer-impacting incidents. Have a
process to identify and document the contributing factors of an incident so that you can develop
mitigations to limit or prevent recurrence, as well as procedures for prompt and effective
responses. Communicate incident root causes as appropriate, and tailor the communication to your
target audience. Share learnings openly within your organization.
They also validate investments made in improvements. These feedback loops are the foundation
for continuously improving your workload.
Feedback loops fall into two categories: immediate feedback and retrospective analysis. Immediate
feedback is gathered through review of the performance and outcomes from operations activities.
This feedback comes from team members, customers, or the automated output of the activity.
Immediate feedback is received from things like A/B testing and shipping new features, and it is
essential to failing fast.
Retrospective analysis is performed regularly to capture feedback from the review of operational
outcomes and metrics over time. These retrospectives happen at the end of a sprint, on a cadence,
or after major releases or events. This type of feedback loop validates investments in operations or
your workload. It helps you measure success and validates your strategy.
Desired outcome: You use immediate feedback and retrospective analysis to drive improvements.
There is a mechanism to capture user and team member feedback. Retrospective analysis is used to
identify trends that drive improvements.
Common anti-patterns:
• You launch a new feature but have no way of receiving customer feedback on it.
• After investing in operations improvements, you don’t conduct a retrospective to validate them.
• Feedback loops lead to proposed action items but they aren’t included in the software
development process.
Benefits of establishing this best practice:
• You can work backwards from the customer to drive new features.
• Add the actions to your software development process and communicate status updates to
stakeholders as you make the improvements.
Level of effort for the implementation plan: Medium. To implement this best practice, you need
a way to take in immediate feedback and analyze it. Also, you need to establish a retrospective
analysis process.
Resources
• OPS01-BP01 Evaluate customer needs: Feedback loops are a mechanism to gather external
customer needs.
• OPS01-BP02 Evaluate internal customer needs: Internal stakeholders can use feedback loops to
communicate needs and requirements.
• OPS11-BP02 Perform post-incident analysis: Post-incident analyses are an important form of
retrospective analysis conducted after incidents.
• OPS11-BP07 Perform operations metrics reviews: Operations metrics reviews identify trends and
areas for improvement.
Related documents:
Related videos:
• New team members are onboarded faster because documentation is up to date and searchable.
Implementation guidance
Customer example
AnyCompany Retail hosts an internal Wiki where all knowledge is stored. Team members are
encouraged to add to the knowledge base as they go about their daily duties. On a quarterly basis,
a cross-functional team evaluates which pages are least updated and determines if they should be
archived or updated.
Implementation steps
1. Start with identifying the content management system where knowledge will be stored. Get
agreement from stakeholders across your organization.
a. If you don’t have an existing content management system, consider running a self-hosted wiki
or using a version control repository as a starting point.
2. Develop runbooks for adding, updating, and archiving information. Educate your team on these
processes.
3. Identify what knowledge should be stored in the content management system. Start with daily
activities (runbooks and playbooks) that team members perform. Work with stakeholders to
prioritize what knowledge is added.
4. On a periodic basis, work with stakeholders to identify out-of-date information and archive it or
bring it up to date.
• You collect data from across your environment but do not correlate events and activities.
• You collect detailed data from across your estate, and it drives high Amazon CloudWatch and
AWS CloudTrail activity and cost. However, you do not use this data meaningfully.
• You do not account for business outcomes when defining drivers for improvement.
Implementation guidance
• Understand drivers for improvement: You should only make changes to a system when they
support a desired outcome.
• Desired capabilities: Evaluate desired features and capabilities when evaluating opportunities
for improvement.
• Unacceptable issues: Evaluate unacceptable issues, bugs, and vulnerabilities when evaluating
opportunities for improvement. Track rightsizing options, and seek optimization opportunities.
• AWS Compliance
Review your analysis results and responses with cross-functional teams and business owners. Use
these reviews to establish common understanding, identify additional impacts, and determine
courses of action. Adjust responses as appropriate.
Desired outcomes:
• You review insights with business owners on a regular basis. Business owners provide additional
context to newly-gained insights.
• You review insights and request feedback from technical peers, and you share your learnings
across teams.
• You publish data and insights for other technical and business teams to review. Other
departments factor your learnings into new practices.
• You summarize and review new insights with senior leaders. Senior leaders use these insights to
define strategy.
Common anti-patterns:
• You release a new feature. This feature changes some of your customer behaviors. Your
observability does not take these changes into account. You do not quantify the benefits of
these changes.
• You push a new update and neglect refreshing your CDN. The CDN cache is no longer compatible
with the latest release. You measure the percentage of requests with errors. All of your users
report HTTP 400 errors when communicating with backend servers. You investigate the client
errors and find that you measured the wrong dimension, which wasted your time.
• Your service-level agreement stipulates 99.9% uptime, and your recovery point objective is
four hours. The service owner maintains that the system is zero downtime. You implement an
expensive and complex replication solution, which wastes time and money.
Benefits of establishing this best practice:
• When you validate insights with business owners and subject matter experts, you establish
common understanding and more effectively guide improvement.
• You discover hidden issues and factor them into future decisions.
• Your focus moves from technical outcomes to business outcomes.
• Your maintenance window interrupts a significant retail promotion. The business remains
unaware that there is a standard maintenance window that could be delayed if there are other
business impacting events.
• You suffered an extended outage because you commonly use an outdated library in your
organization. You have since migrated to a supported library. The other teams in your
organization do not know that they are at risk.
• You do not regularly review attainment of customer SLAs. You are trending to not meet your
customer SLAs. There are financial penalties related to not meeting your customer SLAs.
Benefits of establishing this best practice:
• When you meet regularly to review operations metrics, events, and incidents, you maintain
common understanding across teams.
• Your team meets routinely to review metrics and incidents, which positions you to take action on
risks and recognize customer SLAs.
• You share lessons learned, which provides data for prioritization and targeted improvements for
business outcomes.
Implementation guidance
• Regularly perform retrospective analysis of operations metrics with cross-team participants from
different areas of the business.
• Engage stakeholders, including the business, development, and operations teams, to validate
your findings from immediate feedback and retrospective analysis and share lessons learned.
• Use their insights to identify opportunities for improvement and potential courses of action.
Resources
Benefits of establishing this best practice: Share lessons learned to support improvement and to
maximize the benefits of experience.
Implementation guidance
• Document and share lessons learned: Have procedures to document the lessons learned from
the running of operations activities and retrospective analysis so that they can be used by other
teams.
• Share learnings: Have procedures to share lessons learned and associated artifacts across teams.
For example, share updated procedures, guidance, governance, and best practices through an
accessible wiki. Share scripts, code, and libraries through a common repository.
Resources
Related documents:
Related videos:
Resources
Related videos:
• AWS re:Invent 2023 - Improve application resilience with AWS Fault Injection Service
Security
The Security pillar encompasses the ability to protect data, systems, and assets to take
advantage of cloud technologies to improve your security. You can find prescriptive guidance on
implementation in the Security Pillar whitepaper.
Security foundations
Question
• SEC 1. How do you securely operate your workload?
To operate your workload securely, you must apply overarching best practices to every area of
security. Take requirements and processes that you have defined in operational excellence at an
Implementation guidance
AWS accounts provide a security isolation boundary between workloads or resources that operate
at different sensitivity levels. AWS provides tools to manage your cloud workloads at scale through
a multi-account strategy to leverage this isolation boundary. For guidance on the concepts,
patterns, and implementation of a multi-account strategy on AWS, see Organizing Your AWS
Environment Using Multiple Accounts.
When you have multiple AWS accounts under central management, your accounts should be
organized into a hierarchy defined by layers of organizational units (OUs). Security controls
can then be organized and applied to the OUs and member accounts, establishing consistent
preventative controls on member accounts in the organization. The security controls are inherited,
allowing you to filter permissions available to member accounts located at lower levels of an OU
hierarchy. A good design takes advantage of this inheritance to reduce the number and complexity
of security policies required to achieve the desired security controls for each member account.
AWS Organizations and AWS Control Tower are two services that you can use to implement and
manage this multi-account structure in your AWS environment. AWS Organizations allows you to
organize accounts into a hierarchy defined by one or more layers of OUs, with each OU containing
a number of member accounts. Service control policies (SCPs) allow the organization administrator
to establish granular preventative controls on member accounts, and AWS Config can be used to
establish proactive and detective controls on member accounts. Many AWS services integrate with
AWS Organizations to provide delegated administrative controls and to perform service-specific
tasks across all member accounts in the organization.
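As an illustrative sketch (not a recommended baseline), the following Python (boto3) example creates a simple SCP and attaches it to an OU; the policy statements, policy name, and OU ID are placeholders.

import json
import boto3

organizations = boto3.client("organizations")

# Example preventative control: deny leaving the organization and deny
# disabling CloudTrail in member accounts (policy content is illustrative).
scp = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": [
                "organizations:LeaveOrganization",
                "cloudtrail:StopLogging",
                "cloudtrail:DeleteTrail",
            ],
            "Resource": "*",
        }
    ],
}

response = organizations.create_policy(
    Name="baseline-preventative-controls",
    Description="Baseline guardrails applied to workload OUs",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(scp),
)

# Attach the SCP to an OU so it is inherited by member accounts (placeholder ID).
organizations.attach_policy(
    PolicyId=response["Policy"]["PolicySummary"]["Id"],
    TargetId="ou-examplerootid-exampleouid",
)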
Layered on top of AWS Organizations, AWS Control Tower provides a one-click best practices setup
for a multi-account AWS environment with a landing zone. The landing zone is the entry point
to the multi-account environment established by Control Tower. Control Tower provides several
benefits over AWS Organizations. Three benefits that provide improved account governance are:
• Integrated mandatory security controls that are automatically applied to accounts admitted into
the organization.
• Optional controls that can be turned on or off for a given set of OUs.
• Use CloudFormation StackSets to provision resources across multiple AWS accounts and regions
• Organizations FAQ
• AWS Organizations terminology and concepts
• Best Practices for Service Control Policies in an AWS Organizations Multi-Account Environment
• AWS Account Management Reference Guide
• Organizing Your AWS Environment Using Multiple Accounts
Related videos:
Related workshops:
The root user is the most privileged user in an AWS account, with full administrative access to
all resources within the account, and in some cases cannot be constrained by security policies.
Deactivating programmatic access to the root user, establishing appropriate controls for the root
user, and avoiding routine use of the root user helps reduce the risk of inadvertent exposure of the
root credentials and subsequent compromise of the cloud environment.
Desired outcome: Securing the root user helps reduce the chance that accidental or intentional
damage can occur through the misuse of root user credentials. Establishing detective controls can
also alert the appropriate personnel when actions are taken using the root user.
Common anti-patterns:
• Using the root user for tasks other than the few that require root user credentials.
• Neglecting to test contingency plans on a regular basis to verify the functioning of critical
infrastructure, processes, and personnel during an emergency.
management account’s root user can differ from your member account root users, and you can
place preventative security controls on your member account root users.
Implementation steps
The following implementation steps are recommended to establish controls for the root user.
Where applicable, recommendations are cross-referenced to CIS AWS Foundations benchmark
version 1.4.0. In addition to these steps, consult AWS best practice guidelines for securing your
AWS account and resources.
Preventative controls
• Determine who in the organization should have access to the root user credentials.
• Use a two-person rule so that no one individual has access to all necessary credentials and MFA
to obtain root user access.
• Verify that the organization, and not a single individual, maintains control over the phone
number and email alias associated with the account (which are used for password reset and
MFA reset flow).
• Use root user only by exception (CIS 1.7).
• The AWS root user must not be used for everyday tasks, even administrative ones. Only log
in as the root user to perform AWS tasks that require root user credentials. All other actions
should be performed by other users assuming appropriate roles.
• Periodically check that access to the root user is functioning so that procedures are tested prior
to an emergency situation requiring the use of the root user credentials.
• Periodically check that the email address associated with the account and those listed under
Alternate Contacts work. Monitor these email inboxes for security notifications you might receive
from <[email protected]>. Also ensure any phone numbers associated with the account are
working.
• Prepare incident response procedures to respond to root account misuse. Refer to the AWS
Security Incident Response Guide and the best practices in the Incident Response section of the
Security Pillar whitepaper for more information on building an incident response strategy for
your AWS account.
Resources
Related documents:
• The implementation of controls does not strongly align to your control objectives in a
measurable way
• You do not use automation to report on the effectiveness of your controls
Implementation guidance
There are many common cybersecurity frameworks that can form the basis for your security
control objectives. Consider the regulatory requirements, market expectations, and industry
standards for your business to determine which frameworks best support your needs. Examples
include AICPA SOC 2, HITRUST, PCI-DSS, ISO 27001, and NIST SP 800-53.
For the control objectives you identify, understand how AWS services you consume help you to
achieve those objectives. Use AWS Artifact to find documentation and reports aligned to your
target frameworks that describe the scope of responsibility covered by AWS and guidance for the
remaining scope that is your responsibility. For further guidance on how specific services align to
various framework control statements, see AWS Customer Compliance Guides.
As you define the controls that achieve your objectives, codify enforcement using preventative
controls, and automate mitigations using detective controls. Help prevent non-compliant resource
configurations and actions across your AWS organization using service control policies (SCPs).
Implement rules in AWS Config to monitor and report on non-compliant resources, then switch the
rules to an enforcement model once you are confident in their behavior. To deploy sets of pre-defined
and managed rules that align to your cybersecurity frameworks, evaluate the use of AWS Security
Hub standards as your first option. The AWS Foundational Security Best Practices (FSBP) standard
and the CIS AWS Foundations Benchmark are good starting points with controls that align to
many objectives that are shared across multiple standard frameworks. Where Security Hub does
not intrinsically have the control detections desired, it can be complemented using AWS Config
conformance packs.
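For example, a single managed AWS Config rule can be deployed as a detective control with a few lines of Python (boto3); the rule shown here is an illustrative choice, not a prescribed control set.

import boto3

config = boto3.client("config")

# Deploy an AWS managed rule as a detective control; once you are confident
# in its behavior, pair it with remediation or preventative enforcement.
config.put_config_rule(
    ConfigRule={
        "ConfigRuleName": "s3-bucket-public-read-prohibited",
        "Description": "Detects S3 buckets that allow public read access.",
        "Source": {
            "Owner": "AWS",
            "SourceIdentifier": "S3_BUCKET_PUBLIC_READ_PROHIBITED",
        },
    }
)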
Use APN Partner Bundles recommended by the AWS Global Security and Compliance Acceleration
(GSCA) team to get assistance from security advisors, consulting agencies, evidence collection and
reporting systems, auditors, and other complementary services when required.
Implementation steps
1. Evaluate common cybersecurity frameworks, and align your control objectives to the ones
chosen.
you identify new threats. You take mitigating action against these threats. You adopt AWS services
that automatically update with the latest threat intelligence.
Common anti-patterns:
• Not having a reliable and repeatable mechanism to stay informed of the latest threat
intelligence.
• Maintaining a manual inventory of your technology portfolio, workloads, and dependencies that
require human review for potential vulnerabilities and exposures.
• Not having mechanisms in place to update your workloads and dependencies to the latest
versions available that provide known threat mitigations.
Benefits of establishing this best practice: Using threat intelligence sources to stay up to date
reduces the risk of missing out on important changes to the threat landscape that can impact
your business. Having automation in place to scan, detect, and remediate where potential
vulnerabilities or exposures exist in your workloads and their dependencies can help you mitigate
risks quickly and predictably, compared to manual alternatives. This helps control time and costs
related to vulnerability mitigation.
Implementation guidance
Review trusted threat intelligence publications to stay on top of the threat landscape. Consult
the MITRE ATT&CK knowledge base for documentation on known adversarial tactics, techniques,
and procedures (TTPs). Review MITRE's Common Vulnerabilities and Exposures (CVE) list to
stay informed on known vulnerabilities in products you rely on. Understand critical risks to web
applications with the Open Worldwide Application Security Project (OWASP)'s popular OWASP Top
10 project.
Stay up to date on AWS security events and recommended remediation steps with AWS Security
Bulletins for CVEs.
To reduce your overall effort and overhead of staying up to date, consider using AWS services that
automatically incorporate new threat intelligence over time. For example, Amazon GuardDuty
stays up to date with industry threat intelligence for detecting anomalous behaviors and threat
signatures within your accounts. Amazon Inspector automatically keeps a database of the CVEs
it uses for its continuous scanning features up to date. Both AWS WAF and AWS Shield Advanced
provide managed rule groups that are updated automatically as new threats emerge.
• Not including security management tasks in the total cost of ownership of hosting
technologies on virtual machines when compared to managed service options.
Benefits of establishing this best practice: Using managed services can reduce your overall burden
of managing operational security controls, which can reduce your security risks and total cost of
ownership. Time that would otherwise be spent on certain security tasks can be reinvested into
tasks that provide more value to your business. Managed services can also reduce the scope of your
compliance requirements by shifting some control requirements to AWS.
Implementation guidance
There are multiple ways you can integrate the components of your workload on AWS. Installing
and running technologies on Amazon EC2 instances often requires you to take on the largest share
of the overall security responsibility. To help reduce the burden of operating certain controls,
identify AWS managed services that reduce the scope of your side of the shared responsibility
model and understand how you can use them in your existing architecture. Examples include
using the Amazon Relational Database Service (Amazon RDS) for deploying databases, Amazon
Elastic Kubernetes Service (Amazon EKS) or Amazon Elastic Container Service (Amazon ECS)
for orchestrating containers, or using serverless options. When building new applications, think
through which services can help reduce time and cost when it comes to implementing and
managing security controls.
Compliance requirements can also be a factor when selecting services. Managed services can shift
the compliance of some requirements to AWS. Discuss with your compliance team about their
degree of comfort with auditing the aspects of services you operate and manage and accepting
control statements in relevant AWS audit reports. You can provide the audit artifacts found in AWS
Artifact to your auditors or regulators as evidence of AWS security controls. You can also use the
responsibility guidance provided by some of the AWS audit artifacts to design your architecture,
along with the AWS Customer Compliance Guides. This guidance helps determine the additional
security controls you should put in place in order to support the specific use cases of your system.
When using managed services, be familiar with the process of updating their resources to
newer versions (for example, updating the version of a database managed by Amazon RDS, or a
programming language runtime for an AWS Lambda function). While the managed service may
perform this operation for you, configuring the timing of the update and understanding the impact
Related videos:
• How do I migrate to an Amazon RDS or Aurora MySQL DB instance using AWS DMS?
• AWS re:Invent 2023 - Manage resource lifecycle events at scale with AWS Health
Apply modern DevOps practices as you develop and deploy security controls that are standard
across your AWS environments. Define standard security controls and configurations using
Infrastructure as Code (IaC) templates, capture changes in a version control system, test changes as
part of a CI/CD pipeline, and automate the deployment of changes to your AWS environments.
Desired outcome: Standardized security controls are captured in IaC templates and committed
to a version control system. CI/CD pipelines are in place to detect changes and to automate
testing and deployment to your AWS environments. Guardrails are in place to detect and alert on
misconfigurations in templates before proceeding to deployment. Workloads are deployed into
environments where standard controls are in place. Teams have access to deploy approved service
configurations through a self-service mechanism. Secure backup and recovery strategies are in
place for control configurations, scripts, and related data.
Common anti-patterns:
• Making changes to your standard security controls manually, through a web console or
command-line interface.
• Relying on individual workload teams to manually implement the controls a central team
defines.
• Relying on a central security team to deploy workload-level controls at the request of a workload
team.
• Allowing the same individuals or teams to develop, test, and deploy security control automation
scripts without proper separation of duties or checks and balances.
Benefits of establishing this best practice: Using templates to define your standard security
controls allows you to track and compare changes over time using a version control system. Using
automation to test and deploy changes creates standardization and predictability, increasing
the chances of a successful deployment and reducing manual repetitive tasks. Providing a self-
serve mechanism for workload teams to deploy approved services and configurations reduces the
2. Create CI/CD pipelines to test and deploy your templates. Define tests to check for
misconfigurations and that templates adhere to your company standards.
3. Build a catalog of standardized templates for workload teams to deploy AWS accounts and
services according to your requirements.
4. Implement secure backup and recovery strategies for your control configurations, scripts, and
related data.
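The following is a minimal sketch of capturing a standard control as code, here using the AWS CDK in Python (an assumption for illustration; CloudFormation, Terraform, or other IaC tooling can serve the same purpose). It defines a baseline Amazon S3 bucket configuration with encryption, TLS enforcement, and public access blocked. Committing a template like this to version control and deploying it through a pipeline provides the change history, testing, and repeatability described above.

import aws_cdk as cdk
from aws_cdk import aws_s3 as s3
from constructs import Construct

class BaselineStorageStack(cdk.Stack):
    """Standardized, security-approved S3 bucket configuration (illustrative)."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        s3.Bucket(
            self,
            "BaselineBucket",
            encryption=s3.BucketEncryption.S3_MANAGED,       # encrypt at rest
            block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
            enforce_ssl=True,                                # require TLS in transit
            versioned=True,                                  # support recovery
        )

app = cdk.App()
BaselineStorageStack(app, "BaselineStorageStack")
app.synth()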
Resources
Related documents:
Related examples:
• Automate account creation, and resource provisioning using Service Catalog, AWS Organizations,
and AWS Lambda
• Strengthen the DevOps pipeline and protect data with AWS Secrets Manager, AWS KMS, and
AWS Certificate Manager
Related tools:
Perform threat modeling to identify and maintain an up-to-date register of potential threats
and associated mitigations for your workload. Prioritize your threats and adapt your security
Implementation steps
There are many different ways to perform threat modeling. Much like programming languages,
there are advantages and disadvantages to each, and you should choose the way that works best
for you. One approach is to start with Shostack’s 4 Question Frame for Threat Modeling, which
poses open-ended questions to provide structure to your threat modeling exercise:
What are we working on?
The purpose of this question is to help you understand and agree upon the system you are
building and the details about that system that are relevant to security. Creating a model or
diagram is the most popular way to answer this question, as it helps you to visualize what
you are building, for example, using a data flow diagram. Writing down assumptions and
important details about your system also helps you define what is in scope. This allows everyone
contributing to the threat model to focus on the same thing, and avoid time-consuming detours
into out-of-scope topics (including out of date versions of your system). For example, if you are
building a web application, it is probably not worth your time threat modeling the operating
system trusted boot sequence for browser clients, as you have no ability to affect this with your
design.
What can go wrong?
This is where you identify threats to your system. Threats are accidental or intentional actions or
events that have unwanted impacts and could affect the security of your system. Without a clear
understanding of what could go wrong, you have no way of doing anything about it.
There is no canonical list of what can go wrong. Creating this list requires brainstorming and
collaboration between all of the individuals within your team and relevant personas involved in
the threat modeling exercise. You can aid your brainstorming by using a model for identifying
threats, such as STRIDE, which suggests different categories to evaluate: Spoofing, Tampering,
Repudiation, Information Disclosure, Denial of Service, and Elevation of privilege. In addition,
you might want to aid the brainstorming by reviewing existing lists and research for inspiration,
including the OWASP Top 10, HiTrust Threat Catalog, and your organization’s own threat
catalog.
Threat Composer
To aid and guide you in performing threat modeling, consider using the Threat Composer tool,
which aims to reduce your time-to-value when threat modeling. The tool helps you do the
following:
• Write useful threat statements aligned to threat grammar that work in a natural non-linear
workflow
• Generate a human-readable threat model
• Generate a machine-readable threat model to allow you to treat threat models as code
• Help you to quickly identify areas of quality and coverage improvement using the Insights
Dashboard
For further reference, visit Threat Composer and switch to the system-defined Example
Workspace.
Resources
Related documents:
Related videos:
Related training:
You can subscribe to an AWS Daily Feature Updates topic using Amazon Simple Notification Service
(Amazon SNS) for a comprehensive daily summary of updates. Some security services, such as
Amazon GuardDuty and AWS Security Hub, provide their own SNS topics to stay informed about
new standards, findings, and other updates for those particular services.
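Subscribing to such a topic is a short Python (boto3) call; the topic ARN and email address below are placeholders, so use the ARN published in the AWS documentation for the topic you want.

import boto3

# Subscribe an email address to an SNS topic that publishes service updates.
# The topic ARN below is a placeholder; use the ARN published in the AWS
# documentation for the Daily Feature Updates topic or a service-specific topic.
sns = boto3.client("sns", region_name="us-east-1")
sns.subscribe(
    TopicArn="arn:aws:sns:us-east-1:123456789012:aws-new-feature-updates",
    Protocol="email",
    Endpoint="[email protected]",
)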
New services and features are also announced and described in detail during conferences, events,
and webinars conducted around the globe each year. Of particular note are the annual AWS
re:Inforce security conference and the more general AWS re:Invent conference. The previously
mentioned AWS news channels share these conference announcements about security and other
services, and you can view deep dive educational breakout sessions online at the AWS Events
channel on YouTube.
You can also ask your AWS account team about the latest security service updates and
recommendations. You can reach out to your team through the Sales Support form if you do not
have their direct contact information. Similarly, if you subscribed to AWS Enterprise Support, you
will receive weekly updates from your Technical Account Manager (TAM) and can schedule a regular
review meeting with them.
Implementation steps
1. Subscribe to the various blogs and bulletins with your favorite RSS reader, or subscribe to the
Daily Feature Updates SNS topic.
2. Evaluate which AWS events to attend to learn first-hand about new features and services.
3. Set up meetings with your AWS account team for any questions about updating security services
and features.
4. Consider subscribing to Enterprise Support to have regular consultations with a Technical
Account Manager (TAM).
Resources
• PERF01-BP01 Learn about and understand available cloud services and features
• COST01-BP07 Keep up-to-date with new service releases
inadvertently disclosed or are easily guessed. Use strong sign-in mechanisms to reduce these risks
by requiring MFA and strong password policies.
Desired outcome: Reduce the risks of unintended access to credentials in AWS by using strong
sign-in mechanisms for AWS Identity and Access Management (IAM) users, the AWS account
root user, AWS IAM Identity Center (successor to AWS Single Sign-On), and third-party identity
providers. This means requiring MFA, enforcing strong password policies, and detecting anomalous
login behavior.
Common anti-patterns:
• Not enforcing a strong password policy for your identities including complex passwords and
MFA.
• Sharing the same credentials among different users.
• Not using detective controls for suspicious sign-ins.
Implementation guidance
There are many ways for human identities to sign in to AWS. It is an AWS best practice to rely on a
centralized identity provider using federation (direct federation or using AWS IAM Identity Center)
when authenticating to AWS. In that case, you should establish a secure sign-in process with your
identity provider or Microsoft Active Directory.
When you first open an AWS account, you begin with an AWS account root user. You should only
use the account root user to set up access for your users (and for tasks that require the root user).
It’s important to turn on MFA for the account root user immediately after opening your AWS
account and to secure the root user using the AWS best practice guide.
If you create users in AWS IAM Identity Center, then secure the sign-in process in that service. For
consumer identities, you can use Amazon Cognito user pools and secure the sign-in process in that
service, or by using one of the identity providers that Amazon Cognito user pools supports.
If you are using AWS Identity and Access Management (IAM) users, you would secure the sign-in
process using IAM.
Regardless of the sign-in method, it’s critical to enforce a strong sign-in policy.
Implementation steps
• Create an IAM policy to enforce MFA sign-in so that users are allowed to manage their own
passwords and MFA devices.
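The following Python snippet sketches the deny-without-MFA portion of such a policy. It is a partial, illustrative statement; the complete pattern documented by AWS also includes allow statements so users can manage their own password and MFA device.

import json

# Partial sketch: deny most actions when the request was not MFA-authenticated.
# The complete pattern also allows users to manage their own password and MFA
# device so they can bootstrap MFA; see the IAM documentation for the full policy.
deny_without_mfa = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyAllExceptSelfServiceWithoutMFA",
            "Effect": "Deny",
            "NotAction": [
                "iam:ChangePassword",
                "iam:CreateVirtualMFADevice",
                "iam:EnableMFADevice",
                "iam:ListMFADevices",
                "iam:ResyncMFADevice",
                "sts:GetSessionToken",
            ],
            "Resource": "*",
            "Condition": {
                "BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}
            },
        }
    ],
}

print(json.dumps(deny_without_mfa, indent=2))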
Resources
Related documents:
Related videos:
When doing any type of authentication, it’s best to use temporary credentials instead of long-term
credentials to reduce or eliminate risks, such as credentials being inadvertently disclosed, shared,
or stolen.
Desired outcome: To reduce the risk of long-term credentials, use temporary credentials wherever
possible for both human and machine identities. Long-term credentials create many risks,
for example, they can be uploaded in code to public GitHub repositories. By using temporary
credentials, you significantly reduce the chances of credentials becoming compromised.
done either with direct federation to each AWS account or using AWS IAM Identity Center and
the identity provider of your choice. Federation provides a number of advantages over using IAM
users in addition to eliminating long-term credentials. Your users can also request temporary
credentials from the command line for direct federation or by using IAM Identity Center. This
means that there are few use cases that require IAM users or long-term credentials for your
users.
• When granting third parties, such as software as a service (SaaS) providers, access to resources in
your AWS account, you can use cross-account roles and resource-based policies.
• If you need to grant applications for consumers or customers access to your AWS resources, you
can use Amazon Cognito identity pools or Amazon Cognito user pools to provide temporary
credentials. The permissions for the credentials are configured through IAM roles. You can also
define a separate IAM role with limited permissions for guest users who are not authenticated.
For machine identities, you might need to use long-term credentials. In these cases, you should
require workloads to use temporary credentials with IAM roles to access AWS.
• For Amazon Elastic Compute Cloud (Amazon EC2), you can use roles for Amazon EC2.
• AWS Lambda allows you to configure a Lambda execution role to grant the service permissions
to perform AWS actions using temporary credentials. There are many other similar models for
AWS services to grant temporary credentials using IAM roles.
• For IoT devices, you can use the AWS IoT Core credential provider to request temporary
credentials.
• For on-premises systems or systems that run outside of AWS that need access to AWS resources,
you can use IAM Roles Anywhere.
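Where a role is available, requesting temporary credentials is a single call. The following Python (boto3) sketch assumes a role and builds a session from the short-lived credentials it returns; the role ARN and session name are placeholders.

import boto3

sts = boto3.client("sts")

# Exchange the caller's identity for short-lived credentials scoped to a role
# (role ARN is a placeholder; the credentials expire automatically).
response = sts.assume_role(
    RoleArn="arn:aws:iam::111122223333:role/deployment-role",
    RoleSessionName="ci-deployment",
    DurationSeconds=3600,
)

credentials = response["Credentials"]
workload_session = boto3.Session(
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)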
There are scenarios where temporary credentials are not an option and you might need to use
long-term credentials. In these situations, audit and rotate credentials periodically and rotate
access keys regularly for use cases that require long-term credentials. Some examples that might
require long-term credentials include WordPress plugins and third-party AWS clients. In situations
where you must use long-term credentials, or for credentials other than AWS access keys, such
as database logins, you can use a service that is designed to handle the management of secrets,
such as AWS Secrets Manager. Secrets Manager makes it simple to manage, rotate, and securely
store encrypted secrets using supported services. For more information about rotating long-term
credentials, see rotating access keys.
• Reducing the number of long-term credentials required by replacing them with short-term
credentials when possible.
• Establishing secure storage and automated rotation of remaining long-term credentials.
• Auditing access to secrets that exist in the workload.
• Continual monitoring to verify that no secrets are embedded in source code during the
development process.
• Reduce the likelihood of credentials being inadvertently disclosed.
Common anti-patterns:
Implementation guidance
In the past, credentials used to authenticate to databases and third-party APIs, tokens, and other
secrets might have been embedded in source code or in environment files.
mechanisms to store these credentials securely, automatically rotate them, and audit their usage.
The best way to approach secrets management is to follow the guidance of remove, replace, and
rotate. The most secure credential is one that you do not have to store, manage, or handle. There
• Application and database credentials (passwords – plain text string) – Rotate: Store credentials
in AWS Secrets Manager and establish automated rotation if possible.
• Amazon RDS and Aurora admin database credentials (passwords – plain text string) – Replace:
Use the Secrets Manager integration with Amazon RDS or Amazon Aurora. In addition, some RDS
database types can use IAM roles instead of passwords for some use cases (for more detail, see
IAM database authentication).
• API tokens and keys (secret tokens – plain text string) – Rotate: Store in AWS Secrets Manager
and establish automated rotation if possible.
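For credentials that must remain long-term, such as the database passwords above, the following Python (boto3) sketch retrieves them at runtime from AWS Secrets Manager instead of embedding them; the secret name and JSON keys are placeholders.

import json
import boto3

secretsmanager = boto3.client("secretsmanager")

# Fetch the database credential at runtime instead of embedding it in code or
# configuration files (secret name is a placeholder).
secret = secretsmanager.get_secret_value(SecretId="prod/orders/db-credentials")
credentials = json.loads(secret["SecretString"])

username = credentials["username"]
password = credentials["password"]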
A common anti-pattern is embedding IAM access keys inside source code, configuration files,
or mobile apps. When an IAM access key is required to communicate with an AWS service, use
temporary (short-term) security credentials. These short-term credentials can be provided through
IAM roles for EC2 instances, execution roles for Lambda functions, Cognito IAM roles for mobile
user access, and IoT Core policies for IoT devices. When interfacing with third parties, prefer
a. Consider using a tool such as git-secrets to prevent committing new secrets to your source
code repository.
5. Monitor Secrets Manager activity for indications of unexpected usage, inappropriate secret
access, or attempts to delete secrets.
6. Reduce human exposure to credentials. Restrict access to read, write, and modify credentials to
an IAM role dedicated for this purpose, and only provide access to assume the role to a small
subset of operational users.
Resources
Related documents:
Related videos:
Related workshops:
Implementation guidance
Workforce users like employees and contractors in your organization may require access to AWS
using the AWS Management Console or AWS Command Line Interface (AWS CLI) to perform
their job functions. You can grant AWS access to your workforce users by federating from your
centralized identity provider to AWS at two levels: direct federation to each AWS account or
federating to multiple accounts in your AWS organization.
• To federate your workforce users directly with each AWS account, you can use a centralized
identity provider to federate to AWS Identity and Access Management in that account. The
flexibility of IAM allows you to enable a separate SAML 2.0 or OpenID Connect (OIDC)
Identity Provider for each AWS account and use federated user attributes for access control.
Your workforce users will use their web browser to sign in to the identity provider by providing
their credentials (such as passwords and MFA token codes). The identity provider issues a SAML
assertion to their browser that is submitted to the AWS Management Console sign in URL to
allow the user to single sign-on to the AWS Management Console by assuming an IAM Role. Your
users can also obtain temporary AWS API credentials for use in the AWS CLI or AWS SDKs from
AWS STS by assuming the IAM role using a SAML assertion from the identity provider.
• To federate your workforce users with multiple accounts in your AWS organization, you can use
AWS IAM Identity Center to centrally manage access for your workforce users to AWS accounts
and applications. You enable Identity Center for your organization and configure your identity
source. IAM Identity Center provides a default identity source directory which you can use to
manage your users and groups. Alternatively, you can choose an external identity source by
connecting to your external identity provider using SAML 2.0 and automatically provisioning
users and groups using SCIM, or connecting to your Microsoft AD Directory using AWS Directory
Service. Once an identity source is configured, you can assign access to users and groups to
AWS accounts by defining least-privilege policies in your permission sets. Your workforce users
can authenticate through your central identity provider to sign in to the AWS access portal and
use single sign-on to the AWS accounts and cloud applications assigned to them. Your users can
configure the AWS CLI v2 to authenticate with Identity Center and get credentials to run AWS CLI
commands. Identity Center also allows single sign-on access to AWS applications such as Amazon
SageMaker Studio and AWS IoT SiteWise Monitor portals.
After you follow the preceding guidance, your workforce users will no longer need to use IAM users
and groups for normal operations when managing workloads on AWS. Instead, your users and
Resources
Related documents:
• How to use customer managed policies in IAM Identity Center for advanced use cases
Related videos:
• AWS re:Inforce 2022 - AWS Identity and Access Management (IAM) deep dive
• AWS re:Invent 2022 - Simplify your existing workforce access with IAM Identity Center
Related examples:
• Workshop: Using AWS IAM Identity Center to achieve strong identity management
Related tools:
can include, but are not limited to, IAM users, AWS IAM Identity Center users, Active Directory
users, or users in a different upstream identity provider. For example, remove people that leave
the organization, and remove cross-account roles that are no longer required. Have a process
in place to periodically audit permissions to the services accessed by an IAM entity. This helps
you identify the policies you need to modify to remove any unused permissions. Use credential
reports and AWS Identity and Access Management Access Analyzer to audit IAM credentials and
permissions. You can use Amazon CloudWatch to set up alarms for specific API calls called within
your AWS environment. Amazon GuardDuty can also alert you to unexpected activity, which
might indicate overly permissive access or unintended access to IAM credentials.
• Rotate credentials regularly: When you are unable to use temporary credentials, rotate long-
term IAM access keys regularly (maximum every 90 days). If an access key is unintentionally
disclosed without your knowledge, this limits how long the credentials can be used to access
your resources. For information about rotating access keys for IAM users, see Rotating access
keys.
• Review IAM permissions: To improve the security of your AWS account, regularly review and
monitor each of your IAM policies. Verify that policies adhere to the principle of least privilege.
• Consider automating IAM resource creation and updates: IAM Identity Center automates many
IAM tasks, such as role and policy management. Alternatively, AWS CloudFormation can be used
to automate the deployment of IAM resources, including roles and policies, to reduce the chance
of human error because the templates can be verified and version controlled.
• Use IAM Roles Anywhere to replace IAM users for machine identities: IAM Roles Anywhere
allows you to use roles in areas that you traditionally could not, such as on-premises servers. IAM
Roles Anywhere uses a trusted X.509 certificate to authenticate to AWS and receive temporary
credentials. Using IAM Roles Anywhere avoids the need to rotate these credentials, as long-term
credentials are no longer stored in your on-premises environment. Please note that you will need
to monitor and rotate the X.509 certificate as it approaches expiration.
Resources
Related documents:
• Defining groups at too granular a level, creating duplication and confusion about membership.
• Using groups with duplicate permissions across subsets of resources when attributes can be used
instead.
• Not managing groups, attributes, and memberships through a standardized identity provider
integrated with your AWS environments.
Implementation guidance
AWS permissions are defined in documents called policies that are associated to a principal, such
as a user, group, role, or resource. For your workforce, this allows you to define groups based on
the function your users perform for your organization, rather than based on the resources being
accessed. For example, a WebAppDeveloper group may have a policy attached for configuring a
service such as Amazon CloudFront within a development account. An AutomationDeveloper
group may have some CloudFront permissions in common with the WebAppDeveloper group.
These permissions can be captured in a separate policy and associated to both groups, rather than
having users from both functions belong to a CloudFrontAccess group.
In addition to groups, you can use attributes to further scope access. For example, you may have a
Project attribute for users in your WebAppDeveloper group to scope access to resources specific
to their project. Using this technique removes the need to have different groups for application
developers working on different projects if their permissions are otherwise the same. The way
you refer to attributes in permission policies is based on their source, whether they are defined as
part of your federation protocol (such as SAML, OIDC, or SCIM), as custom SAML assertions, or set
within IAM Identity Center.
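The following Python snippet sketches an attribute-based policy statement of this kind; the service actions and tag key are illustrative assumptions, and the pattern simply matches the principal's Project attribute to the resource's Project tag.

import json

# Illustrative ABAC statement: allow actions only on resources whose Project
# tag matches the Project attribute (principal tag) of the signed-in user.
abac_statement = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["ec2:StartInstances", "ec2:StopInstances"],
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/Project": "${aws:PrincipalTag/Project}"
                }
            },
        }
    ],
}

print(json.dumps(abac_statement, indent=2))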
Implementation steps
Manage permissions to control access to people and machine identities that require access to AWS
and your workload. Permissions control who can access what, and under what conditions.
Best practices
Common anti-patterns:
set permissions at scale, rather than defining permissions for individual users. For example, you
can allow a group of developers access to manage only resources for their project. This way, if a
developer leaves the project, the developer’s access is automatically revoked without changing the
underlying access policies.
Desired outcome: Users should only have the permissions required to do their job. Users should
only be given access to production environments to perform a specific task within a limited
time period, and access should be revoked once that task is complete. Permissions should be
revoked when no longer needed, including when a user moves onto a different project or job
function. Administrator privileges should be given only to a small group of trusted administrators.
Permissions should be reviewed regularly to avoid permission creep. Machine or system accounts
should be given the smallest set of permissions needed to complete their tasks.
Common anti-patterns:
Implementation guidance
The principle of least privilege states that identities should only be permitted to perform the
smallest set of actions necessary to fulfill a specific task. This balances usability, efficiency, and
security. Operating under this principle helps limit unintended access and helps track who has
access to what resources. IAM users and roles have no permissions by default. The root user has full
access by default and should be tightly controlled, monitored, and used only for tasks that require
root access.
IAM policies are used to explicitly grant permissions to IAM roles or specific resources. For example,
identity-based policies can be attached to IAM groups, while S3 buckets can be controlled by
resource-based policies.
When creating an IAM policy, you can specify the service actions, resources, and conditions that
must be true for AWS to allow or deny access. AWS supports a variety of conditions to help you
scope down access. For example, by using the PrincipalOrgID condition key, you can deny
actions if the requestor isn’t a part of your AWS Organization.
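As a brief sketch, the following Python snippet builds a resource-based policy (for example, for an Amazon S3 bucket) that uses the PrincipalOrgID condition key to deny requests from principals outside your organization; the bucket name and organization ID are placeholders.

import json

# Illustrative S3 bucket policy: deny any request made by a principal that is
# not part of the organization (organization ID and bucket name are placeholders).
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::example-bucket",
                "arn:aws:s3:::example-bucket/*",
            ],
            "Condition": {
                "StringNotEquals": {"aws:PrincipalOrgID": "o-exampleorgid"}
            },
        }
    ],
}

print(json.dumps(bucket_policy, indent=2))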
environments. Using these tags, you can restrict developers to the development environment.
By combining tagging and permissions policies, you can achieve fine-grained resource access
without needing to define complicated, custom policies for every job function.
• Use service control policies for AWS Organizations. Service control policies centrally control
the maximum available permissions for member accounts in your organization. Importantly,
service control policies allow you to restrict root user permissions in member accounts. Also
consider using AWS Control Tower, which provides prescriptive managed controls that enrich
AWS Organizations. You can also define your own controls within Control Tower.
• Establish a user lifecycle policy for your organization: User lifecycle policies define tasks to
perform when users are onboarded onto AWS, change job role or scope, or no longer need access
to AWS. Permission reviews should be done during each step of a user’s lifecycle to verify that
permissions are properly restrictive and to avoid permissions creep.
• Establish a regular schedule to review permissions and remove any unneeded permissions:
You should regularly review user access to verify that users do not have overly permissive access.
AWS Config and IAM Access Analyzer can help when auditing user permissions.
• Establish a job role matrix: A job role matrix visualizes the various roles and access levels
required within your AWS footprint. Using a job role matrix, you can define and separate
permissions based on user responsibilities within your organization. Use groups instead of
applying permissions directly to individual users or roles.
Resources
Related documents:
• You have defined and documented the failure modes that count as an emergency: consider
your normal circumstances and the systems your users depend on to manage their workloads.
Consider how each of these dependencies can fail and cause an emergency situation. You may
find the questions and best practices in the Reliability pillar useful to identify failure modes and
architect more resilient systems to minimize the likelihood of failures.
• You have documented the steps that must be followed to confirm a failure as an emergency. For
example, you can require your identity administrators to check the status of your primary and
standby identity providers and, if both are unavailable, declare an emergency event for identity
provider failure.
• You have defined an emergency access process specific to each type of emergency or failure
mode. Being specific can reduce the temptation on the part of your users to overuse a
general process for all types of emergencies. Your emergency access processes should describe the
circumstances under which each process is to be used and, conversely, the situations where the
process should not be used, pointing instead to alternate processes that may apply.
• Your processes are well-documented with detailed instructions and playbooks that can be
followed quickly and efficiently. Remember that an emergency event can be a stressful time for
your users and they may be under extreme time pressure, so design your process to be as simple
as possible.
Common anti-patterns:
• You do not have well-documented and well-tested emergency access processes. Your users are
unprepared for an emergency and follow improvised processes when an emergency event arises.
• Your emergency access processes depend on the same systems (such as a centralized identity
provider) as your normal access mechanisms. This means that the failure of such a system may
impact both your normal and emergency access mechanisms and impair your ability to recover
from the failure.
• Your emergency access processes are used in non-emergency situations. For example, your users
frequently misuse emergency access processes as they find it easier to make changes directly
than submit changes through a pipeline.
• Your emergency access processes do not generate sufficient logs to audit the processes, or the
logs are not monitored to alert for potential misuse of the processes.
• Tie each use of an emergency access process to a corresponding emergency event tracked in your incident management system. Having a uniform system for emergency access allows you to track those requests in a single system, analyze usage trends, and improve your processes.
• Verify that your emergency access processes can only be initiated by authorized users and
require approvals from the user's peers or management as appropriate. The approval process
should operate effectively both inside and outside business hours. Define how approval requests fall back to secondary approvers if the primary approvers are unavailable, and how they escalate up your management chain until approved.
• Verify that the process generates detailed audit logs and events for both successful and failed
attempts to gain emergency access. Monitor both the request process and the emergency
access mechanism to detect misuse or unauthorized accesses. Correlate activity with ongoing
emergency events from your incident management system and alert when actions happen
outside of expected time periods. For example, you should monitor and alert on activity in the
emergency access AWS account, as it should never be used in normal operations.
• Test emergency access processes periodically to verify that the steps are clear and grant the
correct level of access quickly and efficiently. Your emergency access processes should be tested
as part of incident response simulations (SEC10-BP07) and disaster recovery tests (REL13-BP03).
In the unlikely event that your centralized identity provider is unavailable, your workforce users
can't federate to AWS accounts or manage their workloads. In this emergency event, you can
provide an emergency access process for a small set of administrators to access AWS accounts to
perform critical tasks that cannot wait until your centralized identity providers are back online.
For example, your identity provider is unavailable for 4 hours and during that period you need
to modify the upper limits of an Amazon EC2 Auto Scaling group in a Production account to
handle an unexpected spike in customer traffic. Your emergency administrators should follow the
emergency access process to gain access to the specific production AWS account and make the
necessary changes.
To allow your workforce users to federate to AWS accounts, you can configure IAM Identity Center with an external identity provider or create an IAM identity provider (SEC02-BP04). Typically, you configure these by importing a SAML metadata XML document provided by your identity provider. The metadata XML document includes an X.509 certificate corresponding to a private key that the identity provider uses to sign its SAML assertions.
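The following sketch shows one way to import and later refresh that SAML metadata with boto3. The file path and provider name are assumptions; if you use IAM Identity Center rather than an IAM identity provider, the equivalent update is performed through the Identity Center external identity provider settings instead.

```python
import boto3

iam = boto3.client("iam")

# Read the metadata XML exported from your identity provider (path is assumed).
with open("idp-metadata.xml") as f:
    metadata = f.read()

# Create an IAM SAML identity provider from the metadata document.
resp = iam.create_saml_provider(
    SAMLMetadataDocument=metadata,
    Name="workforce-idp",  # hypothetical provider name
)

# Later, if the identity provider rotates its signing certificate, import the
# refreshed metadata to update the existing provider.
iam.update_saml_provider(
    SAMLMetadataDocument=metadata,
    SAMLProviderArn=resp["SAMLProviderArn"],
)
```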
In such an emergency event, you can provide your identity administrators access to AWS to fix the
federation issues. For example, your identity administrator uses the emergency access process to
sign in to the emergency access AWS account, switch to a role in the Identity Center administrator account, and update the external identity provider configuration by importing the latest SAML metadata XML document from your identity provider to re-enable federation. Once federation
is fixed, your workforce users continue to use the normal operating process to federate into their
workload accounts.
You can follow the approaches detailed in the previous Failure Mode 1 to create an emergency
access process. You can grant least-privilege permissions to your identity administrators to access
only the Identity Center administrator account and perform actions on Identity Center in that
account.
In the unlikely event of an IAM Identity Center or AWS Region disruption, we recommend that
you set up a configuration that you can use to provide temporary access to the AWS Management
Console.
The emergency access process uses direct federation from your identity provider to IAM in an
emergency account. For detail on the process and design considerations, see Set up emergency
access to the AWS Management Console.
Implementation steps
• Create IAM roles corresponding to the emergency operations groups in the emergency access
account.
Resources
Related documents:
• Enabling SAML 2.0 federated users to access the AWS Management Console
Related videos:
• AWS re:Invent 2022 - Simplify your existing workforce access with IAM Identity Center
• AWS re:Inforce 2022 - AWS Identity and Access Management (IAM) deep dive
Related examples:
As your teams determine what access is required, remove unneeded permissions and establish
review processes to achieve least privilege permissions. Continually monitor and remove unused
identities and permissions for both human and machine access.
• Determine an acceptable timeframe and usage policy for IAM users and roles: Use the
last accessed timestamp to identify unused users and roles and remove them. Review service
and action last accessed information to identify and scope permissions for specific users and
roles. For example, you can use last accessed information to identify the specific Amazon
S3 actions that your application role requires and restrict the role’s access to only those
actions. Last accessed information is available in the AWS Management Console and programmatically, so you can incorporate it into your infrastructure workflows and automated tools (a sketch follows this list).
• Consider logging data events in AWS CloudTrail: By default, CloudTrail does not log data
events such as Amazon S3 object-level activity (for example, GetObject and DeleteObject)
or Amazon DynamoDB table activities (for example, PutItem and DeleteItem). Consider logging these data events to determine which users and roles need access to specific Amazon S3 objects or DynamoDB table items.
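Both of the practices above can be scripted. The following sketch uses IAM last accessed information to flag services a role has never used, and adds Amazon S3 data events to an existing CloudTrail trail. The role ARN, trail name, and bucket ARN are placeholders.

```python
import time
import boto3

iam = boto3.client("iam")
cloudtrail = boto3.client("cloudtrail")

# 1) Flag services that a role has never used, based on last accessed data.
role_arn = "arn:aws:iam::111122223333:role/application-role"  # assumed role
job_id = iam.generate_service_last_accessed_details(Arn=role_arn)["JobId"]

details = iam.get_service_last_accessed_details(JobId=job_id)
while details["JobStatus"] == "IN_PROGRESS":
    time.sleep(2)
    details = iam.get_service_last_accessed_details(JobId=job_id)

for svc in details["ServicesLastAccessed"]:
    if "LastAuthenticated" not in svc:
        print(f"Candidate permission to remove: {svc['ServiceNamespace']}")

# 2) Log S3 object-level data events on an existing trail to see which
#    objects users and roles actually access.
cloudtrail.put_event_selectors(
    TrailName="management-events-trail",  # assumed trail name
    EventSelectors=[
        {
            "ReadWriteType": "All",
            "IncludeManagementEvents": True,
            "DataResources": [
                {"Type": "AWS::S3::Object", "Values": ["arn:aws:s3:::example-data-bucket/"]}
            ],
        }
    ],
)
```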
Resources
Related documents:
Related videos:
• AWS re:Inforce 2022 - AWS Identity and Access Management (IAM) deep dive
additional layers are applied. This helps you grant access based on the principle of least privilege,
reducing the risk of unintended access due to policy misconfiguration.
The first step to establish permission guardrails is to isolate your workloads and environments into
separate AWS accounts. Principals from one account cannot access resources in another account
without explicit permission to do so, even when both accounts are in the same AWS organization
or under the same organizational unit (OU). You can use OUs to group accounts you want to
administer as a single unit.
The next step is to reduce the maximum set of permissions that you can grant to principals within
the member accounts of your organization. You can use service control policies (SCPs) for this
purpose, which you can apply to either an OU or an account. SCPs can enforce common access controls, such as restricting access to specific AWS Regions, preventing the deletion of resources, or disabling potentially risky service actions. SCPs that you apply to the root of your
organization only affect its member accounts, not the management account. SCPs only govern the
principals within your organization. Your SCPs don't govern principals outside your organization
that are accessing your resources.
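As a sketch of how such a guardrail can be codified, the following uses boto3 to create and attach a service control policy that denies actions outside approved Regions. The Region list, policy name, and OU ID are assumptions; adjust the NotAction exemptions to the global services your organization relies on.

```python
import json
import boto3

org = boto3.client("organizations")

# Example SCP: deny actions outside approved Regions, exempting global services.
scp = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyOutsideApprovedRegions",
            "Effect": "Deny",
            "NotAction": ["iam:*", "organizations:*", "sts:*", "support:*"],
            "Resource": "*",
            "Condition": {
                "StringNotEquals": {"aws:RequestedRegion": ["eu-west-1", "eu-central-1"]}
            },
        }
    ],
}

policy = org.create_policy(
    Content=json.dumps(scp),
    Description="Restrict member accounts to approved Regions",
    Name="approved-regions-scp",
    Type="SERVICE_CONTROL_POLICY",
)

org.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId="ou-examplerootid-exampleouid",  # assumed OU ID
)
```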
A further step is to use IAM resource policies to scope the available actions that you can take on
the resources they govern, along with any conditions that the acting principal must meet. This
can be as broad as allowing all actions so long as the principal is part of your organization (using
the aws:PrincipalOrgID condition key), or as granular as only allowing specific actions by a specific IAM
role. You can take a similar approach with conditions in IAM role trust policies. If a resource or role
trust policy explicitly names a principal in the same account as the role or resource it governs, that
principal does not need an attached IAM policy that grants the same permissions. If the principal is
in a different account from the resource, then the principal does need an attached IAM policy that
grants those permissions.
Often, a workload team will want to manage the permissions their workload requires. This may
require them to create new IAM roles and permission policies. You can capture the maximum scope
of permissions the team is allowed to grant in an IAM permission boundary, and associate this
document to an IAM role the team can then use to manage their IAM roles and permissions. This
approach can provide them the ability to complete their work while mitigating risks of having IAM
administrative access.
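A minimal sketch of this pattern with boto3 is shown below, assuming a hypothetical boundary policy ARN and role name. The delegated team's own permissions would additionally be restricted (for example, with an iam:PermissionsBoundary condition) so that roles can only be created with this boundary attached.

```python
import json
import boto3

iam = boto3.client("iam")

# Customer managed policy capturing the maximum permissions the team may grant.
boundary_arn = "arn:aws:iam::111122223333:policy/workload-permissions-boundary"  # assumed

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Any role the team creates carries the boundary, capping its effective permissions.
iam.create_role(
    RoleName="team-app-role",  # hypothetical role
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    PermissionsBoundary=boundary_arn,
)
```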
A more granular step is to implement privileged access management (PAM) and temporary
elevated access management (TEAM) techniques. One example of PAM is to require principals to
perform multi-factor authentication before taking privileged actions. For more information, see
Related tools:
Monitor and adjust the permissions granted to your principals (users, roles, and groups) throughout
their lifecycle within your organization. Adjust group memberships as users change roles, and
remove access when a user leaves the organization.
Desired outcome: You monitor and adjust permissions throughout the lifecycle of principals within
the organization, reducing risk of unnecessary privileges. You grant appropriate access when you
create a user. You modify access as the user's responsibilities change, and you remove access when
the user is no longer active or has left the organization. You centrally manage changes to your
users, roles, and groups. You use automation to propagate changes to your AWS environments.
Common anti-patterns:
• You grant excessive or broad access privileges to identities upfront beyond what is initially
required.
• You don't review and adjust access privileges as the roles and responsibilities of identities change
over time.
• You leave inactive or terminated identities with active access privileges. This increases the risk of
unauthorized access.
• You don't automate the management of identity lifecycles.
Implementation guidance
Carefully manage and adjust access privileges that you grant to identities (such as users, roles,
groups) throughout their lifecycle. This lifecycle includes the initial onboarding phase, ongoing
changes in roles and responsibilities, and eventual offboarding or termination. Proactively manage
access based on the stage of the lifecycle to maintain the appropriate access level. Adhere to the
principle of least privilege to reduce the risk of excessive or unnecessary access privileges.
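One way to support these lifecycle reviews is to periodically check the IAM credential report for identities that appear inactive. The following sketch flags users whose console passwords show no recorded use; the check and any offboarding action are assumptions you would adapt to your own lifecycle policy.

```python
import csv
import io
import time
import boto3

iam = boto3.client("iam")

# Generate the account credential report, then flag users whose console
# passwords appear unused.
while iam.generate_credential_report()["State"] != "COMPLETE":
    time.sleep(2)

report = iam.get_credential_report()["Content"].decode("utf-8")
for row in csv.DictReader(io.StringIO(report)):
    if row["password_enabled"] == "true" and row["password_last_used"] in ("no_information", "N/A"):
        print(f"Review console access for: {row['user']}")
```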
Resources
Related documents:
Related videos:
• AWS re:Inforce 2023 - Manage temporary elevated access with AWS IAM Identity Center
• AWS re:Invent 2022 - Simplify your existing workforce access with IAM Identity Center
• AWS re:Invent 2022 - Harness power of IAM policies & rein in permissions w/Access Analyzer
Continually monitor findings that highlight public and cross-account access. Reduce public access
and cross-account access to only the specific resources that require this access.
Desired outcome: Know which of your AWS resources are shared and with whom. Continually
monitor and audit your shared resources to verify they are shared with only authorized principals.
Common anti-patterns:
Implementation guidance
If your account is in AWS Organizations, you can grant access to resources to the entire organization, specific organizational units, or individual accounts. If your account is not a member of an organization, you can share resources with individual accounts. You can configure alerts to notify you if Amazon S3 Block Public Access is turned off and if Amazon S3 buckets become public. Additionally, if you are using
AWS Organizations, you can create a service control policy that prevents changes to Amazon
S3 public access policies. AWS Trusted Advisor checks for Amazon S3 buckets that have open
access permissions. Bucket permissions that grant upload or delete access to everyone create
potential security issues by allowing anyone to add, modify, or remove items in a bucket. The
Trusted Advisor check examines explicit bucket permissions and associated bucket policies that
might override the bucket permissions. You also can use AWS Config to monitor your Amazon S3
buckets for public access. For more information, see How to Use AWS Config to Monitor for and
Respond to Amazon S3 Buckets Allowing Public Access. While reviewing access, it’s important to
consider what types of data are contained in Amazon S3 buckets. Amazon Macie helps discover
and protect sensitive data, such as PII, PHI, and credentials, such as private or AWS keys.
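As one example of the AWS Config monitoring described above, the following sketch deploys two AWS managed Config rules that flag Amazon S3 buckets allowing public read or write access. It assumes an AWS Config recorder is already running in the account.

```python
import boto3

config = boto3.client("config")

# Deploy AWS managed rules that flag S3 buckets allowing public read or write.
for rule_name, identifier in [
    ("s3-bucket-public-read-prohibited", "S3_BUCKET_PUBLIC_READ_PROHIBITED"),
    ("s3-bucket-public-write-prohibited", "S3_BUCKET_PUBLIC_WRITE_PROHIBITED"),
]:
    config.put_config_rule(
        ConfigRule={
            "ConfigRuleName": rule_name,
            "Source": {"Owner": "AWS", "SourceIdentifier": identifier},
        }
    )
```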
Resources
Related documents:
Related videos:
As the number of workloads grows, you might need to share access to resources in those workloads or provision the resources multiple times across multiple accounts. Rather than re-creating the same resource in each account, you can create it once and share it. You can use AWS Resource Access Manager (AWS RAM) to share common resources, such as VPC subnets, AWS Transit Gateway attachments, AWS Network Firewall, or Amazon SageMaker pipelines.
To restrict your account to only share resources within your organization, use service control
policies (SCPs) to prevent access to external principals. When sharing resources, combine identity-
based controls and network controls to create a data perimeter for your organization to help
protect against unintended access. A data perimeter is a set of preventive guardrails to help verify
that only your trusted identities are accessing trusted resources from expected networks. These
controls place appropriate limits on what resources can be shared and prevent sharing or exposing
resources that should not be allowed. For example, as a part of your data perimeter, you can use
VPC endpoint policies and the aws:PrincipalOrgID condition to verify that the identities accessing your Amazon S3 buckets belong to your organization. It is important to note that SCPs do not
apply to service-linked roles or AWS service principals.
When using Amazon S3, turn off ACLs for your Amazon S3 bucket and use IAM policies to define
access control. For restricting access to an Amazon S3 origin from Amazon CloudFront, migrate
from origin access identity (OAI) to origin access control (OAC) which supports additional features
including server-side encryption with AWS Key Management Service.
In some cases, you might want to allow sharing resources outside of your organization or grant a
third party access to your resources. For prescriptive guidance on managing permissions to share
resources externally, see Permissions management.
Implementation steps
1. Use AWS Organizations.
AWS Organizations is an account management service that allows you to consolidate multiple
AWS accounts into an organization that you create and centrally manage. You can group your
accounts into organizational units (OUs) and attach different policies to each OU to help you
meet your budgetary, security, and compliance needs. You can also control how AWS artificial
intelligence (AI) and machine learning (ML) services can collect and store data, and use the
multi-account management of the AWS services integrated with Organizations.
2. Integrate AWS Organizations with AWS services.
When you use an AWS service to perform tasks on your behalf in the member accounts of your
organization, AWS Organizations creates an IAM service-linked role (SLR) for that service in each
member account. You should manage trusted access using the AWS Management Console, the AWS APIs, or the AWS CLI.
3. Share resources securely with AWS RAM.
AWS RAM helps you securely share the resources that you have created with roles and users in
your account and with other AWS accounts. In a multi-account environment, AWS RAM allows
you to create a resource once and share it with other accounts. This approach helps reduce
your operational overhead while providing consistency, visibility, and auditability through
integrations with Amazon CloudWatch and AWS CloudTrail, which you do not receive when
using cross-account access.
If you have resources that you shared previously using a resource-based policy, you can use the
PromoteResourceShareCreatedFromPolicy API or an equivalent to promote the resource
share to a full AWS RAM resource share.
In some cases, you might need to take additional steps to share resources. For example, to share an encrypted snapshot, you need to share an AWS KMS key. A sketch of creating and promoting a resource share follows these steps.
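The sketch below illustrates both operations with boto3: creating a resource share that is restricted to your organization, and promoting a share that AWS RAM derived from an existing resource-based policy. The subnet, organization, and resource share ARNs are placeholders.

```python
import boto3

ram = boto3.client("ram")

# Create a resource share restricted to principals inside your organization.
share = ram.create_resource_share(
    name="shared-network",
    resourceArns=["arn:aws:ec2:us-east-1:111122223333:subnet/subnet-0example"],
    principals=["arn:aws:organizations::111122223333:organization/o-exampleorgid"],
    allowExternalPrincipals=False,
)

# Promote a share that AWS RAM created from an existing resource-based policy
# into a full resource share.
ram.promote_resource_share_created_from_policy(
    resourceShareArn="arn:aws:ram:us-east-1:111122223333:resource-share/EXAMPLE"
)
```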
Resources
Related documents:
Related videos:
condition. When using an external ID, you or the third party can generate a unique ID for each
customer, third party, or tenancy. The unique ID should not be controlled by anyone but you after
it’s created. The third party must implement a process to relate the external ID to the customer in a
secure, auditable, and reproducible manner.
You can also use IAM Roles Anywhere to manage IAM roles for applications outside of AWS that use
AWS APIs.
If the third party no longer requires access to your environment, remove the role. Avoid providing
long-term credentials to a third party. Maintain awareness of other AWS services that support
sharing. For example, the AWS Well-Architected Tool allows sharing a workload with other AWS
accounts, and AWS Resource Access Manager helps you securely share an AWS resource you own
with other accounts.
Implementation steps
1. Use cross-account roles to provide access to external accounts.
Cross-account roles reduce the amount of sensitive information that is stored by external
accounts and third parties for servicing their customers. Cross-account roles allow you to grant
access to AWS resources in your account securely to a third party, such as AWS Partners or other
accounts in your organization, while maintaining the ability to manage and audit that access.
The third party might be providing service to you from a hybrid infrastructure or alternatively
pulling data into an offsite location. IAM Roles Anywhere helps you allow third party workloads
to securely interact with your AWS workloads and further reduce the need for long-term
credentials.
You should not use long-term credentials, or access keys associated with users, to provide
external account access. Instead, use cross-account roles to provide the cross-account access.
2. Use an external ID with third parties.
Using an external ID allows you to designate who can assume a role in an IAM trust policy.
The trust policy can require that the user assuming the role assert the condition and target in
which they are operating. It also provides a way for the account owner to permit the role to be
assumed only under specific circumstances. The primary function of the external ID is to address
and prevent the confused deputy problem.
Use an external ID if you are an AWS account owner and you have configured a role for a third party that accesses other AWS accounts in addition to yours, or when you are in the position of assuming roles on behalf of different customers (a trust policy sketch appears after these steps).
The third party should provide an automated, auditable setup mechanism. However, you should also automate the setup of the role on your side by using the role policy document that outlines the access needed. Using an AWS CloudFormation template or equivalent, monitor for changes with drift detection as part of your audit practice.
6. Account for changes.
Your account structure, your need for the third party, or their service offering being provided
might change. You should anticipate changes and failures, and plan accordingly with the right
people, process, and technology. Audit the level of access you provide on a periodic basis, and
implement detection methods to alert you to unexpected changes. Monitor and audit the use
of the role and the datastore of the external IDs. You should be prepared to revoke third-party
access, either temporarily or permanently, as a result of unexpected changes or access patterns.
Also, measure the impact of your revocation operation, including the time it takes to perform, the people involved, the cost, and the impact on other resources.
For prescriptive guidance on detection methods, see the Detection best practices.
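To make step 2 concrete, the following sketch creates a cross-account role whose trust policy requires the agreed external ID. The third-party account ID, external ID value, and role name are placeholders; permissions policies granting the actual access would be attached separately.

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the third party's account assume the role only when it
# presents the agreed external ID.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::444455556666:root"},  # assumed third-party account
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": "unique-customer-id-12345"}},
        }
    ],
}

iam.create_role(
    RoleName="third-party-access-role",  # hypothetical role
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
```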
Resources
Related documents:
Retain security event logs from services and applications. This is a fundamental principle of
security for audit, investigations, and operational use cases, and a common security requirement
driven by governance, risk, and compliance (GRC) standards, policies, and procedures.
Desired outcome: An organization should be able to reliably and consistently retrieve security
event logs from AWS services and applications in a timely manner when required to fulfill an
internal process or obligation, such as a security incident response. Consider centralizing logs for
better operational results.
Common anti-patterns:
Benefits of establishing this best practice: You have a root cause analysis (RCA) mechanism for security incidents and a source of evidence for your governance, risk, and compliance obligations.
Implementation guidance
During a security investigation or other use cases based on your requirements, you need to be able
to review relevant logs to record and understand the full scope and timeline of the incident. Logs
are also required for alert generation, indicating that certain actions of interest have happened. It
is critical to select, turn on, store, and set up querying and retrieval mechanisms and alerting.
Implementation steps
• Select and use log sources. Ahead of a security investigation, you need to capture relevant
logs to retroactively reconstruct activity in an AWS account. Select log sources relevant to your
workloads.
The log source selection criteria should be based on the use cases required by your business.
Establish a trail for each AWS account using AWS CloudTrail or an AWS Organizations trail, and configure an Amazon S3 bucket for it (a configuration sketch follows this list).
An Amazon S3 bucket provides cost-effective, durable storage with an optional lifecycle policy.
Logs stored in Amazon S3 buckets can be queried using services such as Amazon Athena.
A CloudWatch log group provides durable storage and a built-in query facility through
CloudWatch Logs Insights.
• Identify appropriate log retention: When you use an Amazon S3 bucket or CloudWatch log
group to store logs, you must establish adequate lifecycles for each log source to optimize
storage and retrieval costs. Customers generally keep between three months and one year of logs
readily available for querying, with retention of up to seven years. The choice of availability and
retention should align with your security requirements and a composite of statutory, regulatory,
and business mandates.
• Use logging for each AWS service and application with proper retention and lifecycle
policies: For each AWS service or application in your organization, look for the specific logging
configuration guidance:
• Configure AWS CloudTrail Trail
• Configure VPC Flow Logs
• Configure Amazon GuardDuty Finding Export
• Configure AWS Config recording
• Configure AWS WAF web ACL traffic
• Configure AWS Network Firewall network traffic logs
• Configure Elastic Load Balancing access logs
• Configure Amazon Route 53 resolver query logs
• Configure Amazon RDS logs
• Configure Amazon EKS Control Plane logs
• Configure Amazon CloudWatch agent for Amazon EC2 instances and on-premises servers
• Select and implement querying mechanisms for logs: For log queries, you can use CloudWatch
Logs Insights for data stored in CloudWatch log groups, and Amazon Athena and Amazon
OpenSearch Service for data stored in Amazon S3. You can also use third-party querying tools
such as a security information and event management (SIEM) service.
The process for selecting a log querying tool should consider the people, process, and
technology aspects of your security operations. Select a tool that fulfills operational, business,
and security requirements, and is both accessible and maintainable in the long term. Keep in mind that log querying tools work optimally when the number of logs to be scanned is kept within the tool's limits.
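The sketch below, referenced in the steps above, shows one way to establish an organization trail and apply retention to the log stores with boto3. The trail name, bucket names, log group name, and retention periods are assumptions; align them with your own statutory and business mandates.

```python
import boto3

cloudtrail = boto3.client("cloudtrail")
logs = boto3.client("logs")
s3 = boto3.client("s3")

# Multi-Region organization trail delivering to a central bucket that already
# exists with an appropriate bucket policy.
cloudtrail.create_trail(
    Name="org-trail",
    S3BucketName="central-cloudtrail-logs",
    IsMultiRegionTrail=True,
    IsOrganizationTrail=True,
    EnableLogFileValidation=True,
)
cloudtrail.start_logging(Name="org-trail")

# Keep an application log group for one year.
logs.put_retention_policy(logGroupName="/workload/app", retentionInDays=365)

# Transition S3-stored logs to Glacier after 90 days and expire them after
# roughly seven years.
s3.put_bucket_lifecycle_configuration(
    Bucket="central-cloudtrail-logs",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "log-retention",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 2555},
            }
        ]
    },
)
```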
Related videos:
Related examples:
Related tools:
Security teams rely on logs and findings to analyze events that may indicate unauthorized
activity or unintentional changes. To streamline this analysis, capture security logs and findings
in standardized locations. This makes data points of interest available for correlation and can
simplify tool integrations.
Desired outcome: You have a standardized approach to collect, analyze, and visualize log data,
findings, and metrics. Security teams can efficiently correlate, analyze, and visualize security data
across disparate systems to discover potential security events and identify anomalies. Security
information and event management (SIEM) systems or other mechanisms are integrated to query
and analyze log data for timely responses, tracking, and escalation of security events.
Common anti-patterns:
• Teams independently own and manage logging and metrics collection in ways that are inconsistent with the organization's logging strategy.
• Teams don't have adequate access controls to restrict visibility and alteration of the data
collected.
Amazon Security Lake can centralize findings from AWS services, such as Amazon GuardDuty and Amazon Inspector, along with your log data. You can also use third-party data source integrations, or configure custom data sources. All integrations standardize your data into the Open Cybersecurity Schema Framework (OCSF) format, and the data is stored in Amazon S3 buckets as Parquet files, eliminating the need for ETL processing.
Storing security data in standardized locations provides advanced analytics capabilities. AWS
recommends you deploy tools for security analytics that operate in an AWS environment into a
Security Tooling account that is separate from your Log Archive account. This approach allows
you to implement controls at depth to protect the integrity and availability of the logs and log
management process, distinct from the tools that access them. Consider using services, such as
Amazon Athena, to run on-demand queries that correlate multiple data sources. You can also
integrate visualization tools, such as Amazon QuickSight. AI-powered solutions are becoming
increasingly available and can perform functions such as translating findings into human-readable
summaries and natural language interaction. These solutions are often more readily integrated by
having a standardized data storage location for querying.
Implementation steps
Unexpected activity can generate multiple security alerts by different sources, requiring further
correlation and enrichment to understand the full context. Implement automated correlation and
enrichment of security alerts to help achieve more accurate incident identification and response.
Desired outcome: As activity generates different alerts within your workloads and environments,
automated mechanisms correlate data and enrich that data with additional information. This pre-
processing presents a more detailed understanding of the event, which helps your investigators
determine the criticality of the event and if it constitutes an incident that requires formal response.
This process reduces the load on your monitoring and investigation teams.
Common anti-patterns:
• Different groups of people investigate findings and alerts generated by different systems, unless
otherwise mandated by separation of duty requirements.
• Your organization funnels all security finding and alert data to standard locations, but requires
investigators to perform manual correlation and enrichment.
• You rely solely on the intelligence of threat detection systems to report on findings and establish
criticality.
Benefits of establishing this best practice: Automated correlation and enrichment of alerts helps
to reduce the overall cognitive load and manual data preparation required of your investigators.
This practice can reduce the time it takes to determine if the event represents an incident and
initiate a formal response. Additional context also helps you accurately assess the true severity of
an event, as it may be higher or lower than what any one alert suggests.
Implementation guidance
Security alerts can come from many different sources within AWS, including:
3. Identify sources for data correlation and enrichment. Example sources include CloudTrail, VPC
Flow Logs, Amazon Security Lake, and infrastructure and application logs.
4. Integrate your alerts with your data correlation and enrichment sources to create more detailed
security event contexts and establish criticality.
a. Amazon Detective, SIEM tooling, or other third-party solutions can perform a certain level of
ingestion, correlation, and enrichment automatically.
b. You can also use AWS services to build your own. For example, you can invoke an AWS
Lambda function to run an Amazon Athena query against AWS CloudTrail or Amazon Security
Lake, and publish the results to EventBridge.
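A minimal sketch of step 4b follows. It assumes an Athena table over CloudTrail data, an Athena workgroup, a results bucket, and a custom event bus, all of which are hypothetical names; production code would also poll the query status rather than sleep.

```python
import json
import time
import boto3

athena = boto3.client("athena")
events = boto3.client("events")

def handler(event, context):
    # Pull the acting principal from the incoming alert; the event shape
    # depends on your alert source and is an assumption here.
    principal = event.get("detail", {}).get("userIdentity", {}).get("arn", "unknown")

    # Correlate: look up recent CloudTrail activity for the same principal.
    query = athena.start_query_execution(
        QueryString=(
            "SELECT eventname, sourceipaddress, eventtime "
            "FROM cloudtrail_logs "
            f"WHERE useridentity.arn = '{principal}' "
            "ORDER BY eventtime DESC LIMIT 50"
        ),
        WorkGroup="security-analytics",
        ResultConfiguration={"OutputLocation": "s3://athena-results-bucket/"},
    )

    # Simplified wait; production code should poll get_query_execution instead.
    time.sleep(10)
    rows = athena.get_query_results(QueryExecutionId=query["QueryExecutionId"])

    # Enrich: publish a summarized context event for downstream triage.
    events.put_events(
        Entries=[
            {
                "Source": "custom.security.enrichment",
                "DetailType": "EnrichedFinding",
                "Detail": json.dumps(
                    {"principal": principal, "recentEvents": len(rows["ResultSet"]["Rows"]) - 1}
                ),
                "EventBusName": "security-events",
            }
        ]
    )
```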
Resources
Related documents:
Related examples:
• How to use AWS Security Hub and Amazon OpenSearch Service for SIEM
Related tools:
• Amazon Detective
• Amazon EventBridge
• AWS Lambda
• Amazon Athena
Implementation guidance
As described in SEC01-BP03 Identify and validate control objectives, services such as AWS Config
can help you monitor the configuration of resources in your accounts for adherence to your
requirements. When non-compliant resources are detected, we recommend that you configure
sending alerts to a cloud security posture management (CSPM) solution, such as AWS Security Hub,
to help with remediation. These solutions provide a central place for your security investigators to
monitor for issues and take corrective action.
While some non-compliant resource situations are unique and require human judgment to
remediate, other situations have a standard response that you can define programmatically. For
example, a standard response to a misconfigured VPC security group could be to remove the
disallowed rules and notify the owner. Responses can be defined in AWS Lambda functions, AWS
Systems Manager Automation documents, or through other code environments you prefer. Make
sure the environment is able to authenticate to AWS using an IAM role with the least amount of
permission needed to take corrective action.
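As one sketch of such a programmatic response, the following Lambda handler removes a world-open SSH rule from a reported security group and notifies the owning team. The event shape, port, and SNS topic are assumptions; adapt the parsing to your actual finding source.

```python
import boto3

ec2 = boto3.client("ec2")
sns = boto3.client("sns")

DISALLOWED_PORT = 22  # assumed disallowed rule
TOPIC_ARN = "arn:aws:sns:us-east-1:111122223333:security-notifications"  # assumed topic

def handler(event, context):
    # The finding payload shape depends on your detection source; this parsing
    # is a placeholder.
    group_id = event["detail"]["resource"]["id"]

    # Remove the 0.0.0.0/0 ingress rule for the disallowed port.
    ec2.revoke_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[
            {
                "IpProtocol": "tcp",
                "FromPort": DISALLOWED_PORT,
                "ToPort": DISALLOWED_PORT,
                "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
            }
        ],
    )

    # Notify the owning team about the corrective action taken.
    sns.publish(
        TopicArn=TOPIC_ARN,
        Subject="Security group remediated",
        Message=f"Removed 0.0.0.0/0 port {DISALLOWED_PORT} ingress from {group_id}.",
    )
```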
Once you define the desired remediation, you can then determine your preferred means for
initiating it. AWS Config can initiate remediations for you. If you are using Security Hub, you can
do this through custom actions, which publishes the finding information to Amazon EventBridge.
An EventBridge rule can then initiate your remediation. You can configure the custom action in
Security Hub to run either automatically or manually.
For programmatic remediation, we recommend that you have comprehensive logs and audits
for the actions taken, as well as their outcomes. Review and analyze these logs to assess the
effectiveness of the automated processes, and identify areas of improvement. Capture logs in
Amazon CloudWatch Logs and remediation outcomes as finding notes in Security Hub.
As a starting point, consider Automated Security Response on AWS, which has pre-built
remediations for resolving common security misconfigurations.
Implementation steps
Any workload that has some form of network connectivity, whether it’s the internet or a private
network, requires multiple layers of defense to help protect from external and internal network-
based threats.
Best practices
• SEC05-BP01 Create network layers
• SEC05-BP02 Control traffic flow within your network layers
• SEC05-BP03 Implement inspection-based protection
• SEC05-BP04 Automate network protection
Segment your network topology into different layers based on logical groupings of your workload
components according to their data sensitivity and access requirements. Distinguish between
components that require inbound access from the internet, such as public web endpoints, and
those that only need internal access, such as databases.
Desired outcome: The layers of your network are part of an integral defense-in-depth approach
to security that complements the identity authentication and authorization strategy of your
workloads. Layers are in place according to data sensitivity and access requirements, with
appropriate traffic flow and control mechanisms.
Common anti-patterns:
Benefits of establishing this best practice: Establishing network layers is the first step in
restricting unnecessary pathways through the network, particularly those that lead to critical
systems and data. This makes it harder for unauthorized actors to gain access to your network and
navigate to additional resources within. Discrete network layers also reduce the scope of impact if unauthorized access does occur. Deploy your resources into private VPC subnets unless there are specific reasons not to. Determine where VPC endpoints and AWS PrivateLink can simplify adhering to security policies that limit access to internet gateways.
Implementation steps
1. Review your workload architecture. Logically group components and services based on the
functions they serve, the sensitivity of data being processed, and their behavior.
2. For components responding to requests from the internet, consider using load balancers or
other proxies to provide public endpoints. Explore shifting security controls by using managed
services, such as CloudFront, Amazon API Gateway, Elastic Load Balancing, and AWS Amplify to
host public endpoints.
3. For components running in compute environments, such as Amazon EC2 instances, AWS Fargate
containers, or Lambda functions, deploy these into private subnets based on your groups from
the first step.
4. For fully managed AWS services, such as Amazon DynamoDB, Amazon Kinesis, or Amazon SQS,
consider using VPC endpoints as the default for access over private IP addresses.
Resources
Related videos:
Related examples:
• VPC examples
• Access container applications privately on Amazon ECS by using AWS Fargate, AWS PrivateLink,
and a Network Load Balancer
• Serve static content in an Amazon S3 bucket through a VPC by using Amazon CloudFront
further control using additional services, such as AWS PrivateLink, Amazon Route 53 Resolver DNS
Firewall, AWS Network Firewall, and AWS WAF.
Understand and inventory the data flow and communication requirements of your workloads in
terms of connection-initiating parties, ports, protocols, and network layers. Evaluate the protocols
available for establishing connections and transmitting data to select ones that achieve your
protection requirements (for example, HTTPS rather than HTTP). Capture these requirements
at both the boundaries of your networks and within each layer. Once these requirements are
identified, explore options to only allow the required traffic to flow at each connection point. A
good starting point is to use security groups within your VPC, as they can be attached to resources that use an elastic network interface (ENI), such as Amazon EC2 instances, Amazon ECS tasks, Amazon EKS pods, or Amazon RDS databases. Unlike a Layer 4 firewall, a security group can have a rule that allows traffic from another security group by its identifier, minimizing updates as resources within the group change over time. You can also filter traffic with both inbound and outbound security group rules.
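The following sketch shows the security-group-referencing pattern with boto3, assuming hypothetical group IDs for an application tier and a database tier.

```python
import boto3

ec2 = boto3.client("ec2")

app_sg = "sg-0123exampleapp"  # assumed application tier security group
db_sg = "sg-0456exampledb"    # assumed database tier security group

# Allow the database tier to accept MySQL traffic only from members of the
# application security group, referenced by ID rather than by IP range.
ec2.authorize_security_group_ingress(
    GroupId=db_sg,
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 3306,
            "ToPort": 3306,
            "UserIdGroupPairs": [{"GroupId": app_sg}],
        }
    ],
)
```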
When traffic moves between VPCs, it's common to use VPC peering for simple routing or the
AWS Transit Gateway for complex routing. With these approaches, you facilitate traffic flows
between the range of IP addresses of both the source and destination networks. However, if your
workload only requires traffic flows between specific components in different VPCs, consider
using a point-to-point connection using AWS PrivateLink. To do this, identify which service should
act as the producer and which should act as the consumer. Deploy a compatible load balancer
for the producer, turn on PrivateLink accordingly, and then accept a connection request by the
consumer. The producer service is then assigned a private IP address from the consumer's VPC
that the consumer can use to make subsequent requests. This approach reduces the need to
peer the networks. Include the costs for data processing and load balancing as part of evaluating
PrivateLink.
While security groups and PrivateLink help control the flow between the components of your
workloads, another major consideration is how to control which DNS domains your resources are
allowed to access (if any). Depending on the DHCP configuration of your VPCs, you can consider
two different AWS services for this purpose. Most customers use the default Route 53 Resolver
DNS service (also called Amazon DNS server or AmazonProvidedDNS) available to VPCs at the +2
address of its CIDR range. With this approach, you can create DNS Firewall rules and associate them
to your VPC that determine what actions to take for the domain lists you supply.
If you are not using the Route 53 Resolver, or if you want to complement the Resolver with deeper inspection and flow control capabilities beyond domain filtering, consider deploying an AWS Network Firewall.
• Application Acceleration and Protection with Amazon CloudFront, AWS WAF, and AWS Shield
• AWS re:Inforce 2023: Firewalls and where to put them
Related examples:
Set up traffic inspection points between your network layers to make sure data in transit matches
the expected categories and patterns. Analyze traffic flows, metadata, and patterns to help
identify, detect, and respond to events more effectively.
Desired outcome: Traffic that traverses between your network layers is inspected and authorized.
Allow and deny decisions are based on explicit rules, threat intelligence, and deviations from
baseline behaviors. Protections become stricter as traffic moves closer to sensitive data.
Common anti-patterns:
• Relying solely on firewall rules based on ports and protocols. Not taking advantage of intelligent
systems.
• Authoring firewall rules based on specific current threat patterns that are subject to change.
• Only inspecting traffic where traffic transits from private to public subnets, or from public
subnets to the Internet.
• Not having a baseline view of your network traffic to compare for behavior anomalies.
Benefits of establishing this best practice: Inspection systems allow you to author intelligent
rules, such as allowing or denying traffic only when certain conditions within the traffic data exist.
Benefit from managed rule sets from AWS and partners, based on the latest threat intelligence,
as the threat landscape changes over time. This reduces the overhead of maintaining rules and
researching indicators of compromise, reducing the potential for false positives.
Implementation guidance
Have fine-grained control over both your stateful and stateless network traffic using AWS Network Firewall, or other firewalls and intrusion prevention systems (IPS) on AWS Marketplace that you can deploy behind a Gateway Load Balancer (GWLB).
a. To configure AWS WAF, start by configuring a web access control list (web ACL). The web ACL is a collection of serially-processed rules and a default action (ALLOW or DENY) that defines how your web ACL handles traffic. You can create your own rules and rule groups or use AWS managed rule groups in your web ACL.
b. Once your web ACL is configured, associate the web ACL with an AWS resource (like an
Application Load Balancer, API Gateway REST API, or CloudFront distribution) to begin
protecting web traffic.
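The following sketch performs both steps with boto3, assuming a regional web ACL protecting a hypothetical Application Load Balancer and using one AWS managed rule group.

```python
import boto3

wafv2 = boto3.client("wafv2")

# Create a regional web ACL that applies an AWS managed rule group.
acl = wafv2.create_web_acl(
    Name="app-web-acl",
    Scope="REGIONAL",
    DefaultAction={"Allow": {}},
    VisibilityConfig={
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "app-web-acl",
    },
    Rules=[
        {
            "Name": "common-rule-set",
            "Priority": 0,
            "Statement": {
                "ManagedRuleGroupStatement": {
                    "VendorName": "AWS",
                    "Name": "AWSManagedRulesCommonRuleSet",
                }
            },
            "OverrideAction": {"None": {}},
            "VisibilityConfig": {
                "SampledRequestsEnabled": True,
                "CloudWatchMetricsEnabled": True,
                "MetricName": "common-rule-set",
            },
        }
    ],
)

# Associate the web ACL with an Application Load Balancer (ARN is assumed).
wafv2.associate_web_acl(
    WebACLArn=acl["Summary"]["ARN"],
    ResourceArn="arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/app/example/1234567890abcdef",
)
```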
Resources
Related documents:
• Centralized inspection architecture with AWS Gateway Load Balancer and AWS Transit Gateway
Related examples:
• TLS inspection configuration for encrypted egress traffic and AWS Network Firewall
Related tools:
rules systems that can update automatically based on the latest threat intelligence. Examples
of protecting your web endpoints include AWS WAF managed rules and AWS Shield Advanced
automatic application layer DDoS mitigation. Use AWS Network Firewall managed rule groups to
stay up to date with low-reputation domain lists and threat signatures as well.
Beyond managed rules, we recommend you use DevOps practices to automate deploying your
network resources, protections, and the rules you specify. You can capture these definitions in
AWS CloudFormation or another infrastructure as code (IaC) tool of your choice, commit them
to a version control system, and deploy them using CI/CD pipelines. Use this approach to gain
the traditional benefits of DevOps for managing your network controls, such as more predictable
releases, automated testing using tools like AWS CloudFormation Guard, and detecting drift
between your deployed environment and your desired configuration.
Based on the decisions you made as part of SEC05-BP01 Create network layers, you may have
a central management approach to creating VPCs that are dedicated for ingress, egress, and
inspection flows. As described in the AWS Security Reference Architecture (AWS SRA), you can
define these VPCs in a dedicated Network infrastructure account. You can use similar techniques
to centrally define the VPCs used by your workloads in other accounts, their security groups, AWS
Network Firewall deployments, Route 53 Resolver rules and DNS Firewall configurations, and other
network resources. You can share these resources with your other accounts with the AWS Resource
Access Manager. With this approach, you can simplify the automated testing and deployment of
your network controls to the Network account, with only one destination to manage. You can do
this in a hybrid model, where you deploy and share certain controls centrally and delegate other
controls to the individual workload teams and their respective accounts.
Implementation steps
1. Establish ownership over which aspects of the network and protections are defined centrally,
and which your workload teams can maintain.
2. Create environments to test and deploy changes to your network and its protections. For
example, use a Network Testing account and a Network Production account.
3. Determine how you will store and maintain your templates in a version control system. Store
central templates in a repository that is distinct from workload repositories, while workload
templates can be stored in repositories specific to that workload.
4. Create CI/CD pipelines to test and deploy templates. Define tests to check for misconfigurations
and that templates adhere to your company standards.
Frequently scan and patch for vulnerabilities in your code, dependencies, and in your infrastructure
to help protect against new threats.
Desired outcome: Create and maintain a vulnerability management program. Regularly scan
and patch resources such as Amazon EC2 instances, Amazon Elastic Container Service (Amazon
ECS) containers, and Amazon Elastic Kubernetes Service (Amazon EKS) workloads. Configure
maintenance windows for AWS managed resources, such as Amazon Relational Database Service
(Amazon RDS) databases. Use static code scanning to inspect application source code for common
issues. Consider web application penetration testing if your organization has the requisite skills or
can hire outside assistance.
Common anti-patterns:
Implementation guidance
• Use AWS Systems Manager: You are responsible for patch management for your AWS resources,
including Amazon Elastic Compute Cloud (Amazon EC2) instances, Amazon Machine Images
(AMIs), and other compute resources. AWS Systems Manager Patch Manager automates the
process of patching managed instances with both security related and other types of updates.
Patch Manager can be used to apply patches on Amazon EC2 instances for both operating
systems and applications, including Microsoft applications, Windows service packs, and minor
version upgrades for Linux based instances. In addition to Amazon EC2, Patch Manager can also
be used to patch on-premises servers.
For a list of supported operating systems, see Supported operating systems in the Systems
Manager User Guide. You can scan instances to see only a report of missing patches, or you can scan and automatically install all missing patches (a scan command sketch follows this list).
• Use AWS Security Hub: Security Hub provides a comprehensive view of your security state in
AWS. It collects security data across multiple AWS services and provides those findings in a
standardized format, allowing you to prioritize security findings across AWS services.
• Use AWS CloudFormation: AWS CloudFormation is an infrastructure as code (IaC) service that
can help with vulnerability management by automating resource deployment and standardizing
resource architecture across multiple accounts and environments.
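The scan referenced in the Systems Manager item above can be run on demand, as in the sketch below. The patch group tag value is an assumption; change the Operation parameter to Install to apply the missing patches.

```python
import boto3

ssm = boto3.client("ssm")

# Run a patch compliance scan across instances tagged with a patch group.
ssm.send_command(
    Targets=[{"Key": "tag:PatchGroup", "Values": ["web-servers"]}],
    DocumentName="AWS-RunPatchBaseline",
    Parameters={"Operation": ["Scan"]},
    Comment="Scheduled patch compliance scan",
)
```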
Resources
Related documents:
• Improved, Automated Vulnerability Management for Cloud Workloads with a New Amazon
Inspector
• Automate vulnerability management and remediation in AWS using Amazon Inspector and AWS
Systems Manager – Part 1
Related videos:
• Security best practices for the Amazon EC2 instance metadata service
You can reduce the burden of hardening systems by using guidance that trusted sources provide,
such as the Center for Internet Security (CIS) and the Defense Information Systems Agency (DISA)
Security Technical Implementation Guides (STIGs). We recommend you start with an Amazon
Machine Image (AMI) published by AWS or an APN Partner, and use EC2 Image Builder to automate configuration according to an appropriate combination of CIS and STIG controls.
While there are available hardened images and EC2 Image Builder recipes that apply the CIS
or DISA STIG recommendations, you may find their configuration prevents your software from
running successfully. In this situation, you can start from a non-hardened base image, install your
software, and then incrementally apply CIS controls to test their impact. For any CIS control that
prevents your software from running, test whether you can implement the finer-grained hardening recommendations of a DISA STIG instead. Keep track of the different CIS controls and DISA STIG
configurations you are able to apply successfully. Use these to define your image hardening recipes
in EC2 Image Builder accordingly.
For containerized workloads, hardened images from Docker are available on the Amazon Elastic
Container Registry (ECR) public repository. You can use EC2 Image Builder to harden container
images alongside AMIs.
Similar to operating systems and container images, you can obtain code packages (or libraries)
from public repositories, through tooling such as pip, npm, Maven, and NuGet. We recommend you
manage code packages by integrating private repositories, such as within AWS CodeArtifact, with
trusted public repositories. This integration can handle retrieving, storing, and keeping packages
up-to-date for you. Your application build processes can then obtain and test the latest version of
these packages alongside your application, using techniques like Software Composition Analysis
(SCA), Static Application Security Testing (SAST), and Dynamic Application Security Testing (DAST).
For serverless workloads that use AWS Lambda, simplify managing package dependencies using
Lambda layers. Use Lambda layers to configure a set of standard dependencies that are shared
across different functions into a standalone archive. You can create and maintain layers through
their own build process, providing a central way for your functions to stay up-to-date.
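A minimal sketch of this layer workflow is shown below, assuming the layer archive has already been built and uploaded to a hypothetical S3 location, and that the target function name is illustrative.

```python
import boto3

lambda_client = boto3.client("lambda")

# Publish shared dependencies as a layer from a pre-built archive in S3.
layer = lambda_client.publish_layer_version(
    LayerName="shared-dependencies",
    Description="Common third-party packages, rebuilt by the layer pipeline",
    Content={"S3Bucket": "build-artifacts-bucket", "S3Key": "layers/shared-deps.zip"},
    CompatibleRuntimes=["python3.12"],
)

# Attach the layer so the function picks up the common packages.
lambda_client.update_function_configuration(
    FunctionName="orders-api",  # hypothetical function
    Layers=[layer["LayerVersionArn"]],
)
```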
Implementation steps
• Harden operating systems. Use base images from trusted sources as a foundation for building
your hardened AMIs. Use EC2 Image Builder to help customize the software installed on your
images.
Common anti-patterns:
• Interactive access to Amazon EC2 instances with protocols such as SSH or RDP.
Benefits of establishing this best practice: Performing actions with automation helps you to
reduce the operational risk of unintended changes and misconfigurations. Removing the use of
Secure Shell (SSH) and Remote Desktop Protocol (RDP) for interactive access reduces the scope
of access to your compute resources. This takes away a common path for unauthorized actions.
Capturing your compute resource management tasks in automation documents and programmatic
scripts provides a mechanism to define and audit the full scope of authorized activities at a fine-
grained level of detail.
Implementation guidance
Logging into an instance is a classic approach to system administration. After installing the server
operating system, users would typically log in manually to configure the system and install the
desired software. During the server's lifetime, users might log in to perform software updates,
apply patches, change configurations, and troubleshoot problems.
Manual access poses a number of risks, however. It requires a server that listens for requests, such
as an SSH or RDP service, that can provide a potential path to unauthorized access. It also increases
the risk of human error associated with performing manual steps. These can result in workload
incidents, data corruption or destruction, or other security issues. Human access also requires
protections against the sharing of credentials, creating additional management overhead.
To mitigate these risks, you can implement an agent-based remote access solution, such as AWS
Systems Manager. AWS Systems Manager Agent (SSM Agent) initiates an encrypted channel and
thus does not rely on listening for externally-initiated requests. Consider configuring SSM Agent to
establish this channel over a VPC endpoint.
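The following sketch creates the interface endpoints commonly used by SSM Agent (ssm, ssmmessages, and ec2messages), assuming hypothetical VPC, subnet, and security group IDs in us-east-1.

```python
import boto3

ec2 = boto3.client("ec2")

vpc_id = "vpc-0example"                          # assumed VPC
subnet_ids = ["subnet-0example1", "subnet-0example2"]  # assumed private subnets
endpoint_sg = "sg-0exampleendpoints"             # assumed endpoint security group

# Interface endpoints that let SSM Agent reach Systems Manager without
# traversing the internet.
for service in ("ssm", "ssmmessages", "ec2messages"):
    ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId=vpc_id,
        ServiceName=f"com.amazonaws.us-east-1.{service}",
        SubnetIds=subnet_ids,
        SecurityGroupIds=[endpoint_sg],
        PrivateDnsEnabled=True,
    )
```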
Related tools:
Related videos:
• Controlling User Session Access to Instances in AWS Systems Manager Session Manager
Use cryptographic verification to validate the integrity of software artifacts (including images) your
workload uses. Cryptographically sign your software as a safeguard against unauthorized changes being run within your compute environments.
Desired outcome: All artifacts are obtained from trusted sources. Vendor website certificates
are validated. Downloaded artifacts are cryptographically verified by their signatures. Your own
software is cryptographically signed and verified by your computing environments.
Common anti-patterns:
• Trusting reputable vendor websites to obtain software artifacts, but ignoring certificate
expiration notices. Proceeding with downloads without confirming certificates are valid.
• Validating vendor website certificates, but not cryptographically verifying downloaded artifacts
from these websites.
• Relying solely on digests or hashes to validate software integrity. Hashes establish that artifacts
have not been modified from the original version, but do not validate their source.
• Not signing your own software, code, or libraries, even when only used in your own
deployments.
Benefits of establishing this best practice: Validating the integrity of artifacts that your workload
depends on helps prevent malware from entering your compute environments. Signing your
software helps safeguard against unauthorized running in your compute environments. Secure
your software supply chain by signing and verifying code.
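As one sketch of enforcing signing for AWS Lambda, the following creates a code signing configuration tied to an AWS Signer profile and applies it to a function. The signing profile version ARN and function name are placeholders.

```python
import boto3

lambda_client = boto3.client("lambda")

# Enforce that only code signed by the approved AWS Signer profile can be deployed.
csc = lambda_client.create_code_signing_config(
    AllowedPublishers={
        "SigningProfileVersionArns": [
            "arn:aws:signer:us-east-1:111122223333:/signing-profiles/release_profile/abcd1234"  # assumed
        ]
    },
    CodeSigningPolicies={"UntrustedArtifactOnDeployment": "Enforce"},
)

# Apply the code signing configuration to a function.
lambda_client.put_function_code_signing_config(
    CodeSigningConfigArn=csc["CodeSigningConfig"]["CodeSigningConfigArn"],
    FunctionName="orders-api",  # hypothetical function
)
```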
Related examples:
• Automate Lambda code signing with Amazon CodeCatalyst and AWS Signer
• Signing and Validating OCI Artifacts with AWS Signer
Related tools:
• AWS Lambda
• AWS Signer
• AWS Certificate Manager
• AWS Key Management Service
• AWS CodeArtifact
Automate compute protection operations to reduce the need for human intervention. Use
automated scanning to detect potential issues within your compute resources, and remediate with
automated programmatic responses or fleet management operations. Incorporate automation in
your CI/CD processes to deploy trustworthy workloads with up-to-date dependencies.
Desired outcome: Automated systems perform all scanning and patching of compute resources.
You use automated verification to check that software images and dependencies come from
trusted sources, and have not been tampered with. Workloads are automatically checked for up-
to-date dependencies, and are signed to establish trustworthiness in AWS compute environments.
Automated remediations are initiated when non-compliant resources are detected.
Common anti-patterns:
• Following the practice of immutable infrastructure, but not having a solution in place for
emergency patching or replacement of production systems.
• Using automation to fix misconfigured resources, but not having a manual override mechanism
in place. Situations may arise where you need to adjust the requirements, and you may need to
suspend automations until you make these changes.
Benefits of establishing this best practice: Automation can reduce the risk of unauthorized access
and use of your compute resources. It helps to prevent misconfigurations from making their way into production environments, and to detect and fix misconfigurations should they occur.
1. Automate creating hardened images. Use EC2 Image Builder to produce images hardened to Center for Internet Security (CIS) Benchmarks or Security Technical Implementation Guide (STIG) standards from base AWS and APN Partner images.
2. Automate configuration management. Enforce and validate secure configurations in your
compute resources automatically by using a configuration management service or tool.
a. Automated configuration management using AWS Config
b. Automated security and compliance posture management using AWS Security Hub
3. Automate patching or replacing Amazon Elastic Compute Cloud (Amazon EC2) instances. AWS
Systems Manager Patch Manager automates the process of patching managed instances with
both security-related and other types of updates. You can use Patch Manager to apply patches
for both operating systems and applications.
a. AWS Systems Manager Patch Manager
4. Automate scanning of compute resources for common vulnerabilities and exposures (CVEs), and
embed security scanning solutions within your build pipeline.
a. Amazon Inspector
b. ECR Image Scanning
5. Consider Amazon GuardDuty for automatic malware and threat detection to protect compute
resources. GuardDuty can also identify potential issues when an AWS Lambda function gets
invoked in your AWS environment.
a. Amazon GuardDuty
6. Consider AWS Partner solutions. AWS Partners offer industry-leading products that are equivalent or identical to, or integrate with, existing controls in your on-premises environments.
These products complement the existing AWS services to allow you to deploy a comprehensive
security architecture and a more seamless experience across your cloud and on-premises
environments.
a. Infrastructure security
Resources
Related documents:
• Get the full benefits of IMDSv2 and disable IMDSv1 across your AWS infrastructure
Common anti-patterns:
• Not having a formal data classification policy in place to define data sensitivity levels and their
handling requirements
• Not having a good understanding of the sensitivity levels of data within your workload, and not
capturing this information in architecture and operations documentation
• Failing to apply the appropriate controls around your data based on its sensitivity and
requirements, as outlined in your data classification and handling policy
• Failing to provide feedback about data classification and handling requirements to owners of the
policies.
Benefits of establishing this best practice: This practice removes ambiguity around the
appropriate handling of data within your workload. Applying a formal policy that defines the
sensitivity levels of data in your organization and their required protections can help you comply
with legal regulations and other cybersecurity attestations and certifications. Workload owners
can have confidence in knowing where sensitive data is stored and what protection controls are in
place. Capturing these in documentation helps new team members better understand them and
maintain controls early in their tenure. These practices can also help reduce costs by right sizing the
controls for each type of data.
Implementation guidance
When designing a workload, you may be considering ways to protect sensitive data intuitively. For
example, in a multi-tenant application, it is intuitive to think of each tenant's data as sensitive and
put protections in place so that one tenant can't access the data of another tenant. Likewise, you
may intuitively design access controls so only administrators can modify data while other users
have only read-level access or no access at all.
By having these data sensitivity levels defined and captured in policy, along with their data
protection requirements, you can formally identify what data resides in your workload. You can
then determine if the right controls are in place, if the controls can be audited, and what responses
are appropriate if data is found to be mishandled.
To help with categorizing where sensitive data is present within your workload, consider
using resource tags where available. For example, you can apply a tag that has a tag key of
Classification and a tag value of PHI for protected health information (PHI), and another tag that has a tag key of Sensitivity and a tag value of High.
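As a minimal sketch of this tagging approach, the following hypothetical boto3 example applies Classification and Sensitivity tags to an S3 bucket; the bucket name and tag values are assumptions for illustration.

import boto3

s3 = boto3.client("s3")

# Tag a bucket so that automation and auditors can identify where PHI is stored.
# Note: put_bucket_tagging replaces the bucket's existing tag set.
s3.put_bucket_tagging(
    Bucket="example-patient-records",  # hypothetical bucket name
    Tagging={
        "TagSet": [
            {"Key": "Classification", "Value": "PHI"},
            {"Key": "Sensitivity", "Value": "High"},
        ]
    },
)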
Apply data protection controls that provide an appropriate level of control for each class of data
defined in your classification policy. This practice can allow you to protect sensitive data from
unauthorized access and use, while preserving the availability and use of data.
Desired outcome: You have a classification policy that defines the different levels of sensitivity for
data in your organization. For each of these sensitivity levels, you have clear guidelines published
for approved storage and handling services and locations, and their required configuration.
You implement the controls for each level according to the level of protection required and
their associated costs. You have monitoring and alerting in place to detect if data is present in
unauthorized locations, processed in unauthorized environments, accessed by unauthorized actors,
or the configuration of related services becomes non-compliant.
Common anti-patterns:
• Applying the same level of protection controls across all data. This may lead to over-provisioning
security controls for low-sensitivity data, or insufficient protection of highly sensitive data.
• Not involving relevant stakeholders from security, compliance, and business teams when
defining data protection controls.
• Overlooking the operational overhead and costs associated with implementing and maintaining
data protection controls.
• Not conducting periodic data protection control reviews to maintain alignment with
classification policies.
Benefits of establishing this best practice: By aligning your controls to the classification level of
your data, your organization can invest in higher levels of control where needed. This can include
increasing resources on securing, monitoring, measuring, remediating, and reporting. Where fewer
controls are appropriate, you can improve the accessibility and completeness of data for your
workforce, customers, or constituents. This approach gives your organization the most flexibility
with data usage, while still adhering to data protection requirements.
Implementation guidance
Implementing data protection controls based on data sensitivity levels involves several key steps.
First, identify the different data sensitivity levels within your workload architecture (such as public, internal, and confidential).
Related documents:
Related examples:
Related tools:
Automating the identification and classification of data can help you implement the correct
controls. Using automation to augment manual determination reduces the risk of human error and
exposure.
Desired outcome: You are able to verify whether the proper controls are in place based on
your classification and handling policy. Automated tools and services help you to identify and
classify the sensitivity level of your data. Automation also helps you continually monitor your
environments to detect and alert if data is being stored or handled in unauthorized ways so
corrective action can be taken quickly.
Common anti-patterns:
• Relying solely on manual processes for data identification and classification, which can be error-
prone and time-consuming. This can lead to inefficient and inconsistent data classification,
especially as data volumes grow.
• Not having mechanisms to track and manage data assets across the organization.
a. The automated sensitive data discovery capability of Macie can be used to perform ongoing
scans of your environments. Known S3 buckets that are authorized to store sensitive data can
be excluded using an allow list in Macie.
3. Incorporate identification and classification into your build and test processes.
a. Identify tools that developers can use to scan data for sensitivity while workloads are in
development. Use these tools as part of integration testing to alert when sensitive data is
unexpected and prevent further deployment.
4. Implement a system or runbook to take action when sensitive data is found in unauthorized
locations.
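As a sketch of the Macie-based discovery described in the steps above, the following hypothetical boto3 example enables Macie and starts a one-time classification job against a specific bucket; the account ID and bucket name are assumptions for illustration.

import uuid
import boto3

macie = boto3.client("macie2")

# Enable Macie in this account and Region (this call raises an error if it is already enabled).
macie.enable_macie()

# Start a one-time sensitive data discovery job against a hypothetical bucket.
macie.create_classification_job(
    clientToken=str(uuid.uuid4()),
    jobType="ONE_TIME",
    name="scan-uploads-bucket",
    s3JobDefinition={
        "bucketDefinitions": [
            {"accountId": "111122223333", "buckets": ["example-uploads-bucket"]}
        ]
    },
)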
Resources
Related documents:
• Amazon CloudWatch Logs: Help protect sensitive log data with masking
Related examples:
Related tools:
• Amazon Macie
• Amazon Comprehend
• AWS Glue
lifecycle mechanism, such as Amazon S3 lifecycle policies and the Amazon Data Lifecycle Manager,
to configure your data retention, archiving, and expiration processes.
Distinguish between data that is available for use and data that is stored as a backup. Consider
using AWS Backup to automate the backup of data across AWS services. Amazon EBS snapshots
provide a way to copy an EBS volume and store it using Amazon S3 features, including lifecycle, data
protection, and access protection mechanisms. Two of these mechanisms are S3 Object Lock
and AWS Backup Vault Lock, which can provide you with additional security and control over your
backups. Manage clear separation of duties and access for backups. Isolate backups at the account
level to maintain separation from the affected environment during an event.
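As a minimal sketch of the lifecycle mechanisms mentioned above, the following hypothetical boto3 example configures an S3 lifecycle rule that archives objects under a prefix to S3 Glacier Flexible Retrieval after 90 days and expires them after one year; the bucket name, prefix, and timings are assumptions for illustration.

import boto3

s3 = boto3.client("s3")

# Archive log objects after 90 days and delete them after 365 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-workload-logs",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)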
Another aspect of lifecycle management is recording the history of data as it progresses through
your workload, called data provenance tracking. This can give confidence that you know where
the data came from, any transformations performed, what owner or process made those changes,
and when. Having this history helps with troubleshooting issues and investigations during
potential security events. For example, you can log metadata about transformations in an Amazon
DynamoDB table. Within a data lake, you can keep copies of transformed data in different
S3 buckets for each data pipeline stage. Store schema and timestamp information in an AWS
Glue Data Catalog. Regardless of your solution, consider the requirements of your end users to
determine the appropriate tooling you need to report on your data provenance. This will help you
determine how to best track your provenance.
Implementation steps
1. Analyze the workload's data types, sensitivity levels, and access requirements to classify the data
and define appropriate lifecycle management strategies.
2. Design and implement data retention policies and automated destruction processes that align
with legal, regulatory, and organizational requirements.
3. Establish processes and automation for continuous monitoring, auditing, and adjustment of
data lifecycle management strategies, controls, and policies as workload requirements and
regulations evolve.
Resources
correct balance between key availability, confidentiality, and integrity. Access to keys should be
monitored, and key material rotated through an automated process. Key material should never be
accessible to human identities.
Common anti-patterns:
Benefits of establishing this best practice: By establishing a secure key management mechanism
for your workload, you can help provide protection for your content against unauthorized access.
Additionally, you may be subject to regulatory requirements to encrypt your data. An effective key
management solution can provide technical mechanisms aligned to those regulations to protect
key material.
Implementation guidance
Many regulatory requirements and best practices include encryption of data at rest as a
fundamental security control. In order to comply with this control, your workload needs a
mechanism to securely store and manage the key material used to encrypt your data at rest.
AWS offers AWS Key Management Service (AWS KMS) to provide durable, secure, and redundant
storage for AWS KMS keys. Many AWS services integrate with AWS KMS to support encryption of
your data. AWS KMS uses FIPS 140-2 Level 3 validated hardware security modules to protect your
keys. There is no mechanism to export AWS KMS keys in plain text.
When deploying workloads using a multi-account strategy, it is considered best practice to keep
AWS KMS keys in the same account as the workload that uses them. In this distributed model,
responsibility for managing the AWS KMS keys resides with the application team. In other use
cases, organizations may choose to store AWS KMS keys in a centralized account. This centralized
structure requires additional policies to enable the cross-account access required for the workload
account to access keys stored in the centralized account, but may be more applicable in use cases
where a single key is shared across multiple AWS accounts.
Regardless of where the key material is stored, access to the key should be tightly controlled
through the use of key policies and IAM policies. Key policies are the primary way to control access to an AWS KMS key.
• Review the best practices for access control to your AWS KMS keys.
4. Consider AWS Encryption SDK: Use the AWS Encryption SDK with AWS KMS integration when
your application needs to encrypt data client-side.
• AWS Encryption SDK
5. Enable IAM Access Analyzer to automatically review and notify if there are overly broad AWS
KMS key policies.
6. Enable Security Hub to receive notifications if there are misconfigured key policies, keys
scheduled for deletion, or keys without automated rotation enabled.
7. Determine the logging level appropriate for your AWS KMS keys. Since calls to AWS KMS,
including read-only events, are logged, the CloudTrail logs associated with AWS KMS can
become voluminous.
• Some organizations prefer to segregate the AWS KMS logging activity into a separate trail.
For more detail, see the Logging AWS KMS API calls with CloudTrail section of the AWS KMS
developers guide.
Resources
Related documents:
Related videos:
Additionally, Amazon Elastic Compute Cloud (Amazon EC2) and Amazon S3 support the
enforcement of encryption by setting default encryption. You can use AWS Config Rules to check
automatically that you are using encryption, for example, for Amazon Elastic Block Store (Amazon
EBS) volumes, Amazon Relational Database Service (Amazon RDS) instances, and Amazon S3
buckets.
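As a sketch of enforcing these defaults, the following hypothetical boto3 example turns on EBS encryption by default for the current Region and sets default SSE-KMS encryption on an S3 bucket; the bucket name and key ARN are assumptions for illustration.

import boto3

ec2 = boto3.client("ec2")
s3 = boto3.client("s3")

# Encrypt all newly created EBS volumes in this account and Region by default.
ec2.enable_ebs_encryption_by_default()

# Apply default SSE-KMS encryption to a bucket using a customer managed key.
s3.put_bucket_encryption(
    Bucket="example-data-bucket",  # hypothetical bucket name
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/example",
                },
                "BucketKeyEnabled": True,
            }
        ]
    },
)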
AWS also provides options for client-side encryption, allowing you to encrypt data prior to
uploading it to the cloud. The AWS Encryption SDK provides a way to encrypt your data using
envelope encryption. You provide the wrapping key, and the AWS Encryption SDK generates a
unique data key for each data object it encrypts. Consider AWS CloudHSM if you need a managed
single-tenant hardware security module (HSM). AWS CloudHSM allows you to generate, import,
and manage cryptographic keys on a FIPS 140-2 level 3 validated HSM. Some use cases for AWS
CloudHSM include protecting private keys for issuing a certificate authority (CA), and turning on
transparent data encryption (TDE) for Oracle databases. The AWS CloudHSM Client SDK provides
software that allows you to encrypt data client side using keys stored inside AWS CloudHSM prior
to uploading your data into AWS. The Amazon DynamoDB Encryption Client also allows you to
encrypt and sign items prior to upload into a DynamoDB table.
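As a minimal sketch of client-side envelope encryption with the AWS Encryption SDK for Python (installed separately as aws-encryption-sdk), the following hypothetical example wraps a unique data key under a KMS key; the key ARN is an assumption for illustration.

import aws_encryption_sdk
from aws_encryption_sdk import CommitmentPolicy

client = aws_encryption_sdk.EncryptionSDKClient(
    commitment_policy=CommitmentPolicy.REQUIRE_ENCRYPT_REQUIRE_DECRYPT
)

# Wrapping key: a KMS key you control (hypothetical ARN).
key_provider = aws_encryption_sdk.StrictAwsKmsMasterKeyProvider(
    key_ids=["arn:aws:kms:us-east-1:111122223333:key/example"]
)

# Encrypt locally; the SDK generates a unique data key per message and stores it,
# wrapped by the KMS key, alongside the ciphertext.
ciphertext, _header = client.encrypt(source=b"sensitive payload", key_provider=key_provider)

# Decrypt locally; KMS is called only to unwrap the data key.
plaintext, _header = client.decrypt(source=ciphertext, key_provider=key_provider)
assert plaintext == b"sensitive payload"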
Implementation steps
• Enforce encryption at rest for Amazon S3: Implement Amazon S3 bucket default encryption.
• Configure default encryption for new Amazon EBS volumes: Specify that you want all newly
created Amazon EBS volumes to be created in encrypted form, with the option of using the
default key provided by AWS or a key that you create.
• Configure encrypted Amazon Machine Images (AMIs): Copying an existing AMI with encryption
configured will automatically encrypt root volumes and snapshots.
• Configure Amazon RDS encryption: Configure encryption for your Amazon RDS database
clusters and snapshots at rest by using the encryption option.
• Create and configure AWS KMS keys with policies that limit access to the appropriate
principals for each classification of data: For example, create one AWS KMS key for encrypting
production data and a different key for encrypting development or test data. You can also
provide key access to other AWS accounts. Consider having different accounts for your
development and production environments. If your production environment needs to decrypt
artifacts in the development account, you can edit the CMK policy used to encrypt the development artifacts to give the production account the ability to decrypt them.
Common anti-patterns:
Benefits of establishing this best practice: Automation reduces the risk of misconfiguring
your data storage locations, helps prevent misconfigurations from entering your production
environments, and helps you detect and fix misconfigurations if they occur.
Implementation guidance
Automation is a theme throughout the practices for protecting your data at rest. SEC01-
BP06 Automate deployment of standard security controls describes how you can capture the
configuration of your resources using infrastructure as code (IaC) templates, such as with AWS
CloudFormation. These templates are committed to a version control system, and are used to
deploy resources on AWS through a CI/CD pipeline. These techniques equally apply to automating
the configuration of your data storage solutions, such as encryption settings on Amazon S3
buckets.
You can check the settings that you define in your IaC templates for misconfiguration in your CI/
CD pipelines using rules in AWS CloudFormation Guard. You can monitor settings that are not yet
available in CloudFormation or other IaC tooling for misconfiguration with AWS Config. Alerts that
Config generates for misconfigurations can be remediated automatically, as described in SEC04-
BP04 Initiate remediation for non-compliant resources.
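As a sketch of the Config-based monitoring described above, the following hypothetical boto3 example deploys the AWS managed rule that checks S3 buckets for default encryption; pairing it with an automated remediation configuration is left as an assumption.

import boto3

config = boto3.client("config")

# Deploy an AWS managed rule that flags S3 buckets without default encryption enabled.
config.put_config_rule(
    ConfigRule={
        "ConfigRuleName": "s3-default-encryption-enabled",
        "Description": "Checks that S3 buckets have default server-side encryption enabled.",
        "Source": {
            "Owner": "AWS",
            "SourceIdentifier": "S3_BUCKET_SERVER_SIDE_ENCRYPTION_ENABLED",
        },
    }
)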
Using automation as part of your permissions management strategy is also an integral component
of automated data protections. SEC03-BP02 Grant least privilege access and SEC03-BP04 Reduce
permissions continuously describe configuring least-privilege access policies that are continually
monitored by the AWS Identity and Access Management Access Analyzer to generate findings when
a. AWS Backup is a managed service that creates encrypted and secure backups of various
data sources on AWS. Elastic Disaster Recovery allows you to copy full server workloads
and maintain continuous data protection with a recovery point objective (RPO) measured
in seconds. You can configure both services to work together to automate creating data
backups and copying them to failover locations. This can help keep your data available when
impacted by either operational or security events.
Resources
Related documents:
• AWS Prescriptive Guidance: Automatically encrypt existing and new Amazon EBS volumes
• Ransomware Risk Management on AWS Using the NIST Cyber Security Framework (CSF)
Related examples:
• How to use AWS Config proactive rules and AWS CloudFormation Hooks to prevent creation of
noncompliant cloud resources
• Automate and centrally manage data protection for Amazon S3 with AWS Backup
• AWS re:Invent 2023 - Implement proactive data protection using Amazon EBS snapshots
• AWS re:Invent 2022 - Build and automate for resilience with modern data protection
Related tools:
Amazon S3 Glacier Vault Lock and Amazon S3 Object Lock provide mandatory access control for
objects in Amazon S3. Once a vault policy is locked with the compliance option, not even the root
user can change it until the lock expires.
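As a minimal sketch of S3 Object Lock, the following hypothetical boto3 example sets a default compliance-mode retention period on a bucket that was created with Object Lock enabled; the bucket name and retention period are assumptions for illustration.

import boto3

s3 = boto3.client("s3")

# The bucket must have been created with ObjectLockEnabledForBucket=True.
s3.put_object_lock_configuration(
    Bucket="example-audit-archive",  # hypothetical bucket name
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {
            # Compliance mode: no identity, including the root user, can shorten
            # or remove the retention period on protected object versions.
            "DefaultRetention": {"Mode": "COMPLIANCE", "Days": 365}
        },
    },
)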
Implementation steps
• Enforce access control: Enforce access control with least privileges, including access to
encryption keys.
• Separate data based on different classification levels: Use different AWS accounts for data
classification levels, and manage those accounts using AWS Organizations.
• Review AWS Key Management Service (AWS KMS) policies: Review the level of access granted
in AWS KMS policies.
• Review Amazon S3 bucket and object permissions: Regularly review the level of access granted
in S3 bucket policies. Best practice is to avoid using publicly readable or writeable buckets.
Consider using AWS Config to detect buckets that are publicly available, and Amazon CloudFront
to serve content from Amazon S3. Verify that buckets that should not allow public access are
properly configured to prevent public access. By default, all S3 buckets are private, and can only
be accessed by users that have been explicitly granted access.
• Use AWS IAM Access Analyzer: IAM Access Analyzer analyzes Amazon S3 buckets and generates
a finding when an S3 policy grants access to an external entity.
• Use Amazon S3 versioning and object lock when appropriate.
• Use Amazon S3 Inventory: Amazon S3 Inventory can be used to audit and report on the
replication and encryption status of your S3 objects.
• Review Amazon EBS and AMI sharing permissions: Sharing permissions can allow images and
volumes to be shared with AWS accounts that are external to your workload.
• Review AWS Resource Access Manager Shares periodically to determine whether resources
should continue to be shared. Resource Access Manager allows you to share resources, such
as AWS Network Firewall policies, Amazon Route 53 resolver rules, and subnets, within your
Amazon VPCs. Audit shared resources regularly and stop sharing resources which no longer need
to be shared.
Resources
Desired outcome: A secure certificate management system that can provision, deploy, store, and
renew certificates in a public key infrastructure (PKI). A secure key and certificate management
mechanism prevents certificate private key material from disclosure and automatically renews
the certificate on a periodic basis. It also integrates with other services to provide secure network
communications and identity for machine resources inside of your workload. Key material should
never be accessible to human identities.
Common anti-patterns:
Implementation guidance
Modern workloads make extensive use of encrypted network communications using PKI protocols
such as TLS. PKI certificate management can be complex, but automated certificate provisioning,
deployment, and renewal can reduce the friction associated with certificate management.
AWS provides two services to manage general-purpose PKI certificates: AWS Certificate Manager
and AWS Private Certificate Authority (AWS Private CA). ACM is the primary service that customers
use to provision, manage, and deploy certificates for use in both public-facing as well as private
AWS workloads. ACM issues certificates using AWS Private CA and integrates with many other AWS
managed services to provide secure TLS certificates for workloads.
AWS Private CA allows you to establish your own root or subordinate certificate authority and
issue TLS certificates through an API. You can use these kinds of certificates in scenarios where you
control and manage the trust chain on the client side of the TLS connection. In addition to TLS use cases, AWS Private CA can issue certificates for other purposes, such as code signing and device identity.
• Use ACM managed renewal for certificates issued by ACM along with integrated AWS
managed services.
3. Establish logging and audit trails:
• Enable CloudTrail logs to track access to the accounts holding certificate authorities. Consider
configuring log file integrity validation in CloudTrail to verify the authenticity of the log data.
• Periodically generate and review audit reports that list the certificates that your private CA has
issued or revoked. These reports can be exported to an S3 bucket.
• When deploying a private CA, you will also need to establish an S3 bucket to store the
Certificate Revocation List (CRL). For guidance on configuring this S3 bucket based on your
workload's requirements, see Planning a certificate revocation list (CRL).
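As a sketch of automated certificate provisioning with ACM, the following hypothetical boto3 example requests a public certificate with DNS validation, which allows ACM managed renewal; the domain names are assumptions for illustration.

import boto3

acm = boto3.client("acm")

# Request a public certificate; DNS validation enables automatic managed renewal
# for as long as the validation CNAME record stays in place.
response = acm.request_certificate(
    DomainName="app.example.com",  # hypothetical domain
    SubjectAlternativeNames=["www.app.example.com"],
    ValidationMethod="DNS",
)
print(response["CertificateArn"])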
Resources
Related documents:
• How to secure an enterprise scale ACM Private CA hierarchy for automotive and manufacturing
• Private CA best practices
Related videos:
Related examples:
• Private CA workshop
Additionally, you can use VPN connectivity into your VPC from an external network or AWS Direct
Connect to facilitate encryption of traffic. Verify that your clients are making calls to AWS APIs
using at least TLS 1.2, as AWS has deprecated the use of earlier TLS versions on its API endpoints. AWS
recommends using TLS 1.3. Third-party solutions are available in the AWS Marketplace if you have
special requirements.
Implementation steps
• Enforce encryption in transit: Your defined encryption requirements should be based on the
latest standards and best practices and only allow secure protocols. For example, configure a
security group to only allow the HTTPS protocol to an application load balancer or Amazon EC2
instance.
• Configure secure protocols in edge services: Configure HTTPS with Amazon CloudFront and use
a security profile appropriate for your security posture and use case.
• Use a VPN for external connectivity: Consider using an IPsec VPN for securing point-to-point or
network-to-network connections to help provide both data privacy and integrity.
• Configure secure protocols in load balancers: Select a security policy that provides the
strongest cipher suites supported by the clients that will be connecting to the listener. Create an
HTTPS listener for your Application Load Balancer.
• Configure secure protocols in Amazon Redshift: Configure your cluster to require a secure
socket layer (SSL) or transport layer security (TLS) connection.
• Configure secure protocols: Review AWS service documentation to determine encryption-in-
transit capabilities.
• Configure secure access when uploading to Amazon S3 buckets: Use Amazon S3 bucket policy
controls to enforce secure access to data.
• Consider using AWS Certificate Manager: ACM allows you to provision, manage, and deploy
public TLS certificates for use with AWS services.
• Consider using AWS Private Certificate Authority for private PKI needs: AWS Private CA allows
you to create private certificate authority (CA) hierarchies to issue end-entity X.509 certificates
that can be used to create encrypted TLS channels.
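As a sketch of the Amazon S3 bucket policy control listed above, the following hypothetical boto3 example denies any request to a bucket that is not made over TLS; the bucket name is an assumption for illustration, and the policy replaces any existing bucket policy, so merge statements if one already exists.

import json
import boto3

s3 = boto3.client("s3")
bucket = "example-data-bucket"  # hypothetical bucket name

# Deny all S3 actions on the bucket and its objects when the request is not sent over TLS.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{bucket}",
                f"arn:aws:s3:::{bucket}/*",
            ],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}

s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))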
Resources
Related documents:
• Enhances monitoring, logging, and incident response through request attribution and well-
defined communication interfaces.
• Provides defense-in-depth for your workloads by combining network controls with
authentication and authorization controls.
Implementation guidance
Your workload’s network traffic patterns can be characterized into two categories:
• East-west traffic represents traffic flows between services that make up a workload.
• North-south traffic represents traffic flows between your workload and consumers.
While it is common practice to encrypt north-south traffic, securing east-west traffic using
authenticated protocols is less common. Modern security practices recommend that network
design alone does not grant a trusted relationship between two entities. When two services may
reside within a common network boundary, it is still best practice to encrypt, authenticate, and
authorize communications between those services.
As an example, AWS service APIs use the AWS Signature Version 4 (SigV4) signature protocol to
authenticate the caller, no matter what network the request originates from. This authentication
ensures that AWS APIs can verify the identity that requested the action, and that identity can then
be combined with policies to make an authorization decision to determine whether the action
should be allowed or not.
Services such as Amazon VPC Lattice and Amazon API Gateway allow you to use the same SigV4
signature protocol to add authentication and authorization to east-west traffic in your own
workloads. If resources outside of your AWS environment need to communicate with services
that require SigV4-based authentication and authorization, you can use AWS Identity and
Access Management (IAM) Roles Anywhere on the non-AWS resource to acquire temporary AWS
credentials. These credentials can be used to sign requests to services using SigV4 to authorize
access.
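As a minimal sketch of SigV4-authenticated east-west calls, the following hypothetical example uses botocore to sign a request to an IAM-authorized API Gateway endpoint; the URL, service name, and Region are assumptions for illustration.

import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

session = boto3.Session()
region = "us-east-1"  # hypothetical Region
url = "https://abc123.execute-api.us-east-1.amazonaws.com/prod/orders"  # hypothetical endpoint

# Build and sign the request; the resulting headers carry the SigV4 signature that
# API Gateway (with IAM authorization) validates against the caller's policies.
request = AWSRequest(method="GET", url=url)
SigV4Auth(session.get_credentials(), "execute-api", region).add_auth(request)

signed_headers = dict(request.headers)  # includes Authorization and X-Amz-Date
# Send signed_headers with the HTTP client of your choice.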
Another common mechanism for authenticating east-west traffic is TLS mutual authentication
(mTLS). Many Internet of Things (IoT), business-to-business applications, and microservices use
mTLS to validate the identity of both sides of a TLS communication through the use of both client
and server-side X.509 certificates. These certificates can be issued by AWS Private Certificate Authority (AWS Private CA).
• For service-to-service communication using mTLS, consider API Gateway or App Mesh. AWS
Private CA can be used to establish a private CA hierarchy capable of issuing certificates for use
with mTLS.
• When integrating with services using OAuth 2.0 or OIDC, consider API Gateway using the JWT
authorizer.
• For communication between your workload and IoT devices, consider AWS IoT Core, which
provides several options for network traffic encryption and authentication.
• Monitor for unauthorized access: Continually monitor for unintended communication channels,
unauthorized principals attempting to access protected resources, and other improper access
patterns.
• If using VPC Lattice to manage access to your services, consider enabling and monitoring VPC
Lattice access logs. These access logs include information on the requesting entity, network
information including source and destination VPC, and request metadata.
• Consider enabling VPC flow logs to capture metadata on network flows and periodically
review for anomalies.
• Refer to the AWS Security Incident Response Guide and the Incident Response section of the
AWS Well-Architected Framework security pillar for more guidance on planning, simulating,
and responding to security incidents.
Resources
Related documents:
Identify internal and external personnel, resources, and legal obligations to help your organization
respond to an incident.
Desired outcome: You have a list of key personnel, their contact information, and the roles they
play when responding to a security event. You review this information regularly and update it
to reflect personnel changes, both internally and at external parties. You consider all
third-party service providers and vendors while documenting this information, including security
partners, cloud providers, and software-as-a-service (SaaS) applications. During a security event,
personnel are available with the appropriate level of responsibility, context, and access to be able
to respond and recover.
Common anti-patterns:
• Not maintaining an updated list of key personnel with contact information, their roles, and their
responsibilities when responding to security events.
• Assuming that everyone understands the people, dependencies, infrastructure, and solutions
when responding to and recovering from an event.
• Not having a document or knowledge repository that represents key infrastructure or application
design.
• Not having proper onboarding processes for new employees to effectively contribute to a
security event response, such as conducting event simulations.
• Not having an escalation path in place when key personnel are temporarily unavailable or fail to
respond during security events.
Benefits of establishing this best practice: This practice reduces the triage and response time
spent on identifying the right personnel and their roles during an event. Minimize wasted time
during an event by maintaining an updated list of key personnel and their roles so you can bring
the right individuals to triage and recover from an event.
Implementation guidance
Identify key personnel in your organization: Maintain a contact list of personnel within your
organization that you need to involve. Regularly review and update this information when personnel or organizational changes occur.
a. Identify the most appropriate contacts to engage during an incident. Define escalation plans
aligned to the roles of personnel to be engaged, rather than individual contacts. Consider
including contacts that may be responsible for informing external entities, even if they are
not directly engaged to resolve the incident.
Resources
• OPS02-BP03 Operations activities have identified owners responsible for their performance
Related documents:
Related examples:
Related tools:
Related videos:
The first document to develop for incident response is the incident response plan. The incident
response plan is designed to be the foundation for your incident response program and strategy.
Benefits of establishing this best practice: Developing thorough and clearly defined incident
response processes is key to a successful and scalable incident response program. When a security
event occurs, clear steps and workflows can help you to respond in a timely manner. You might
also consider creating a responsible, accountable, consulted, and informed (RACI) chart for your security response plans;
doing so facilitates quick and direct communication and clearly outlines the leadership across
different stages of the event.
During an incident, including the owners and developers of impacted applications and resources
is key because they are subject matter experts (SMEs) that can provide information and context
to aid in measuring impact. Make sure to practice and build relationships with the developers and
application owners before you rely on their expertise for incident response. Application owners or
SMEs, such as your cloud administrators or engineers, might need to act in situations where the
environment is unfamiliar or has complexity, or where the responders don’t have access.
Lastly, trusted partners might be involved in the investigation or response because they can
provide additional expertise and valuable scrutiny. When you don’t have these skills on your own
team, you might want to hire an external party for assistance.
• AWS Support
• AWS Support offers a range of plans that provide access to tools and expertise that support
the success and operational health of your AWS solutions. If you need technical support and
more resources to help plan, deploy, and optimize your AWS environment, you can select a
support plan that best aligns with your AWS use case.
• Consider the Support Center in AWS Management Console (sign-in required) as the central
point of contact to get support for issues that affect your AWS resources. Access to AWS
Support is controlled by AWS Identity and Access Management. For more information about
getting access to AWS Support features, see Getting started with AWS Support.
• AWS Customer Incident Response Team (CIRT)
• The AWS Customer Incident Response Team (CIRT) is a specialized 24/7 global AWS team that
provides support to customers during active security events on the customer side of the AWS
Shared Responsibility Model.
• When the AWS CIRT supports you, they provide assistance with triage and recovery for an
active security event on AWS. They can assist in root cause analysis through the use of AWS
service logs and provide you with recommendations for recovery. They can also provide
security recommendations and best practices to help you avoid security events in the future.
• AWS customers can engage the AWS CIRT through an AWS Support case.
• Phases of incident response and actions to take: Enumerates the phases of incident response
(for example, detect, analyze, contain, eradicate, and recover), including high-level actions to
take within those phases.
• Incident severity and prioritization definitions: Details how to classify the severity of an
incident, how to prioritize the incident, and then how the severity definitions affect escalation
procedures.
While these sections are common throughout companies of different sizes and industries, each
organization’s incident response plan is unique. You need to build an incident response plan that
works best for your organization.
Resources
Related documents:
Ahead of a security incident, consider developing forensics capabilities to support security event
investigations.
Concepts from traditional on-premises forensics apply to AWS. For key information to start
building forensics capabilities in the AWS Cloud, see Forensic investigation environment strategies
in the AWS Cloud.
Once you have your environment and AWS account structure set up for forensics, define the
technologies required to effectively perform forensically sound methodologies across the four
phases: collection, examination, analysis, and reporting. It is a best practice to
instrument the forensics accounts well ahead of an incident so that responders can be prepared to
effectively use them for response.
A sample account structure for this approach includes a forensics OU with per-Region forensics accounts.
Setting up backups of key systems and databases are critical for recovering from a security incident
and for forensics purposes. With backups in place, you can restore your systems to their previous
safe state. On AWS, you can take snapshots of various resources. Snapshots provide you with point-
in-time backups of those resources. There are many AWS services that can support you in backup
and recovery. For detail on these services and approaches for backup and recovery, see Backup and
Recovery Prescriptive Guidance and Use backups to recover from security incidents.
Especially when it comes to situations such as ransomware, it’s critical for your backups to be well
protected. For guidance on securing your backups, see Top 10 security best practices for securing
backups in AWS. In addition to securing your backups, you should regularly test your backup and
restore processes to verify that the technology and processes you have in place work as expected.
Automate forensics
During a security event, your incident response team must be able to collect and analyze evidence
quickly while maintaining accuracy for the time period surrounding the event (such as capturing
logs related to a specific event or resource, or collecting a memory dump of an Amazon EC2 instance).
Implementation guidance
• Expected incidents: Playbooks should be created for incidents you anticipate. This includes
threats like denial of service (DoS), ransomware, and credential compromise.
• Known security findings or alerts: Playbooks should be created for your known security findings
and alerts, such as GuardDuty findings. You might receive a GuardDuty finding and think, "Now
what?" To prevent the mishandling or ignoring of a GuardDuty finding, create a playbook for
each potential GuardDuty finding. Some remediation details and guidance can be found in
the GuardDuty documentation. It’s worth noting that GuardDuty is not enabled by default and
does incur a cost. For more detail on GuardDuty, see Appendix A: Cloud capability definitions -
Visibility and alerting.
Playbooks should contain technical steps for a security analyst to complete in order to adequately
investigate and respond to a potential security incident.
Implementation steps
• Playbook overview: What risk or incident scenario does this playbook address? What is the goal
of the playbook?
• Prerequisites: What logs, detection mechanisms, and automated tools are required for this
incident scenario? What is the expected notification?
• Communication and escalation information: Who is involved and what is their contact
information? What are each of the stakeholders’ responsibilities?
• Response steps: Across phases of incident response, what tactical steps should be taken? What
queries should an analyst run? What code should be run to achieve the desired outcome?
• Recover: How will the affected system or resource be brought back into production?
• Expected outcomes: After queries and code are run, what is the expected result of the playbook?
We recommend the use of temporary privilege escalation in the majority of incident response
scenarios. The correct way to do this is to use the AWS Security Token Service and session policies
to scope access.
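As a sketch of scoping temporary access with AWS STS, the following hypothetical boto3 example assumes an incident response role while attaching a session policy that further restricts the session to read-only investigation actions; the role ARN and allowed actions are assumptions for illustration.

import json
import boto3

sts = boto3.client("sts")

# Session policy: the effective permissions are the intersection of the role's
# attached policies and this inline policy, so the session cannot exceed either.
session_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:GetLogEvents",
                "logs:FilterLogEvents",
                "guardduty:Get*",
                "guardduty:List*",
            ],
            "Resource": "*",
        }
    ],
}

creds = sts.assume_role(
    RoleArn="arn:aws:iam::111122223333:role/IncidentResponderReadOnly",  # hypothetical role
    RoleSessionName="ir-investigation-12345",
    Policy=json.dumps(session_policy),
    DurationSeconds=3600,
)["Credentials"]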
There are scenarios where federated identities are unavailable, such as:
• Malicious activity, such as a distributed denial of service (DDoS) event or other activity that renders
the system unavailable.
In the preceding cases, there should be emergency break glass access configured to allow
investigation and timely remediation of incidents. We recommend that you use a user, group,
or role with appropriate permissions to perform tasks and access AWS resources. Use the root
user only for tasks that require root user credentials. To verify that incident responders have the
correct level of access to AWS and other relevant systems, we recommend the pre-provisioning
of dedicated accounts. The accounts require privileged access, and must be tightly controlled
and monitored. The accounts must be built with the fewest privileges required to perform the
necessary tasks, and the level of access should be based on the playbooks created as part of the
incident management plan.
Use purpose-built and dedicated users and roles as a best practice. Temporarily escalating user or
role access through the addition of IAM policies both makes it unclear what access users had during
the incident, and risks the escalated privileges not being revoked.
It is important to remove as many dependencies as possible to verify that access can be gained
under the widest possible number of failure scenarios. To support this, create a playbook to verify
that incident response users are created as users in a dedicated security account, and not managed
through any existing Federation or single sign-on (SSO) solution. Each individual responder must
have their own named account. The account configuration must enforce strong password policy
and multi-factor authentication (MFA). If the incident response playbooks only require access to
the AWS Management Console, the user should not have access keys configured and should be
explicitly disallowed from creating access keys. This can be configured with IAM policies or service
control policies (SCPs) as mentioned in the AWS Security Best Practices for AWS Organizations
SCPs. The users should have no privileges other than the ability to assume incident response roles
in other accounts.
As the incident response roles are likely to have a high level of access, it is important that these
alerts go to a wide group and are acted upon promptly.
During an incident, it is possible that a responder might require access to systems which are not
directly secured by IAM. These could include Amazon Elastic Compute Cloud instances, Amazon
Relational Database Service databases, or software-as-a-service (SaaS) platforms. It is strongly
recommended that rather than using native protocols such as SSH or RDP, AWS Systems Manager
Session Manager is used for all administrative access to Amazon EC2 instances. This access can be
controlled using IAM, which is secure and audited. It might also be possible to automate parts of
your playbooks using AWS Systems Manager Run Command documents, which can reduce user
error and improve time to recovery. For access to databases and third-party tools, we recommend
storing access credentials in AWS Secrets Manager and granting access to the incident responder
roles.
Finally, the management of the incident response IAM accounts should be added to your Joiners,
Movers, and Leavers processes and reviewed and tested periodically to verify that only the
intended access is allowed.
Resources
Related documents:
Related videos:
During a security investigation, you need to be able to review relevant logs to record and
understand the full scope and timeline of the incident. Logs are also required for alert generation,
indicating certain actions of interest have happened. It is critical to select, enable, store, and set up
querying and retrieval mechanisms, and set up alerting. Additionally, Amazon Detective provides
an effective way to search and analyze log data.
AWS offers over 200 cloud services and thousands of features. We recommend that you review the
services that can support and simplify your incident response strategy.
In addition to logging, you should develop and implement a tagging strategy. Tagging can help
provide context around the purpose of an AWS resource. Tagging can also be used for automation.
Implementation steps
AWS provides native detective, preventative, and responsive capabilities, and other services can
be used to architect custom security solutions. For a list of the most relevant services for security
incident response, see Cloud capability definitions.
Obtaining contextual information on the business use case and relevant internal stakeholders
surrounding an AWS resource can be difficult. One way to do this is in the form of tags, which
assign metadata to your AWS resources and consist of a user-defined key and value. You can create
tags to categorize resources by purpose, owner, environment, type of data processed, and other
criteria of your choice.
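As a minimal sketch of tagging-driven context, the following hypothetical boto3 example applies organizational tags to a resource and then finds resources by tag during an investigation; the ARNs and tag values are assumptions for illustration.

import boto3

tagging = boto3.client("resourcegroupstaggingapi")

# Apply ownership and data-classification tags to a resource (hypothetical ARN).
tagging.tag_resources(
    ResourceARNList=["arn:aws:s3:::example-uploads-bucket"],
    Tags={
        "Owner": "payments-team",
        "Environment": "production",
        "Classification": "Confidential",
    },
)

# During a response, quickly find every resource that processes confidential data.
paginator = tagging.get_paginator("get_resources")
for page in paginator.paginate(
    TagFilters=[{"Key": "Classification", "Values": ["Confidential"]}]
):
    for resource in page["ResourceTagMappingList"]:
        print(resource["ResourceARN"])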
Having a consistent tagging strategy can speed up response times and minimize time spent on
organizational context by allowing you to quickly identify and discern contextual information
about an AWS resource. Tags can also serve as a mechanism to initiate response automations.
For more detail on what to tag, see Tagging your AWS resources. You’ll want to first define the tags you want to use across your organization.
Implementation guidance
• Purple team exercises: Purple team exercises increase the level of collaboration between
the incident responders (blue team) and simulated threat actors (red team). The blue team
is comprised of members of the security operations center (SOC), but can also include other
stakeholders that would be involved during an actual cyber event. The red team is comprised
of a penetration testing team or key stakeholders that are trained in offensive security. The
red team works collaboratively with the exercise facilitators when designing a scenario so that
the scenario is accurate and feasible. During purple team exercises, the primary focus is on the
detection mechanisms, the tools, and the standard operating procedures (SOPs) supporting the
incident response efforts.
• Red team exercises: During a red team exercise, the offense (red team) conducts a simulation
to achieve a certain objective or set of objectives from a predetermined scope. The defenders
(blue team) will not necessarily have knowledge of the scope and duration of the exercise, which
provides a more realistic assessment of how they would respond to an actual incident. Because
red team exercises can be invasive tests, be cautious and implement controls to verify that the
exercise does not cause actual harm to your environment.
Consider facilitating cyber simulations at a regular interval. Each exercise type can provide unique
benefits to the participants and the organization as a whole, so you might choose to start with less
complex simulation types (such as tabletop exercises) and progress to more complex simulation
types (red team exercises). You should select a simulation type based on your security maturity,
resources, and your desired outcomes. Some customers might not choose to perform red team
exercises due to complexity and cost.
misconfigurations, not only improving your security posture, but also minimizing time lost to
preventable situations.
Implementation guidance
It's important to implement a lessons learned framework that establishes and achieves, at a high
level, the following points:
The framework should not focus on or blame individuals, but instead should focus on improving
tools and processes.
Implementation steps
Aside from the preceding high-level outcomes listed, it’s important to make sure that you ask the
right questions to derive the most value (information that leads to actionable improvements) from
the process. Consider these questions to help get you started in fostering your lessons learned
discussions:
Resources
Related documents:
• AWS Security Incident Response Guide - Establish a framework for learning from incidents
• NCSC CAF guidance - Lessons learned
Application security
Question
• SEC 11. How do you incorporate and validate the security properties of applications throughout
the design, development, and deployment lifecycle?
SEC 11. How do you incorporate and validate the security properties of
applications throughout the design, development, and deployment lifecycle?
Training people, testing using automation, understanding dependencies, and validating the
security properties of tools and applications help to reduce the likelihood of security issues in
production workloads.
Best practices
• SEC11-BP01 Train for application security
• SEC11-BP02 Automate testing throughout the development and release lifecycle
• SEC11-BP03 Perform regular penetration testing
• SEC11-BP04 Manual code reviews
• SEC11-BP05 Centralize services for packages and dependencies
• SEC11-BP06 Deploy software programmatically
• SEC11-BP07 Regularly assess security properties of the pipelines
• SEC11-BP08 Build a program that embeds security ownership in workload teams
Provide training to the builders in your organization on common practices for the secure
development and operation of applications. Adopting security focused development practices
helps reduce the likelihood of issues that are only detected at the security review stage.
Implementation steps
• Start builders with a course on threat modeling to build a good foundation, and help train them
on how to think about security.
• Provide access to AWS Training and Certification, industry, or AWS Partner training.
• Provide training on your organization's security review process, which clarifies the division of
responsibilities between the security team, workload teams, and other stakeholders.
• Publish self-service guidance on how to meet your security requirements, including code
examples and templates, if available.
• Regularly obtain feedback from builder teams on their experience with the security review
process and training, and use that feedback to improve.
• Use game days or bug bash campaigns to help reduce the number of issues, and increase the
skills of your builders.
Resources
Related documents:
Related videos:
Related examples:
Implementation guidance
As you build your software, adopt various mechanisms for software testing to ensure that you are
testing your application for both functional requirements, based on your application’s business
logic, and non-functional requirements, which are focused on application reliability, performance,
and security.
Static application security testing (SAST) analyzes your source code for anomalous security
patterns, and provides indications for defect prone code. SAST relies on static inputs, such as
documentation (requirements specification, design documentation, and design specifications)
and application source code to test for a range of known security issues. Static code analyzers
can help expedite the analysis of large volumes of code. The NIST Quality Group provides a
comparison of Source Code Security Analyzers, which includes open source tools for Byte Code
Scanners and Binary Code Scanners.
Complement your static testing with dynamic analysis security testing (DAST) methodologies,
which perform tests against the running application to identify potentially unexpected behavior.
Dynamic testing can be used to detect potential issues that are not detectable via static analysis.
Testing at the code repository, build, and pipeline stages allows you to catch different
types of potential issues before they enter your code. Amazon CodeWhisperer provides code
recommendations, including security scanning, in the builder’s IDE. Amazon CodeGuru Reviewer
can identify critical issues, security issues, and hard-to-find bugs during application development,
and provides recommendations to improve code quality.
The Security for Developers workshop uses AWS developer tools, such as AWS CodeBuild, AWS
CodeCommit, and AWS CodePipeline, for release pipeline automation that includes SAST and DAST
testing methodologies.
Related videos:
Related examples:
Perform regular penetration testing of your software. This mechanism helps identify potential
software issues that cannot be detected by automated testing or a manual code review. It can
also help you understand the efficacy of your detective controls. Penetration testing should try to
determine if the software can be made to perform in unexpected ways, such as exposing data that
should be protected, or granting broader permissions than expected.
Desired outcome: Penetration testing is used to detect, remediate, and validate your application’s
security properties. Regular and scheduled penetration testing should be performed as part of the
software development lifecycle (SDLC). The findings from penetration tests should be addressed
prior to the software being released. You should analyze the findings from penetration tests to
identify if there are issues that could be found using automation. Having a regular and repeatable
penetration testing process that includes an active feedback mechanism helps inform the guidance
to builders and improves software quality.
Common anti-patterns:
• Use tools to speed up the penetration testing process by automating common or repeatable
tests.
• Analyze penetration testing findings to identify systemic security issues, and use this data to
inform additional automated testing and ongoing builder education.
Resources
Related documents:
• AWS Penetration Testing provides detailed guidance for penetration testing on AWS
• Accelerate deployments on AWS with effective governance
• AWS Security Competency Partners
• Modernize your penetration testing architecture on AWS Fargate
• AWS Fault Injection Simulator
Related examples:
Perform a manual code review of the software that you produce. This process helps verify that the
person who wrote the code is not the only one checking the code quality.
Desired outcome: Including a manual code review step during development increases the quality
of the software being written, helps upskill less experienced members of the team, and provides
an opportunity to identify places where automation can be used. Manual code reviews can be
supported by automated tools and testing.
Common anti-patterns:
Resources
Related documents:
Related videos:
Related examples:
Provide centralized services for builder teams to obtain software packages and other
dependencies. This allows the validation of packages before they are included in the software
that you write, and provides a source of data for the analysis of the software being used in your
organization.
Desired outcome: Software is composed of a set of other software packages in addition to the
code that is being written. This makes it simple to consume implementations of functionality that
are repeatedly used, such as a JSON parser or an encryption library. Logically centralizing the
sources for these packages and dependencies provides a mechanism for security teams to validate
the properties of the packages before they are used. This approach also reduces the risk of an
unexpected issue being caused by a change in an existing package, or by builder teams including
arbitrary packages directly from the internet. Use this approach in conjunction with the manual code review practice described in SEC11-BP04.
• Regularly scan packages in your repository to identify the potential impact of newly discovered
issues.
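As a sketch of a centralized package source, the following hypothetical boto3 example fetches an AWS CodeArtifact authorization token and repository endpoint so that build tooling can be pointed at the curated repository instead of the public internet; the domain, account, and repository names are assumptions for illustration.

import boto3

codeartifact = boto3.client("codeartifact")

domain = "example-org"          # hypothetical CodeArtifact domain
owner = "111122223333"          # hypothetical domain owner account
repository = "approved-python"  # hypothetical curated repository

# Short-lived token used by pip/twine (or other package managers) to authenticate.
token = codeartifact.get_authorization_token(
    domain=domain, domainOwner=owner, durationSeconds=1800
)["authorizationToken"]

# Endpoint to configure as the package index for builds.
endpoint = codeartifact.get_repository_endpoint(
    domain=domain, domainOwner=owner, repository=repository, format="pypi"
)["repositoryEndpoint"]

index_url = endpoint.rstrip("/") + "/simple/"
print(f"pip index: {index_url} (authenticate with the token as the password, user 'aws')")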
Resources
Related documents:
Related videos:
Related examples:
Perform software deployments programmatically where possible. This approach reduces the
likelihood that a deployment fails or an unexpected issue is introduced due to human error.
Desired outcome: Keeping people away from data is a key principle of building securely in the AWS
Cloud. This principle includes how you deploy your software.
• Using AWS CodeBuild and AWS Code Pipeline to provide CI/CD capability makes it simple to
integrate security testing into your pipelines.
• Follow the guidance on separation of environments in the Organizing Your AWS Environment
Using Multiple Accounts whitepaper.
• Verify no persistent human access to environments where production workloads are running.
• Use cryptographic tools such as AWS Signer or AWS Key Management Service (AWS KMS) to sign
and verify the software packages that you are deploying.
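As a minimal sketch of the signing step above, the following hypothetical boto3 example signs a deployment artifact's digest with an asymmetric AWS KMS key and verifies it before deployment; the key ARN and artifact path are assumptions for illustration, and AWS Signer is the more integrated option for container images and Lambda code.

import hashlib
import boto3

kms = boto3.client("kms")
# Hypothetical asymmetric KMS key created with key usage SIGN_VERIFY.
key_arn = "arn:aws:kms:us-east-1:111122223333:key/example-signing-key"

# Hash the artifact locally and sign the digest with KMS.
digest = hashlib.sha256(open("build/artifact.zip", "rb").read()).digest()
signature = kms.sign(
    KeyId=key_arn,
    Message=digest,
    MessageType="DIGEST",
    SigningAlgorithm="RSASSA_PSS_SHA_256",
)["Signature"]

# Later, in the deployment step, verify the signature before installing the artifact.
result = kms.verify(
    KeyId=key_arn,
    Message=digest,
    MessageType="DIGEST",
    Signature=signature,
    SigningAlgorithm="RSASSA_PSS_SHA_256",
)
assert result["SignatureValid"]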
Resources
Related documents:
• Code signing using AWS Certificate Manager Private CA and AWS Key Management Service
asymmetric keys
Related videos:
Related examples:
pipeline implementation and analyzing logs for unexpected behavior can help you understand the
usage patterns of the pipelines being used to deploy software.
Implementation steps
Resources
Related documents:
Related examples:
Build a program or mechanism that empowers builder teams to make security decisions about the
software that they create. Your security team still needs to validate these decisions during a review,
but embedding security ownership in builder teams allows for faster, more secure workloads to be
built. This mechanism also promotes a culture of ownership that positively impacts the operation
of the systems you build.
Desired outcome: To embed security ownership and decision making in builder teams, you can
either train builders on how to think about security or you can augment their training with security
people embedded or associated with the builder teams. Either approach is valid and allows the
team to make higher quality security decisions earlier in the development cycle. This ownership
Implementation steps
Resources
Related documents:
Related videos:
Reliability
The Reliability pillar encompasses the ability of a workload to perform its intended function
correctly and consistently when it’s expected to. You can find prescriptive guidance on
implementation in the Reliability Pillar whitepaper.
automation remediation steps to verify that service quotas and constraints are not reached in ways
that could cause service degradation or disruption.
Common anti-patterns:
• Deploying a workload without understanding the hard or soft quotas and their limits for the
services used.
• Deploying a replacement workload without analyzing and reconfiguring the necessary quotas or
contacting Support in advance.
• Assuming that cloud services have no limits and the services can be used without consideration
to rates, limits, counts, quantities.
• Assuming that quotas will automatically be increased.
• Not knowing the process and timeline of quota requests.
• Assuming that the default cloud service quota is identical for every service and is the same across
Regions.
• Assuming that service constraints can be breached and the systems will auto-scale or
automatically increase limits beyond the resource’s constraints.
• Not testing the application at peak traffic in order to stress the utilization of its resources.
• Provisioning the resource without analysis of the required resource size.
• Overprovisioning capacity by choosing resource types that go well beyond actual need or
expected peaks.
• Not assessing capacity requirements for new levels of traffic in advance of a new customer event
or deploying a new technology.
Benefits of establishing this best practice: Monitoring and automated management of service
quotas and resource constraints can proactively reduce failures. Changes in traffic patterns for
a customer’s service can cause a disruption or degradation if best practices are not followed. By
monitoring and managing these values across all regions and all accounts, applications can have
improved resiliency under adverse or unplanned events.
Implementation guidance
Service Quotas is an AWS service that helps you manage your quotas for over 250 AWS services
from one location. Along with looking up the quota values, you can also request and track quota increases.
• The AWS Management Console provides methods to display service quota values, manage quotas,
request new quota increases, monitor the status of quota requests, and display quota history.
• The AWS CLI and CDKs offer programmatic methods to automatically manage and monitor service
quota levels and usage.
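As a sketch of the programmatic option, the following hypothetical boto3 example looks up a quota's applied value and requests an increase; the service and quota codes shown are examples, and you can discover the codes relevant to your workload with list_service_quotas.

import boto3

quotas = boto3.client("service-quotas")

# Example codes (look up the ones relevant to your workload with list_service_quotas).
service_code = "ec2"
quota_code = "L-1216C47A"  # Running On-Demand Standard instances (example quota code)

current = quotas.get_service_quota(ServiceCode=service_code, QuotaCode=quota_code)
print(current["Quota"]["QuotaName"], current["Quota"]["Value"])

# Request an increase; track it with list_requested_service_quota_change_history.
quotas.request_service_quota_increase(
    ServiceCode=service_code, QuotaCode=quota_code, DesiredValue=512.0
)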
Implementation steps
Related videos:
Related tools:
Implementation guidance
Service quotas are tracked per account. Unless otherwise noted, each quota is AWS Region-specific.
In addition to the production environments, also manage quotas in all applicable non-production
environments so that testing and development are not hindered. Maintaining a high degree of
resiliency requires that service quotas are assessed continually (whether automated or manual).
With more workloads spanning Regions due to the implementation of designs using Active/Active, Active/Passive-Hot, Active/Passive-Cold, and Active/Passive-Pilot Light approaches, it is essential to understand all Region and account quota levels. Past traffic patterns are not always a good indicator of whether the service quota is set correctly.
Equally important, the value of a given named service quota is not always the same in every Region. In one Region, the value could be five, and in another Region it could be ten. Management of these quotas must span all the same services, accounts, and Regions to provide consistent resilience under load.
Reconcile all service quota differences across Regions (active or passive) and create processes to continually reconcile these differences. Testing plans for passive Region failovers are rarely scaled to peak active capacity, which means that game day or tabletop exercises can fail to find differences in service quotas between Regions and to maintain the correct limits.
Service quota drift, the condition where the limit for a specific named quota is changed in one Region but not in all Regions, is very important to track and assess. Consider changing the quota in any Region that carries traffic or could potentially carry traffic.
• Select relevant accounts and Regions based on your service requirements, latency, regulatory,
and disaster recovery (DR) requirements.
• Identify service quotas across all relevant accounts, Regions, and Availability Zones. The limits
are scoped to account and Region. These values should be compared for differences.
Implementation steps
• Review service quota values that might have reached a risky level of usage. AWS Trusted Advisor provides alerts when usage breaches 80% and 90% of a quota.
• Review values for service quotas in any Passive Regions (in an Active/Passive design). Verify that
load will successfully run in secondary Regions in the event of a failure in the primary Region.
Related videos:
Related services:
Hard limits are shown in the Service Quotas console. If the ADJUSTABLE column shows No, the service has a hard limit. Hard limits are also shown on some resources' configuration pages. For example, Lambda has specific hard limits that cannot be adjusted.
As an example, when designing a Python application to run in a Lambda function, the application should be evaluated to determine whether there is any chance of Lambda running longer than 15 minutes. If the code might run longer than this service quota limit, alternate technologies or designs must be considered. If this limit is reached after production deployment, the application will suffer degradation and disruption until it can be remediated. Unlike soft quotas, there is no method to change these limits, even during emergency Severity 1 events.
Once the application has been deployed to a testing environment, strategies should be used to determine whether any hard limits can be reached. Stress testing, load testing, and chaos testing should be part of the introduction test plan.
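One way to surface hard limits during the design phase is to enumerate quotas for the services you plan to use and flag those that are not adjustable. The following is a minimal Python (boto3) sketch, assuming credentials are configured; the service codes listed are illustrative, and not every limit appears in Service Quotas, so treat the output as a starting point rather than a complete inventory.

import boto3

sq = boto3.client("service-quotas")
services_in_design = ["lambda", "dynamodb", "ec2"]  # illustrative service codes

for service in services_in_design:
    for page in sq.get_paginator("list_service_quotas").paginate(ServiceCode=service):
        for quota in page["Quotas"]:
            if not quota["Adjustable"]:
                # Hard limit: the design must stay within this value.
                print(f"{service}: {quota['QuotaName']} = {quota['Value']} (hard limit)")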
Implementation steps
• Review the complete list of AWS services that could be used in the application design phase.
• Review the soft quota limits and hard quota limits for all these services. Not all limits are shown
in the Service Quotas console. Some services describe these limits in alternate locations.
• As you design your application, review your workload’s business and technology drivers, such as business outcomes, use case, dependent systems, availability targets, and disaster recovery objectives. Let your business and technology drivers guide the process to identify the distributed system that is right for your workload.
• Analyze service load across Regions and accounts. Many hard limits are regionally based for
services. However, some limits are account based.
• Analyze resilience architectures for resource usage during a zonal failure and a Regional failure. In the progression of multi-Region designs using active/active, active/passive-hot, active/passive-cold, and active/passive-pilot light approaches, these failure cases will cause higher usage. This creates a potential scenario for hitting hard limits.
Resources
• Automating Service Limit Increases and Enterprise Support with AWS Control Tower
• Actions, resources, and condition keys for Service Quotas
Related videos:
Related tools:
• AWS CodeDeploy
• AWS CloudTrail
• Amazon CloudWatch
• Amazon EventBridge
• Amazon DevOps Guru
• AWS Config
• AWS Trusted Advisor
• AWS CDK
• AWS Systems Manager
• AWS Marketplace
Evaluate your potential usage and increase your quotas appropriately, allowing for planned growth
in usage.
Desired outcome: Active and automated systems that manage and monitor quotas have been deployed. These operational solutions detect when quota usage thresholds are close to being reached and proactively remediate by requesting quota changes.
Common anti-patterns:
• Capture your current quotas that are essential and applicable to the services using:
• AWS Service Quotas
• AWS Trusted Advisor
• AWS documentation
• AWS service-specific pages
• AWS Command Line Interface (AWS CLI)
• AWS Cloud Development Kit (AWS CDK)
• Use AWS Service Quotas, an AWS service that helps you manage your quotas for over 250 AWS
services from one location.
• Use Trusted Advisor service limits to monitor your current service limits at various thresholds.
• Use the service quota history (console or AWS CLI) to check on regional increases.
• Compare service quota changes in each Region and each account to create equivalency, if
required.
For management:
• Automated: Set up an AWS Config custom rule to scan service quotas across Regions and
compare for differences.
• Automated: Set up a scheduled Lambda function to scan service quotas across Regions and compare for differences (a minimal sketch follows this list).
• Manual: Use the AWS CLI, API, or AWS Management Console to scan service quotas across Regions and compare for differences. Report the differences.
• If differences in quotas are identified between Regions, request a quota change, if required.
• Review the result of all requests.
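The automated comparison described above can be as simple as a scheduled function that reads one named quota in each Region you operate in and reports any differences. A minimal Python (boto3) sketch follows; the Regions, service code, and quota code are examples to replace with your own, and a real solution would publish findings to Amazon SNS or a ticketing system rather than print them.

import boto3

REGIONS = ["us-east-1", "us-west-2"]            # example Regions
SERVICE_CODE = "ec2"                            # example service
QUOTA_CODE = "L-1216C47A"                       # example quota code; verify in the console

def scan_quota():
    values = {}
    for region in REGIONS:
        sq = boto3.client("service-quotas", region_name=region)
        quota = sq.get_service_quota(ServiceCode=SERVICE_CODE, QuotaCode=QUOTA_CODE)
        values[region] = quota["Quota"]["Value"]
    if len(set(values.values())) > 1:
        # Quota drift detected between Regions; report it for reconciliation.
        print("Quota drift detected:", values)
    return values

scan_quota()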
Resources
Related videos:
Related tools:
• AWS CodeDeploy
• AWS CloudTrail
• Amazon CloudWatch
• Amazon EventBridge
• Amazon DevOps Guru
• AWS Config
• AWS Trusted Advisor
• AWS CDK
• AWS Systems Manager
• AWS Marketplace
Implement tools to alert you when thresholds are being approached. You can automate quota
increase requests by using AWS Service Quotas APIs.
If you integrate your Configuration Management Database (CMDB) or ticketing system with Service
Quotas, you can automate the tracking of quota increase requests and current quotas. In addition
to the AWS SDK, Service Quotas offers automation using the AWS Command Line Interface (AWS
CLI).
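For example, a quota increase request and its subsequent tracking can be automated through the Service Quotas API, which is one way to feed a CMDB or ticketing integration. A minimal Python (boto3) sketch, assuming credentials are configured; the quota code and desired value are placeholders.

import boto3

sq = boto3.client("service-quotas")

# Request an increase (placeholder quota code and desired value).
request = sq.request_service_quota_increase(
    ServiceCode="lambda", QuotaCode="L-B99A9384", DesiredValue=2000
)["RequestedQuota"]
print("Request status:", request["Status"])

# Track open and recent requests, for example to sync with a CMDB or ticketing system.
history = sq.list_requested_service_quota_change_history(ServiceCode="lambda")
for item in history["RequestedQuotas"]:
    print(item["QuotaName"], item["DesiredValue"], item["Status"])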
Common anti-patterns:
• AWS Trusted Advisor Best Practice Checks (see the Service Limits section)
• Quota Monitor on AWS - AWS Solution
• Amazon EC2 Service Limits
• What is Service Quotas?
Related videos:
REL01-BP06 Ensure that a sufficient gap exists between the current quotas and the maximum
usage to accommodate failover
This article explains how to maintain space between the resource quota and your usage, and
how it can benefit your organization. After you finish using a resource, the usage quota may
continue to account for that resource. This can result in a failing or inaccessible resource. Prevent
resource failure by verifying that your quotas cover the overlap of inaccessible resources and their
replacements. Consider cases like network failure, Availability Zone failure, or Region failures when
calculating this gap.
Desired outcome: Small or large failures in resources or resource accessibility can be covered
within the current service thresholds. Zone failures, network failures, or even Regional failures have
been considered in the resource planning.
Common anti-patterns:
• Setting service quotas based on current needs without accounting for failover scenarios.
• Not considering the principles of static stability when calculating the peak quota for a service.
• Not considering the potential of inaccessible resources in calculating total quota needed for each
Region.
• Not considering AWS service fault isolation boundaries for some services and their potential
abnormal usage patterns.
Benefits of establishing this best practice: When service disruption events impact application
availability, use the cloud to implement strategies to recover from these events. An example
strategy is creating additional resources to replace inaccessible resources to accommodate failover
conditions without exhausting your service limit.
Resources
Related documents:
• AWS Marketplace
Workloads often exist in multiple environments. These include multiple cloud environments (both
publicly accessible and private) and possibly your existing data center infrastructure. Plans must
include network considerations such as intra- and intersystem connectivity, public IP address
management, private IP address management, and domain name resolution.
Best practices
• REL02-BP01 Use highly available network connectivity for your workload public endpoints
• REL02-BP02 Provision redundant connectivity between private networks in the cloud and on-
premises environments
• REL02-BP03 Ensure IP subnet allocation accounts for expansion and availability
• REL02-BP04 Prefer hub-and-spoke topologies over many-to-many mesh
• REL02-BP05 Enforce non-overlapping private IP address ranges in all private address spaces
where they are connected
REL02-BP01 Use highly available network connectivity for your workload public endpoints
Building highly available network connectivity to public endpoints of your workloads can help
you reduce downtime due to loss of connectivity and improve the availability and SLA of your
workload. To achieve this, use highly available DNS, content delivery networks (CDNs), API
gateways, load balancing, or reverse proxies.
Desired outcome: It is critical to plan, build, and operationalize highly available network
connectivity for your public endpoints. If your workload becomes unreachable due to a loss in
connectivity, even if your workload is running and available, your customers will see your system
as down. By combining the highly available and resilient network connectivity for your workload’s
public endpoints, along with a resilient architecture for your workload itself, you can provide the
best possible availability and service level for your customers.
AWS Global Accelerator, Amazon CloudFront, Amazon API Gateway, AWS Lambda Function URLs,
AWS AppSync APIs, and Elastic Load Balancing (ELB) all provide highly available public endpoints.
Amazon Route 53 provides a highly available DNS service for domain name resolution to verify
that your public endpoint addresses can be resolved.
instances. You can also use Amazon API Gateway along with AWS Lambda for a serverless solution.
Customers can also run workloads in multiple AWS Regions. With a multi-site active/active pattern, the workload can serve traffic from multiple Regions. With a multi-site active/passive pattern, the workload serves traffic from the active Region while data is replicated to the secondary Region, which becomes active in the event of a failure in the primary Region. Route 53 health checks can then be
used to control DNS failover from any endpoint in a primary Region to an endpoint in a secondary
Region, verifying that your workload is reachable and available to your users.
Amazon CloudFront provides a simple API for distributing content with low latency and high data
transfer rates by serving requests using a network of edge locations around the world. Content
delivery networks (CDNs) serve customers by delivering content that is located or cached at a location near the user. This also improves the availability of your application, as the load for content is shifted
away from your servers over to CloudFront’s edge locations. The edge locations and regional edge
caches hold cached copies of your content close to your viewers resulting in quick retrieval and
increasing reachability and availability of your workload.
For workloads with users spread out geographically, AWS Global Accelerator helps you improve
the availability and performance of the applications. AWS Global Accelerator provides Anycast
static IP addresses that serve as a fixed entry point to your application hosted in one or more
AWS Regions. This allows traffic to ingress onto the AWS global network as close to your users as
possible, improving reachability and availability of your workload. AWS Global Accelerator also
monitors the health of your application endpoints by using TCP, HTTP, and HTTPS health checks.
Any changes in the health or configuration of your endpoints permit redirection of user traffic to
healthy endpoints that deliver the best performance and availability to your users. In addition, AWS
Global Accelerator has a fault-isolating design that uses two static IPv4 addresses that are serviced
by independent network zones increasing the availability of your applications.
To help protect customers from DDoS attacks, AWS provides AWS Shield Standard. Shield Standard
comes automatically turned on and protects from common infrastructure (layer 3 and 4) attacks
like SYN/UDP floods and reflection attacks to support high availability of your applications
on AWS. For additional protections against more sophisticated and larger attacks (like UDP
floods), state exhaustion attacks (like TCP SYN floods), and to help protect your applications
running on Amazon Elastic Compute Cloud (Amazon EC2), Elastic Load Balancing (ELB), Amazon
CloudFront, AWS Global Accelerator, and Route 53, you can consider using AWS Shield Advanced.
For protection against Application layer attacks like HTTP POST or GET floods, use AWS WAF. AWS
WAF can use IP addresses, HTTP headers, HTTP body, URI strings, SQL injection, and cross-site
scripting conditions to determine if a request should be blocked or allowed.
protections from application layer HTTP POST and GET floods, review Getting started with AWS WAF. You can also use AWS WAF with CloudFront; see the documentation on how AWS WAF works with Amazon CloudFront features.
6. Set up additional DDoS protection: By default, all AWS customers receive protection from
common, most frequently occurring network and transport layer DDoS attacks that target
your web site or application with AWS Shield Standard at no additional charge. For additional
protection of internet-facing applications running on Amazon EC2, Elastic Load Balancing,
Amazon CloudFront, AWS Global Accelerator, and Amazon Route 53 you can consider AWS
Shield Advanced and review examples of DDoS resilient architectures. To protect your workload
and your public endpoints from DDoS attacks review Getting started with AWS Shield Advanced.
Resources
Related documents:
Implementation guidance
When using AWS Direct Connect to connect your on-premises network to AWS, you can achieve
maximum network resiliency (SLA of 99.99%) by using separate connections that end on distinct
devices in more than one on-premises location and more than one AWS Direct Connect location.
This topology offers resilience against device failures, connectivity issues, and complete location
outages. Alternatively, you can achieve high resiliency (SLA of 99.9%) by using two individual
connections to multiple locations (each on-premises location connected to a single Direct Connect
location). This approach protects against connectivity disruptions caused by fiber cuts or device
failures and helps mitigate complete location failures. The AWS Direct Connect Resiliency Toolkit
can assist in designing your AWS Direct Connect topology.
You can also consider AWS Site-to-Site VPN terminating on an AWS Transit Gateway as a cost-effective backup to your primary AWS Direct Connect connection. This setup enables equal-cost multipath (ECMP) routing across multiple VPN tunnels, allowing for aggregate throughput of up to 50 Gbps, even
though each VPN tunnel is capped at 1.25 Gbps. It's important to note, however, that AWS Direct
Connect is still the most effective choice for minimizing network disruptions and providing stable
connectivity.
When using VPNs over the internet to connect your cloud environment to your on-premises data
center, configure two VPN tunnels as part of a single Site-to-Site VPN connection. Each tunnel should terminate in a different Availability Zone for high availability and use redundant hardware
to prevent on-premises device failure. Additionally, consider multiple internet connections
from various internet service providers (ISPs) at your on-premises location to avoid complete
VPN connectivity disruption due to a single ISP outage. Selecting ISPs with diverse routing and
infrastructure, especially those with separate physical paths to AWS endpoints, provides high
connectivity availability.
In addition to physical redundancy with multiple AWS Direct Connect connections and multiple
VPN tunnels (or a combination of both), implementing Border Gateway Protocol (BGP) dynamic
routing is also crucial. Dynamic BGP provides automatic rerouting of traffic from one path to
another based on real-time network conditions and configured policies. This dynamic behavior
is especially beneficial in maintaining network availability and service continuity in the event of
link or network failures. It quickly selects alternative paths, enhancing the network's resilience and
reliability.
Implementation steps
Related videos:
• AWS re:Invent 2018: Advanced VPC Design and New Capabilities for Amazon VPC
• AWS re:Invent 2019: AWS Transit Gateway reference architectures for many VPCs
Amazon VPC IP address ranges must be large enough to accommodate workload requirements,
including factoring in future expansion and allocation of IP addresses to subnets across Availability
Zones. This includes load balancers, EC2 instances, and container-based applications.
When you plan your network topology, the first step is to define the IP address space itself. Private
IP address ranges (following RFC 1918 guidelines) should be allocated for each VPC. Accommodate
the following requirements as part of this process:
• Allow IP address space for more than one VPC per Region.
• Within a VPC, allow space for multiple subnets so that you can cover multiple Availability Zones.
• Consider leaving unused CIDR block space within a VPC for future expansion.
• Ensure that there is IP address space to meet the needs of any transient fleets of Amazon EC2
instances that you might use, such as Spot Fleets for machine learning, Amazon EMR clusters, or
Amazon Redshift clusters. Similar consideration should be given to Kubernetes clusters, such as
Amazon Elastic Kubernetes Service (Amazon EKS), as each Kubernetes pod is assigned a routable
address from the VPC CIDR block by default.
• Note that the first four IP addresses and the last IP address in each subnet CIDR block are
reserved and not available for your use.
• Note that the initial VPC CIDR block allocated to your VPC cannot be changed or deleted, but
you can add additional non-overlapping CIDR blocks to the VPC. Subnet IPv4 CIDRs cannot be
changed, however IPv6 CIDRs can.
• The largest possible VPC CIDR block is a /16, and the smallest is a /28.
• Consider other connected networks (VPC, on-premises, or other cloud providers) and ensure non-
overlapping IP address space. For more information, see REL02-BP05 Enforce non-overlapping
private IP address ranges in all private address spaces where they are connected.
Desired outcome: A scalable IP subnet plan helps you accommodate future growth and avoid unnecessary waste.
Resources
• REL02-BP05 Enforce non-overlapping private IP address ranges in all private address spaces
where they are connected
Related documents:
Related videos:
• AWS re:Invent 2018: Advanced VPC Design and New Capabilities for Amazon VPC (NET303)
• AWS re:Invent 2019: AWS Transit Gateway reference architectures for many VPCs (NET406-R1)
• AWS re:Invent 2023: AWS Ready for what's next? Designing networks for growth and flexibility
(NET310)
When connecting multiple private networks, such as Virtual Private Clouds (VPCs) and on-premises
networks, opt for a hub-and-spoke topology over a meshed one. Unlike meshed topologies, where
each network connects directly to the others and increases the complexity and management
hub-and-spoke topologies. When you use AWS Transit Gateway, you can establish connections and
centralize traffic routing across multiple networks.
Implementation guidance
• If needed, create VPN connections or Direct Connect gateways and associate them with the
Transit Gateway.
• Define how traffic is routed between the connected VPCs and other connections through
configuration of your Transit Gateway route tables.
• Use Amazon CloudWatch to monitor and adjust configurations as necessary for performance and
cost optimization.
Resources
Related documents:
Related videos:
Implementation steps
Resources
• Protecting networks
Related documents:
Related videos:
workload built to scale from the start needs. When refactoring an existing monolith, you will
need to consider how well the application will support a decomposition towards statelessness.
Breaking services into smaller pieces allows small, well-defined teams to develop and manage
them. However, smaller services can introduce complexities which include possible increased
latency, more complex debugging, and increased operational burden.
Common anti-patterns:
• The microservice Death Star is a situation in which the atomic components become so highly
interdependent that a failure of one results in a much larger failure, making the components as
rigid and fragile as a monolith.
Benefits of establishing this best practice:
• More specific segments lead to greater agility, organizational flexibility, and scalability.
• Reduced impact of service interruptions.
• Application components may have different availability requirements, which can be supported by
a more atomic segmentation.
• Well-defined responsibilities for teams supporting the workload.
Implementation guidance
Choose your architecture type based on how you will segment your workload. Choose an SOA or
microservices architecture (or in some rare cases, a monolithic architecture). Even if you choose
to start with a monolith architecture, you must ensure that it’s modular and can ultimately
evolve to SOA or microservices as your product scales with user adoption. SOA and microservices
offer, respectively, smaller degrees of segmentation, which is preferred for a modern, scalable, and reliable architecture, but there are trade-offs to consider, especially when deploying a microservice
architecture.
One primary trade-off is that you now have a distributed compute architecture that can make it
harder to achieve user latency requirements and there is additional complexity in the debugging
and tracing of user interactions. You can use AWS X-Ray to assist you in solving this problem.
Another effect to consider is increased operational complexity as you increase the number of
applications that you are managing, which requires the deployment of multiple independent components.
• For existing monoliths with a single, shared database, choose how to reorganize the data into
smaller segments. This could be by business unit, access pattern, or data structure. At this point
in the refactoring process, you should choose to move forward with a relational or non-relational
(NoSQL) type of database. For more details, see From SQL to NoSQL.
Resources
Related documents:
Related examples:
Related videos:
focused on business domains. When working with existing monolithic applications, you can
take advantage of decomposition patterns that provide established techniques to modernize
applications into services.
Domain-driven design
Implementation steps
• Teams can hold event storming workshops to quickly identify events, commands, aggregates and
domains in a lightweight sticky note format.
• Once domain entities and functions have been formed in a domain context, you can divide your
domain into services using bounded context, where entities that share similar features and
attributes are grouped together. With the model divided into contexts, a template for how to
draw microservice boundaries emerges.
• For example, the Amazon.com website entities might include package, delivery, schedule,
price, discount, and currency.
• Package, delivery, and schedule are grouped into the shipping context, while price, discount,
and currency are grouped into the pricing context.
• Decomposing monoliths into microservices outlines patterns for refactoring microservices. Using
patterns for decomposition by business capability, subdomain, or transaction aligns well with
domain-driven approaches.
• Tactical techniques such as the bubble context allow you to introduce DDD in existing or legacy
applications without up-front rewrites and full commitments to DDD. In a bubble context
approach, a small bounded context is established using a service mapping and coordination, or
anti-corruption layer, which protects the newly defined domain model from external influences.
After teams have performed domain analysis and defined entities and service contracts, they can
take advantage of AWS services to implement their domain-driven design as cloud-based services.
• Microservices
• Test-driven development
• Behavior-driven development
Related examples:
Related tools:
Service contracts are documented agreements between API producers and consumers defined in
a machine-readable API definition. A contract versioning strategy allows consumers to continue
using the existing API and migrate their applications to a newer API when they are ready. Producer
deployment can happen any time as long as the contract is followed. Service teams can use the
technology stack of their choice to satisfy the API contract.
Desired outcome: Applications built with service-oriented or microservice architectures are able to
operate independently while having integrated runtime dependency. Changes deployed to an API
consumer or producer do not interrupt the stability of the overall system when both sides follow a
common API contract. Components that communicate over service APIs can perform independent
functional releases, upgrades to runtime dependencies, or fail over to a disaster recovery (DR) site
with little or no impact to each other. In addition, discrete services are able to independently scale
absorbing resource demand without requiring other services to scale in unison.
Common anti-patterns:
• Creating service APIs without strongly typed schemas. This results in APIs that cannot be used to
generate API bindings and payloads that can’t be programmatically validated.
• Not adopting a versioning strategy, which forces API consumers to update and release or fail
when service contracts evolve.
• Importing an OpenAPI definition simplifies the creation of your API and can be integrated with
AWS infrastructure as code tools like the AWS Serverless Application Model and AWS Cloud
Development Kit (AWS CDK).
• Exporting an API definition simplifies integrating with API testing tools and provides service consumers with an integration specification.
• You can define and manage GraphQL APIs with AWS AppSync by defining a GraphQL schema file
to generate your contract interface and simplify interaction with complex REST models, multiple
database tables or legacy services.
• AWS Amplify projects that are integrated with AWS AppSync generate strongly typed JavaScript
query files for use in your application as well as an AWS AppSync GraphQL client library for
Amazon DynamoDB tables.
• When you consume service events from Amazon EventBridge, events adhere to schemas that
already exist in the schema registry or that you define with the OpenAPI Spec. With a schema
defined in the registry, you can also generate client bindings from the schema contract to
integrate your code with events.
• Extend or version your API. Extending an API is a simpler option when the added fields can be optional or can be given default values for required fields.
• JSON based contracts for protocols like REST and GraphQL can be a good fit for contract
extension.
• XML based contracts for protocols like SOAP should be tested with service consumers to
determine the feasibility of contract extension.
• When versioning an API, consider implementing proxy versioning where a facade is used to
support versions so that logic can be maintained in a single codebase.
• With API Gateway you can use request and response mappings to simplify absorbing contract
changes by establishing a facade to provide default values for new fields or to strip removed
fields from a request or response. With this approach the underlying service can maintain a
single codebase.
Resources
Best practices
• REL04-BP01 Identify the kind of distributed systems you depend on
• REL04-BP02 Implement loosely coupled dependencies
• REL04-BP03 Do constant work
• REL04-BP04 Make all responses idempotent
Desired outcome: Design a workload that effectively interacts with synchronous, asynchronous,
and batch dependencies.
Common anti-patterns:
• Workload waits indefinitely for a response from its dependencies, which could lead to workload
clients timing out, not knowing if their request has been received.
• Workload uses a chain of dependent systems that call each other synchronously. This requires
each system to be available and to successfully process a request before the whole chain can
succeed, leading to potentially brittle behavior and reduced overall availability.
• Workload communicates with its dependencies asynchronously and relies on the concept of exactly-once guaranteed delivery of messages, when often it is still possible to receive duplicate
messages.
• Your workload should not rely on multiple synchronous dependencies to perform a single
function. This chain of dependencies increases overall brittleness because all dependencies in the
pathway need to be available in order for the request to complete successfully.
• When a dependency is unhealthy or unavailable, determine your error handling and retry
strategies. Avoid using bimodal behavior. Bimodal behavior is when your workload exhibits
different behavior under normal and failure modes. For more details on bimodal behavior, see
REL11-BP05 Use static stability to prevent bimodal behavior.
• Keep in mind that failing fast is better than making your workload wait. For instance, the AWS
Lambda Developer Guide describes how to handle retries and failures when you invoke Lambda
functions.
• Set timeouts when your workload calls its dependency. This technique avoids waiting too long or
waiting indefinitely for a response. For helpful discussion of this issue, see Tuning AWS Java SDK
HTTP request settings for latency-aware Amazon DynamoDB applications.
• Minimize the number of calls made from your workload to its dependency to fulfill a single
request. Having chatty calls between them increases coupling and latency.
Asynchronous dependency
To temporally decouple your workload from its dependency, they should communicate
asynchronously. Using an asynchronous approach, your workload can continue with any other
processing without having to wait for its dependency, or chain of dependencies, to send a
response.
When your workload needs to communicate asynchronously with its dependency, consider the
following guidance:
• Determine whether to use messaging or event streaming based on your use case and
requirements. Messaging allows your workload to communicate with its dependency by sending
and receiving messages through a message broker. Event streaming allows your workload and
its dependency to use a streaming service to publish and subscribe to events, delivered as
continuous streams of data, that need to be processed as soon as possible.
• Messaging and event streaming handle messages differently so you need to make trade-off
decisions based on:
• Message priority: message brokers can process high-priority messages ahead of normal
messages. In event streaming, all messages have the same priority.
• Define the time window when your workload should run the batch job. Your workload can set
up a recurrence pattern to invoke a batch system, for example, every hour or at the end of every
month (a minimal scheduling sketch follows this list).
• Determine the location of the data input and the processed data output. Choose a storage
service, such as Amazon Simple Storage Services (Amazon S3), Amazon Elastic File System
(Amazon EFS), and Amazon FSx for Lustre, that allows your workload to read and write files at
scale.
• If your workload needs to invoke multiple batch jobs, you could leverage AWS Step Functions
to simplify the orchestration of batch jobs that run in AWS or on-premises. This sample project
demonstrates orchestration of batch jobs using Step Functions, AWS Batch, and Lambda.
• Monitor batch jobs to look for abnormalities, such as a job taking longer than it should to
complete. You could use tools like CloudWatch Container Insights to monitor AWS Batch
environments and jobs. In this instance, your workload would stop the next job from beginning
and inform the relevant staff of the exception.
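As a sketch of the scheduling step above, an Amazon EventBridge rule can invoke an orchestration target, such as an AWS Step Functions state machine, on a recurrence pattern. The following Python (boto3) example is illustrative only; the ARNs and schedule are placeholders, and the rule assumes the state machine and an IAM role that EventBridge can assume already exist.

import boto3

events = boto3.client("events")

# Placeholder ARNs for an existing state machine and the role EventBridge assumes to start it.
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:111122223333:stateMachine:nightly-batch"
EVENTS_ROLE_ARN = "arn:aws:iam::111122223333:role/eventbridge-invoke-stepfunctions"

# Run the batch workflow every day at 02:00 UTC.
events.put_rule(Name="nightly-batch-window", ScheduleExpression="cron(0 2 * * ? *)")
events.put_targets(
    Rule="nightly-batch-window",
    Targets=[{"Id": "batch-orchestrator", "Arn": STATE_MACHINE_ARN, "RoleArn": EVENTS_ROLE_ARN}],
)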
Resources
Related documents:
To further improve resiliency through loose coupling, make component interactions asynchronous
where possible. This model is suitable for any interaction that does not need an immediate
response and where an acknowledgment that a request has been registered will suffice. It involves
one component that generates events and another that consumes them. The two components
do not integrate through direct point-to-point interaction but usually through an intermediate
durable storage layer, such as an Amazon SQS queue, a streaming data platform such as Amazon
Kinesis, or AWS Step Functions.
Figure 4: Dependencies such as queuing systems and load balancers are loosely coupled
Amazon SQS queues and AWS Step Functions are just two ways to add an intermediate layer
for loose coupling. Event-driven architectures can also be built in the AWS Cloud using Amazon
EventBridge, which can abstract clients (event producers) from the services they rely on (event
consumers). Amazon Simple Notification Service (Amazon SNS) is an effective solution when you
need high-throughput, push-based, many-to-many messaging. Using Amazon SNS topics, your
publisher systems can fan out messages to a large number of subscriber endpoints for parallel
processing.
While queues offer several advantages, in most hard real-time systems, requests older than a
threshold time (often seconds) should be considered stale (the client has given up and is no longer waiting for a response) and should not be processed.
Implementation steps
• Components in an event-driven architecture are initiated by events. Events are actions that
happen in a system, such as a user adding an item to a cart. When an action is successful, an
event is generated that actuates the next component of the system.
• Building Event-driven Applications with Amazon EventBridge
• AWS re:Invent 2022 - Designing Event-Driven Integrations using Amazon EventBridge
• Distributed messaging systems have three main parts that need to be implemented for a queue
based architecture. They include components of the distributed system, the queue that is used
for decoupling (distributed on Amazon SQS servers), and the messages in the queue. A typical
system has producers which initiate the message into the queue, and the consumer which
receives the message from the queue. The queue stores messages across multiple Amazon SQS
servers for redundancy (a minimal producer and consumer sketch follows this list).
• Basic Amazon SQS architecture
• Send Messages Between Distributed Applications with Amazon Simple Queue Service
• Microservices, when well-utilized, enhance maintainability and boost scalability, as loosely
coupled components are managed by independent teams. It also allows for the isolation of
behaviors to a single component in case of changes.
• Implementing Microservices on AWS
• Let's Architect! Architecting microservices with containers
• With AWS Step Functions you can build distributed applications, automate processes, orchestrate
microservices, among other things. The orchestration of multiple components into an automated
workflow allows you to decouple dependencies in your application.
• Create a Serverless Workflow with AWS Step Functions and AWS Lambda
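A minimal producer and consumer sketch for the queue-based decoupling described in the list above, written in Python (boto3). It assumes a queue already exists; the queue name and message body are placeholders, and the processing step is a stand-in for your own logic.

import boto3

sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="orders-queue")["QueueUrl"]  # placeholder queue name

# Producer: the upstream component only needs the queue to be available, not the consumer.
sqs.send_message(QueueUrl=queue_url, MessageBody='{"order_id": "1234"}')

# Consumer: polls independently and deletes messages only after successful processing.
response = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20)
for message in response.get("Messages", []):
    print("processing", message["Body"])  # replace with your processing logic
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])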
Workload architecture 507
AWS Well-Architected Framework Framework
For example, if the health check system is monitoring 100,000 servers, the load on it is nominal
under the normally light server failure rate. However, if a major event makes half of those servers
unhealthy, then the health check system would be overwhelmed trying to update notification
systems and communicate state to its clients. So instead the health check system should send
the full snapshot of the current state each time. 100,000 server health states, each represented
by a bit, would only be a 12.5-KB payload. Whether no servers are failing, or all of them are, the
health check system is doing constant work, and large, rapid changes are not a threat to the system
stability. This is actually how Amazon Route 53 handles health checks for endpoints (such as IP
addresses) to determine how end users are routed to them.
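The arithmetic behind the 12.5-KB figure, and the reason the payload never changes size, can be sketched as follows: each server contributes one bit regardless of its health, so 100,000 servers require 100000 / 8 = 12,500 bytes. The Python illustration below is a simplified teaching example, not Route 53's implementation.

def health_snapshot(health_states):
    """Pack a list of booleans (True = healthy) into a fixed-size byte payload."""
    payload = bytearray((len(health_states) + 7) // 8)
    for i, healthy in enumerate(health_states):
        if healthy:
            payload[i // 8] |= 1 << (i % 8)
    return bytes(payload)

# The payload is 12,500 bytes whether all servers are healthy or all are failing.
snapshot = health_snapshot([True] * 100000)
print(len(snapshot))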
Implementation guidance
• Do constant work so that systems do not fail when there are large, rapid changes in load.
• Implement loosely coupled dependencies. Dependencies such as queuing systems, streaming
systems, workflows, and load balancers are loosely coupled. Loose coupling helps isolate
behavior of a component from other components that depend on it, increasing resiliency and
agility.
• The Amazon Builders' Library: Reliability, constant work, and a good cup of coffee
• AWS re:Invent 2018: Close Loops and Opening Minds: How to Take Control of Systems, Big and
Small ARC337 (includes constant work)
• For the example of a health check system monitoring 100,000 servers, engineer workloads
so that payload sizes remain constant regardless of number of successes or failures.
Resources
Related documents:
• The Amazon Builders' Library: Reliability, constant work, and a good cup of coffee
Related videos:
• AWS New York Summit 2019: Intro to Event-driven Architectures and Amazon EventBridge
(MAD205)
• AWS re:Invent 2018: Close Loops and Opening Minds: How to Take Control of Systems, Big and
Small ARC337 (includes loose coupling, constant work, static stability)
Best practices
critical functionality of the component to callers or customers. These considerations can become
additional requirements that can be tested and verified. Ideally, a component is able to perform its
core function in an acceptable manner even when one or multiple dependencies fail.
This is as much a business discussion as a technical one. All business requirements are important
and should be fulfilled if possible. However, it still makes sense to ask what should happen when
not all of them can be fulfilled. A system can be designed to be available and consistent, but
under circumstances where one requirement must be dropped, which one is more important? For
payment processing, it might be consistency. For a real-time application, it might be availability.
For a customer facing website, the answer may depend on customer expectations.
What this means depends on the requirements of the component and what should be considered
its core function. For example:
• An ecommerce website might display data from multiple different systems like personalized
recommendations, highest ranked products, and status of customer orders on the landing
page. When one upstream system fails, it still makes sense to display everything else instead of
showing an error page to a customer.
• A component performing batch writes can still continue processing a batch if one of the
individual operations fails. It should be simple to implement a retry mechanism. This can be
done by returning information on which operations succeeded, which failed, and why they failed
to the caller, or putting failed requests into a dead letter queue to implement asynchronous
retries. Information about failed operations should be logged as well.
• A system that processes transactions must verify that either all or no individual updates are
executed. For distributed transactions, the saga pattern can be used to roll back previous
operations in case a later operation of the same transaction fails. Here, the core function is
maintaining consistency.
• Time critical systems should be able to deal with dependencies not responding in a timely
manner. In these cases, the circuit breaker pattern can be used. When responses from a
dependency start timing out, the system can switch to an open state in which no additional calls are made.
• An application may read parameters from a parameter store. It can be useful to create container
images with a default set of parameters and use these in case the parameter store is unavailable.
Note that the pathways taken in case of component failure need to be tested and should be
significantly simpler than the primary pathway. Generally, fallback strategies should be avoided.
Resources
Related documents:
Related videos:
• Retry, backoff, and jitter: AWS re:Invent 2019: Introducing The Amazon Builders’ Library
(DOP328)
Related examples:
• Well-Architected lab: Level 300: Implementing Health Checks and Managing Dependencies to
Improve Reliability
Desired outcome: Large volume spikes either from sudden customer traffic increases, flooding
attacks, or retry storms are mitigated by request throttling, allowing workloads to continue normal
processing of supported request volume.
Common anti-patterns:
• API endpoint throttles are not implemented or are left at default values without considering
expected volumes.
Amazon API Gateway implements the token bucket algorithm according to account and region
limits and can be configured per-client with usage plans. Additionally, Amazon Simple Queue
Service (Amazon SQS) and Amazon Kinesis can buffer requests to smooth out the request rate, and
allow higher throttling rates for requests that can be addressed. Finally, you can implement rate
limiting with AWS WAF to throttle specific API consumers that generate unusually high load.
Implementation steps
You can configure API Gateway with throttling limits for your APIs and return 429 Too Many
Requests errors when limits are exceeded. You can use AWS WAF with your AWS AppSync and
API Gateway endpoints to enable rate limiting on a per IP address basis. Additionally, where your
system can tolerate asynchronous processing, you can put messages into a queue or stream to
speed up responses to service clients, which allows you to burst to higher throttle rates.
With asynchronous processing, when you’ve configured Amazon SQS as an event source for AWS
Lambda, you can configure maximum concurrency to avoid high event rates from consuming
available account concurrent execution quota needed for other services in your workload or
account.
While API Gateway provides a managed implementation of the token bucket, in cases where
you cannot use API Gateway, you can take advantage of language specific open-source
implementations (see related examples in Resources) of the token bucket for your services.
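Where API Gateway cannot be used, the token bucket can also be written directly. The following is a minimal, single-process Python sketch for illustration; production services typically rely on tested open-source implementations, as noted above, and a distributed workload would need shared state.

import time

class TokenBucket:
    """Allow up to `rate` requests per second with bursts up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should respond with 429 Too Many Requests

bucket = TokenBucket(rate=100, capacity=200)
if not bucket.allow():
    pass  # throttle the request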
• Understand and configure API Gateway throttling limits at the account level per region, API per
stage, and API key per usage plan levels.
Related videos:
Related tools:
Use exponential backoff to retry requests at progressively longer intervals between each retry.
Introduce jitter between retries to randomize retry intervals. Limit the maximum number of retries.
Desired outcome: Typical components in a distributed software system include servers, load
balancers, databases, and DNS servers. During normal operation, these components can respond
to requests with errors that are temporary or limited, and also errors that would be persistent
regardless of retries. When clients make requests to services, the requests consume resources
including memory, threads, connections, ports, or any other limited resources. Controlling and
limiting retries is a strategy to release and minimize consumption of resources so that system
components under strain are not overwhelmed.
When client requests time out or receive error responses, they should determine whether or not
to retry. If they do retry, they do so with exponential backoff with jitter and a maximum retry
value. As a result, backend services and processes are given relief from load and time to self-heal,
resulting in faster recovery and successful request servicing.
when calling services that are idempotent and where retries improve your client availability. Decide
what the timeouts are and when to stop retrying based on your use case. Build and exercise testing
scenarios for those retry use cases.
Implementation steps
• Determine the optimal layer in your application stack to implement retries for the services your
application relies on.
• Be aware of existing SDKs that implement proven retry strategies with exponential backoff and
jitter for your language of choice, and favor these over writing your own retry implementations.
• Verify that services are idempotent before implementing retries. Once retries are implemented,
be sure they are both tested and regularly exercised in production.
• When calling AWS service APIs, use the AWS SDKs and AWS CLI and understand the retry
configuration options. Determine if the defaults work for your use case, test, and adjust as
needed.
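For AWS SDK calls, the retry behavior described above can be configured rather than hand-written. A minimal Python (boto3) sketch follows; the retry mode, attempt count, and table name are examples to adjust and test for your use case.

import boto3
from botocore.config import Config

# The standard and adaptive retry modes apply exponential backoff with jitter and cap attempts.
retry_config = Config(retries={"max_attempts": 5, "mode": "adaptive"})

dynamodb = boto3.client("dynamodb", config=retry_config)
# Throttled or transient errors on this call are retried with backoff and jitter by the SDK.
dynamodb.describe_table(TableName="example-table")  # placeholder table name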
Resources
Related documents:
Related examples:
• Spring Retry
• Not clearing backlogged messages from a queue, when there is no value in processing these
messages if the business need no longer exists.
• Configuring first in first out (FIFO) queues when last in first out (LIFO) queues would better serve
client needs, for example when strict ordering is not required and backlog processing is delaying
all new and time sensitive requests resulting in all clients experiencing breached service levels.
• Exposing internal queues to clients instead of exposing APIs that manage work intake and place
requests into internal queues.
• Combining too many work request types into a single queue which can exacerbate backlog
conditions by spreading resource demand across request types.
• Processing complex and simple requests in the same queue, despite needing different
monitoring, timeouts and resource allocations.
• Not validating inputs or using assertions to implement fail fast mechanisms in software that
bubble up exceptions to higher level components that can handle errors gracefully.
• Not removing faulty resources from request routing, especially when failures are gray, emitting both successes and failures due to crashing and restarting, intermittent dependency failure,
reduced capacity, or network packet loss.
Benefits of establishing this best practice: Systems that fail fast are easier to debug and fix, and
often expose issues in coding and configuration before releases are published into production.
Systems that incorporate effective queueing strategies provide greater resilience and reliability to
traffic spikes and intermittent system fault conditions.
Implementation guidance
Fail fast strategies can be coded into software solutions as well as configured into infrastructure.
In addition to failing fast, queues are a straightforward yet powerful architectural technique to
decouple system components and smooth load. Amazon CloudWatch provides capabilities to monitor
for and alarm on failures. Once a system is known to be failing, mitigation strategies can be
invoked, including failing away from impaired resources. When systems implement queues with
Amazon SQS and other queue technologies to smooth load, they must consider how to manage
queue backlogs, as well as message consumption failures.
Related examples:
Related videos:
Related tools:
• Amazon SQS
• Amazon MQ
• AWS IoT Core
• Amazon CloudWatch
Set timeouts appropriately on connections and requests, verify them systematically, and do not
rely on default values as they are not aware of workload specifics.
Desired outcome: Client timeouts should consider the cost to the client, server, and workload
associated with waiting for requests that take abnormal amounts of time to complete. Since it is
not possible to know the exact cause of any timeout, clients must use knowledge of services to
develop expectations of probable causes and appropriate timeouts.
Client connections time out based on configured values. After encountering a timeout, clients make
decisions to back off and retry or open a circuit breaker. These patterns avoid issuing requests that
may exacerbate an underlying error condition.
Common anti-patterns:
Services should also protect themselves from abnormally expensive content with throttles and
server-side timeouts.
• Requests that take abnormally long due to a service impairment can be timed out and retried.
Consideration should be given to service costs for the request and retry, but if the cause is
a localized impairment, a retry is not likely to be expensive and will reduce client resource
consumption. The timeout may also release server resources depending on the nature of the
impairment.
• Requests that take a long time to complete because the request or response has failed to be
delivered by the network can be timed out and retried. Because the request or response was
not delivered, failure would have been the outcome regardless of the length of timeout. Timing
out in this case will not release server resources, but it will release client resources and improve
workload performance.
Take advantage of well-established design patterns like retries and circuit breakers to handle
timeouts gracefully and support fail-fast approaches. AWS SDKs and AWS CLI allow for
configuration of both connection and request timeouts and for retries with exponential backoff
and jitter. AWS Lambda functions support configuration of timeouts, and with AWS Step Functions,
you can build low code circuit breakers that take advantage of pre-built integrations with AWS
services and SDKs. AWS App Mesh Envoy provides timeout and circuit breaker capabilities.
Implementation steps
• Configure timeouts on remote service calls and take advantage of built-in language timeout
features or open source timeout libraries.
• When your workload makes calls with an AWS SDK, review the documentation for language
specific timeout configuration.
• Python
• PHP
• .NET
• Ruby
• Java
• Go
• Node.js
• C++
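As one concrete example of the SDK guidance above, boto3 exposes both connection and request (read) timeouts alongside retry settings. The values in this sketch are illustrative and should be derived from your workload's latency expectations rather than copied as-is.

import boto3
from botocore.config import Config

# Fail fast: do not rely on defaults, which are not aware of workload specifics.
timeout_config = Config(
    connect_timeout=2,   # seconds to establish the connection
    read_timeout=5,      # seconds to wait for a response
    retries={"max_attempts": 3, "mode": "standard"},
)

s3 = boto3.client("s3", config=timeout_config)
s3.list_buckets()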
Related examples:
• Using the circuit breaker pattern with AWS Step Functions and Amazon DynamoDB
Related tools:
• AWS SDKs
• AWS Lambda
• Amazon SQS
Systems should either not require state, or should offload state such that between different client
requests, there is no dependence on locally stored data on disk and in memory. This allows servers
to be replaced at will without causing an availability impact.
When users or services interact with an application, they often perform a series of interactions that
form a session. A session is unique data for users that persists between requests while they use
the application. A stateless application is an application that does not need knowledge of previous
interactions and does not store session information.
Once designed to be stateless, you can then use serverless compute services, such as AWS Lambda
or AWS Fargate.
• Design a stateless architecture after you identify which state and user data need to be persisted
with your storage solution of choice.
Resources
Related documents:
Emergency levers are rapid processes that can mitigate availability impact on your workload.
Desired outcome: By implementing emergency levers, you can establish known-good processes
to maintain the availability of critical components in your workload. The workload should degrade
gracefully and continue to perform its business-critical functions during the activation of an
emergency lever. For more detail on graceful degradation, see REL05-BP01 Implement graceful
degradation to transform applicable hard dependencies into soft dependencies.
Common anti-patterns:
• Not testing or verifying critical component behavior during non-critical component impairment.
• No clear and deterministic criteria defined for activation or deactivation of an emergency lever.
• Finding the right metrics to monitor depends on your workload. Some example metrics are
latency or the number of failed requests to a dependency.
• Define the procedures, manual or automated, that comprise the emergency lever.
• This may include mechanisms such as load shedding, throttling requests, or implementing
graceful degradation.
Resources
Related documents:
• Any Day Can Be Prime Day: How Amazon.com Search Uses Chaos Engineering to Handle Over
84K Requests Per Second
Related videos:
Change management
Questions
AWS makes an abundance of monitoring and log information available for consumption that can
be used to define workload-specific metrics and change-in-demand processes, and to adopt machine learning techniques regardless of ML expertise.
In addition, monitor all of your external endpoints to ensure that they are independent of your
base implementation. This active monitoring can be done with synthetic transactions (sometimes
referred to as user canaries, but not to be confused with canary deployments) which periodically
run a number of common tasks matching actions performed by clients of the workload. Keep
these tasks short in duration and be sure not to overload your workload during testing. Amazon
CloudWatch Synthetics allows you to create synthetic canaries to monitor your endpoints and APIs.
You can also combine the synthetic canary client nodes with AWS X-Ray console to pinpoint which
synthetic canaries are experiencing issues with errors, faults, or throttling rates for the selected
time frame.
Desired outcome:
Collect and use critical metrics from all components of the workload to ensure workload reliability
and optimal user experience. Detecting that a workload is not achieving business outcomes allows
you to quickly declare a disaster and recover from an incident.
Common anti-patterns:
Benefits of establishing this best practice: Monitoring at all tiers in your workload allows you to
more rapidly anticipate and resolve problems in the components that comprise the workload.
Implementation guidance
1. Turn on logging where available. Monitoring data should be obtained from all components of
the workloads. Turn on additional logging, such as S3 Access Logs, and permit your workload
User guides:
• Creating a trail
• Monitoring memory and disk metrics for Amazon EC2 Linux instances
• Using CloudWatch Logs with container instances
• VPC Flow Logs
• What is Amazon DevOps Guru?
• What is AWS X-Ray?
Related blogs:
Store log data and apply filters where necessary to calculate metrics, such as counts of a specific
log event, or latency calculated from log event timestamps.
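For example, a metric filter can turn the count of a specific log event into a CloudWatch metric that you can graph and alarm on. A minimal Python (boto3) sketch follows; the log group name, filter pattern, and namespace are placeholders.

import boto3

logs = boto3.client("logs")

logs.put_metric_filter(
    logGroupName="/my-app/application",   # placeholder log group
    filterName="application-errors",
    filterPattern="ERROR",                # count log events containing the term ERROR
    metricTransformations=[{
        "metricName": "ApplicationErrorCount",
        "metricNamespace": "MyApp",       # placeholder namespace
        "metricValue": "1",
    }],
)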
• The Amazon Builders' Library: Instrumenting distributed systems for operational visibility
When organizations detect potential issues, they send real-time notifications and alerts to the
appropriate personnel and systems in order to respond quickly and effectively to these issues.
Desired outcome: Rapid responses to operational events are possible through configuration of
relevant alarms based on service and application metrics. When alarm thresholds are breached, the
appropriate personnel and systems are notified so they can address underlying issues.
Common anti-patterns:
• Configuring alarms with an excessively high threshold, resulting in the failure to send vital
notifications.
• Configuring alarms with a threshold that is too low, resulting in inaction on important alerts due
to the noise of excessive notifications.
• Not updating alarms and their threshold when usage changes.
• For alarms best addressed through automated actions, sending the notification to personnel
instead of generating the automated action results in excessive notifications being sent.
Benefits of establishing this best practice: Sending real-time notifications and alerts to the
appropriate personnel and systems allows for early detection of issues and rapid responses to
operational incidents.
Implementation guidance
Workloads should be equipped with real-time processing and alarming to improve the detectability
of issues that could impact the availability of the application and serve as triggers for automated
response. Organizations can perform real-time processing and alarming by creating alerts with
defined metrics in order to receive notifications whenever significant events occur or a metric
exceeds a threshold.
Amazon CloudWatch allows you to create metric and composite alarms using CloudWatch
alarms based on static threshold, anomaly detection, and other criteria. For more detail on the
types of alarms you can configure using CloudWatch, see the alarms section of the CloudWatch
documentation.
Many AWS services (more than 30, including Amazon EC2, Amazon S3, and Amazon RDS) can send
Amazon SNS messages when you configure them to do so.
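For example, the following AWS CLI sketch creates an alarm on a hypothetical application latency
metric and notifies an Amazon SNS topic when the p99 value stays above a threshold (the metric
name, namespace, and topic ARN are placeholders; the threshold of 2000 assumes the metric is
published in milliseconds):

  aws cloudwatch put-metric-alarm \
    --alarm-name my-app-p99-latency-high \
    --namespace MyApp \
    --metric-name Latency \
    --extended-statistic p99 \
    --period 60 \
    --evaluation-periods 5 \
    --threshold 2000 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts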
Implementation steps
Resources
Related documents:
agreements (SLAs). Automation can range from self-healing activities of single components to full-
site failover.
Common anti-patterns:
Benefits of establishing this best practice: Automating alarm processing can improve system
resiliency. The system takes corrective actions automatically, reducing the manual, error-prone
interventions that would otherwise be required. The workload continues to meet its availability
goals, and service disruption is reduced.
Implementation guidance
To effectively manage alerts and automate their response, categorize alerts based on their
criticality and impact, document response procedures, and plan responses before ranking tasks.
Identify tasks requiring specific actions (often detailed in runbooks), and examine all runbooks and
playbooks to determine which tasks can be automated. If actions can be defined, often they can be
automated. If actions cannot be automated, document manual steps in an SOP and train operators
on them. Continually challenge manual processes for automation opportunities where you can
establish and maintain a plan to automate alert responses.
Implementation steps
1. Create an inventory of alarms: To obtain a list of all alarms, you can use the AWS CLI with the
Amazon CloudWatch command describe-alarms. Depending upon how many alarms you
have set up, you might have to use pagination to retrieve a subset of alarms for each call, or
alternatively you can use the AWS SDK to obtain the alarms using an API call.
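A minimal AWS CLI sketch of building that inventory with pagination:

  # First page of up to 100 alarms; the response includes a NextToken if more remain
  aws cloudwatch describe-alarms --max-items 100

  # Subsequent pages: pass the NextToken value back in
  aws cloudwatch describe-alarms --max-items 100 --starting-token <NextToken-value>

If you omit --max-items, the AWS CLI paginates through all results automatically.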
Related documents:
Related videos:
Related examples:
• Reliability Workshops
• Amazon CloudWatch and Systems Manager Workshop
Collect log files and metrics histories and analyze these for broader trends and workload insights.
Amazon CloudWatch Logs Insights supports a simple yet powerful query language that you can use
to analyze log data. Amazon CloudWatch Logs also supports subscriptions that allow data to flow
seamlessly to Amazon S3, where you can use Amazon Athena to query the data. It also supports
Frequently review how workload monitoring is implemented and update it based on significant
events and changes.
Effective monitoring is driven by key business metrics. Ensure these metrics are accommodated in
your workload as business priorities change.
Auditing your monitoring helps ensure that you know when an application is meeting its
availability goals. Root cause analysis requires the ability to discover what happened when failures
occur. AWS provides services that allow you to track the state of your services during an incident:
• Amazon CloudWatch Logs: You can store your logs in this service and inspect their contents.
• Amazon CloudWatch Logs Insights: A fully managed service that allows you to analyze massive
volumes of log data in seconds, with fast, interactive queries and visualizations (see the example
after this list).
• AWS Config: You can see what AWS infrastructure was in use at different points in time.
• AWS CloudTrail: You can see which AWS APIs were invoked at what time and by what principal.
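As an example of querying logs during an investigation, a CloudWatch Logs Insights query can be
started from the AWS CLI (the log group name is hypothetical, and the date commands assume GNU
date):

  QUERY_ID=$(aws logs start-query \
    --log-group-name /my-app/application \
    --start-time $(date -d '1 hour ago' +%s) \
    --end-time $(date +%s) \
    --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20' \
    --query queryId --output text)

  # Results are available shortly after the query completes
  aws logs get-query-results --query-id "$QUERY_ID"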
At AWS, we conduct a weekly meeting to review operational performance and to share learnings
between teams. Because there are so many teams in AWS, we created The Wheel to randomly pick
a workload to review. Establishing a regular cadence for operational performance reviews and
knowledge sharing enhances your ability to achieve higher performance from your operational
teams.
Common anti-patterns:
Trace requests as they process through service components so product teams can more easily
analyze and debug issues and improve performance.
Desired outcome: Workloads with comprehensive tracing across all components are easy to
debug, improving mean time to resolution (MTTR) of errors and latency by simplifying root cause
discovery. End-to-end tracing reduces the time it takes to discover impacted components and drill
into the detailed root causes of errors or latency.
Common anti-patterns:
• Tracing is used for some components but not for all. For example, without tracing for AWS
Lambda, teams might not clearly understand latency caused by cold starts in a spiky workload.
• Synthetic canaries or real-user monitoring (RUM) are not configured with tracing. Without
canaries or RUM, client interaction telemetry is omitted from the trace analysis yielding an
incomplete performance profile.
• Hybrid workloads include both cloud-native and third-party tracing tools, but steps have not
been taken to select and fully integrate a single tracing solution. Based on the selected tracing
solution, either cloud-native tracing SDKs should be used to instrument components that are not
cloud native, or third-party tools should be configured to ingest cloud-native trace telemetry.
Benefits of establishing this best practice: When development teams are alerted to issues, they
can see a full picture of system component interactions, including component by component
correlation to logging, performance, and failures. Because tracing makes it easy to visually identify
root causes, less time is spent investigating root causes. Teams that understand component
interactions in detail make better and faster decisions when resolving issues. Decisions like when
to invoke disaster recovery (DR) failover or where to best implement self-healing strategies can
be improved by analyzing systems traces, ultimately improving customer satisfaction with your
services.
Implementation guidance
Teams that operate distributed applications can use tracing tools to establish a correlation
identifier, collect traces of requests, and build service maps of connected components. All
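For example, once tracing is in place, the service map and error traces can be retrieved from AWS
X-Ray with the AWS CLI (a sketch; the time window below assumes GNU date):

  # Service map of connected components for the last hour
  aws xray get-service-graph \
    --start-time $(date -d '1 hour ago' +%s) \
    --end-time $(date +%s)

  # Summaries of traces that recorded an error in the same window
  aws xray get-trace-summaries \
    --start-time $(date -d '1 hour ago' +%s) \
    --end-time $(date +%s) \
    --filter-expression "error"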
Related documents:
• The Amazon Builders' Library: Instrumenting distributed systems for operational visibility
• Availability and Beyond: Understanding and Improving the Resilience of Distributed Systems on
AWS
Related examples:
Related videos:
Related tools:
• AWS X-Ray
• Amazon CloudWatch
• Amazon Route 53
S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per
second.
Configure and use Amazon CloudFront or a trusted content delivery network (CDN). A CDN can
provide faster end-user response times and can serve requests for content from cache, therefore
reducing the need to scale your workload.
Common anti-patterns:
• Implementing Auto Scaling groups for automated healing, but not implementing elasticity.
• Using automatic scaling to respond to large increases in traffic.
• Deploying highly stateful applications, eliminating the option of elasticity.
Benefits of establishing this best practice: Automation removes the potential for manual error
in deploying and decommissioning resources. Automation removes the risk of cost overruns and
denial of service due to slow response on needs for deployment or decommissioning.
Implementation guidance
• Configure and use AWS Auto Scaling. This monitors your applications and automatically adjusts
capacity to maintain steady, predictable performance at the lowest possible cost. Using AWS
Auto Scaling, you can set up application scaling for multiple resources across multiple services
(see the example after this list).
• What is AWS Auto Scaling?
• Configure Auto Scaling on your Amazon EC2 instances and Spot Fleets, Amazon ECS tasks,
Amazon DynamoDB tables and indexes, Amazon Aurora Replicas, and AWS Marketplace
appliances as applicable.
• Managing throughput capacity automatically with DynamoDB Auto Scaling
• Use service API operations to specify the alarms, scaling policies, warm up times, and
cool down times.
• Use Elastic Load Balancing. Load balancers can distribute load by path or by network
connectivity.
• What is Elastic Load Balancing?
• Application Load Balancers can distribute load by path.
• What is an Application Load Balancer?
• Configure Amazon CloudFront distributions for your workloads, or use a third-party CDN.
• You can limit access to your workloads so that they are only accessible from CloudFront by
using the IP ranges for CloudFront in your endpoint security groups or access policies.
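As an example of the scaling-policy APIs mentioned above, the following sketch attaches a target
tracking policy to a hypothetical Auto Scaling group so that it scales to keep average CPU
utilization near 70%:

  aws autoscaling put-scaling-policy \
    --auto-scaling-group-name my-asg \
    --policy-name cpu-target-70 \
    --policy-type TargetTrackingScaling \
    --target-tracking-configuration '{"PredefinedMetricSpecification":{"PredefinedMetricType":"ASGAverageCPUUtilization"},"TargetValue":70.0}'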
Resources
Related documents:
• APN Partner: partners that can help you create automated compute solutions
• AWS Auto Scaling: How Scaling Plans Work
• AWS Marketplace: products that can be used with auto scaling
• Managing Throughput Capacity Automatically with DynamoDB Auto Scaling
• Using a load balancer with an Auto Scaling group
• What Is AWS Global Accelerator?
• What Is Amazon EC2 Auto Scaling?
• What is AWS Auto Scaling?
• What is Amazon CloudFront?
• What is Amazon Route 53?
• What is Elastic Load Balancing?
• What is a Network Load Balancer?
• What is an Application Load Balancer?
• Working with records
You first must configure health checks and the criteria on these checks to indicate when availability
is impacted by lack of resources. Then, either notify the appropriate personnel to manually scale
the resource, or start automation to automatically scale it.
Scale can be manually adjusted for your workload (for example, changing the number of EC2
instances in an Auto Scaling group, or modifying throughput of a DynamoDB table through the
capacity to handle sudden increases in traffic, without throttling. For more detail, see
Managing throughput capacity automatically with DynamoDB auto scaling.
Resources
Related documents:
REL07-BP03 Obtain resources upon detection that more resources are needed for a workload
Many AWS services automatically scale to meet demand. If using Amazon EC2 instances or Amazon
ECS clusters, you can configure automatic scaling of these to occur based on usage metrics that
correspond to demand for your workload. For Amazon EC2, average CPU utilization, load balancer
request count, or network bandwidth can be used to scale out (or scale in) EC2 instances. For
Amazon ECS, average CPU utilization, load balancer request count, and memory utilization can be
used to scale out (or scale in) ECS tasks. Using Target Auto Scaling on AWS, the autoscaler acts like
a household thermostat, adding or removing resources to maintain the target value (for example,
70% CPU utilization) that you specify.
Amazon EC2 Auto Scaling can also do Predictive Auto Scaling, which uses machine learning to
analyze each resource's historical workload and regularly forecasts the future load.
Little’s Law helps calculate how many instances of compute (EC2 instances, concurrent Lambda
functions, and so on) you need.
L = λW
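Here, L is the number of requests in the system (the mean concurrency), λ is the mean rate at which
requests arrive (for example, requests per second), and W is the mean time each request spends in
the system (for example, seconds). As a worked example, if a workload receives λ = 100 requests per
second and each request spends W = 0.5 seconds in the system on average, then L = 100 × 0.5 = 50
requests are in flight at any given time, so you need capacity for at least 50 concurrent executions,
plus headroom.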
It’s important to perform sustained load testing. Load tests should discover the breaking point
and test the performance of your workload. AWS makes it easy to set up temporary testing
environments that model the scale of your production workload. In the cloud, you can create a
production-scale test environment on demand, complete your testing, and then decommission the
resources. Because you only pay for the test environment when it's running, you can simulate your
live environment for a fraction of the cost of testing on premises.
Load testing in production should also be considered as part of game days where the production
system is stressed, during hours of lower customer usage, with all personnel on hand to interpret
results and address any problems that arise.
Common anti-patterns:
• Performing load testing on deployments that are not the same configuration as your production.
• Performing load testing only on individual pieces of your workload, and not on the entire
workload.
• Performing load testing with a subset of requests and not a representative set of actual requests.
Benefits of establishing this best practice: You know which components in your architecture fail
under load, and you can identify which metrics to watch so you can detect that you are approaching
that load in time to address the problem and prevent the impact of that failure.
Implementation guidance
• Perform load testing to identify which aspect of your workload indicates that you must add or
remove capacity. Load testing should have representative traffic similar to what you receive in
production. Increase the load while watching the metrics you have instrumented to determine
which metric indicates when you must add or remove resources.
• Identify the mix of requests. You may have varied mixes of requests, so you should look at
various time frames when identifying the mix of traffic.
• Implement a load driver. You can use custom code, open source, or commercial software to
implement a load driver.
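As an illustration only, a crude shell-based load driver can be sketched as below (the endpoint URL
is hypothetical; for sustained, representative load tests prefer a purpose-built load testing tool, and
never point such a loop at a production system without planning):

  URL="https://example.com/api/health"   # hypothetical endpoint
  CONCURRENCY=50
  REQUESTS_PER_WORKER=200

  for i in $(seq 1 "$CONCURRENCY"); do
    (
      for j in $(seq 1 "$REQUESTS_PER_WORKER"); do
        # Record HTTP status and total request time for later analysis
        curl -s -o /dev/null -w "%{http_code} %{time_total}\n" "$URL"
      done
    ) &
  done
  wait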
For example, put processes in place to ensure rollback safety during deployments. Ensuring that
you can roll back a deployment without any disruption for your customers is critical in making a
service reliable.
For runbook procedures, start with a valid effective manual process, implement it in code, and
invoke it to automatically run where appropriate.
Even for sophisticated workloads that are highly automated, runbooks are still useful for running
game days or meeting rigorous reporting and auditing requirements.
Note that playbooks are used in response to specific incidents, and runbooks are used to achieve
specific outcomes. Often, runbooks are for routine activities, while playbooks are used for
responding to non-routine events.
Common anti-patterns:
Benefits of establishing this best practice: Effective change planning increases your ability to
successfully run the change because you are aware of all the systems impacted. Validating your
change in test environments increases your confidence.
Implementation guidance
components such as user interfaces, APIs, databases, and source code. When you examine these
components of the system, functional tests verify that each feature behaves as expected, which
protects both user expectations and the software's integrity. Integrate functional tests as part of
your regular deployment, and use automation to deploy all changes, which reduces the potential
for introduction of human errors.
Implementation guidance
Integrate functional testing as part of your deployment. Functional tests are run as part of
automated deployment. If success criteria are not met, the pipeline is halted or rolled back. AWS
CodePipeline provides a continuous delivery pipeline for automated testing, which allows testers
to automate the entire testing and deployment process. It integrates with AWS services such as
AWS CodeBuild and AWS CodeDeploy to automate the build, test, and deployment phases of the
software development lifecycle.
Implementation steps
• Configure your pipeline: Set up your source, build, test, and deploy stages using the AWS
CodePipeline console or AWS Command Line Interface (CLI).
• Define your source: With AWS CodePipeline, you can automatically retrieve source code from
version control systems like GitHub, AWS CodeCommit, or Bitbucket, which verifies that the
latest code is always used for testing.
• Automate builds and tests: AWS CodeBuild can automatically build and test your code and
generate test reports. It supports popular testing frameworks like JUnit, NUnit, and TestNG.
• Deploy your code: Once the code has been built and tested, AWS CodeDeploy can deploy it
to your testing environment, including Amazon EC2 instances, AWS Lambda functions, or on-
premises servers.
• Monitor pipelines: AWS CodePipeline can track the progress of your pipeline and the status of
each stage. You can use quality checks to block the pipeline based on test execution status, and
you can receive notifications for any pipeline stage failure or for pipeline completion.
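Beyond the stages above, a pipeline run can also be started on demand from the AWS CLI, for
example from other automation (the pipeline name is hypothetical):

  aws codepipeline start-pipeline-execution --name my-release-pipeline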
Resources
Related documents:
• Use AWS CodePipeline with AWS CodeBuild to test code and run builds
• Include updates to your disaster recovery plans and standard operating procedures (SOPs) with
any significant deployment.
• Integrate reliability testing into your automated deployment pipelines. Services such as AWS
Resilience Hub can be integrated into your CI/CD pipeline to establish continuous resilience
assessments that are automatically evaluated as part of every deployment.
• Define your applications in AWS Resilience Hub. Resilience assessments generate code snippets
that help you create recovery procedures as AWS Systems Manager documents for your
applications and provide a list of recommended Amazon CloudWatch monitors and alarms.
• Once your DR plans and SOPs are updated, complete disaster recovery testing to verify that they
are effective. Disaster recovery testing helps you determine if you can restore your system after
an event and return to normal operations. You can simulate various disaster recovery strategies
and identify whether your planning is sufficient to meet your uptime requirements. Common
disaster recovery strategies include backup and restore, pilot light, cold standby, warm standby,
hot standby, and active-active, and they all differ in cost and complexity. Before disaster recovery
testing, we recommend that you define your recovery time objective (RTO) and recovery point
objective (RPO) to simplify the choice of strategy to simulate. AWS offers disaster recovery tools
like AWS Elastic Disaster Recovery to help you get started with your planning and testing.
• Chaos engineering experiments introduce disruptions to the system, such as network outages
and service failures. By simulating with controlled failures, you can discover your system's
vulnerabilities while containing the impacts of the injected failures. Just like the other strategies,
run controlled failure simulations in non-production environments using services like AWS Fault
Injection Service to gain confidence before deploying in production.
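For example, once an AWS FIS experiment template has been defined, an experiment run can be
started from the AWS CLI (the template ID below is a placeholder):

  aws fis start-experiment \
    --experiment-template-id EXTabc123EXAMPLE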
Resources
Related documents:
Related videos:
• Safer deployments with fast rollback and recovery processes: Deployments are safer because
the previous working version is not changed. You can roll back to it if errors are detected.
• Enhanced security posture: By not allowing changes to infrastructure, remote access
mechanisms (such as SSH) can be disabled. This reduces the attack vector, improving your
organization's security posture.
Implementation guidance
Automation
With Infrastructure as code (IaC), infrastructure provisioning, orchestration, and deployment steps
are defined in a programmatic, descriptive, and declarative way and stored in a source control
system. Leveraging infrastructure as code makes it simpler to automate infrastructure deployment
and helps achieve infrastructure immutability.
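For example, a template kept in source control can be deployed (and redeployed idempotently) with a
single AWS CLI command; the template file and stack name here are placeholders:

  aws cloudformation deploy \
    --template-file template.yaml \
    --stack-name my-workload \
    --capabilities CAPABILITY_IAM \
    --no-fail-on-empty-changeset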
Deployment patterns
When a change in the workload is required, the immutable infrastructure deployment strategy
mandates that a new set of infrastructure resources is deployed, including all necessary changes.
It is important for this new set of resources to follow a rollout pattern that minimizes user impact.
There are two main strategies for this deployment:
Canary deployment: The practice of directing a small number of your customers to the new
version, usually running on a single service instance (the canary). You then deeply scrutinize any
behavior changes or errors that are generated. You can remove traffic from the canary if you
encounter critical problems and send the users back to the previous version. If the deployment
is successful, you can continue to deploy at your desired velocity, while monitoring the changes
for errors, until you are fully deployed. AWS CodeDeploy can be configured with a deployment
configuration that allows a canary deployment.
Blue/green deployment: Similar to the canary deployment, except that a full fleet of the
application is deployed in parallel. You alternate your deployments across the two stacks (blue
and green). Once again, you can send traffic to the new version, and fall back to the old version
maintenance, validation, sharing, and deployment of customized, secure, and up-to-date Linux
or Windows custom AMI.
• Some of the services that support automation are:
• AWS Elastic Beanstalk is a service to rapidly deploy and scale web applications developed
with Java, .NET, PHP, Node.js, Python, Ruby, Go, and Docker on familiar servers such as
Apache, NGINX, Passenger, and IIS.
• AWS Proton helps platform teams connect and coordinate all the different tools your
development teams need for infrastructure provisioning, code deployments, monitoring,
and updates. AWS Proton enables automated infrastructure as code provisioning and
deployment of serverless and container-based applications.
• AWS CloudFormation helps developers create AWS resources in an orderly and predictable
fashion. Resources are written in text files using JSON or YAML format. The templates
require a specific syntax and structure that depends on the types of resources being created
and managed. You author your resources in JSON or YAML with any code editor such as AWS
Cloud9, check them into a version control system, and then CloudFormation builds the specified
services in a safe, repeatable manner.
• AWS Serverless Application Model (AWS SAM) is an open-source framework that you can use
to build serverless applications on AWS. AWS SAM integrates with other AWS services, and is
an extension of AWS CloudFormation.
• AWS Cloud Development Kit (AWS CDK) is an open-source software development framework
to model and provision your cloud application resources using familiar programming
languages. You can use AWS CDK to model application infrastructure using TypeScript,
Python, Java, and .NET. AWS CDK uses AWS CloudFormation in the background to provision
resources in a safe, repeatable manner.
• AWS Cloud Control API introduces a common set of Create, Read, Update, Delete, and
List (CRUDL) APIs to help developers manage their cloud infrastructure in an easy and
consistent way. The Cloud Control API common APIs allow developers to uniformly manage
the lifecycle of AWS and third-party services.
• Canary deployments:
Common anti-patterns:
Benefits of establishing this best practice: When you use automation to deploy all changes, you
remove the potential for introduction of human error and provide the ability to test before you
change production. Performing this process prior to the production push verifies that your plans are
complete. Additionally, building automatic rollback into your release process can identify production
issues and return your workload to its previous working operational state.
Implementation guidance
Automate your deployment pipeline. Deployment pipelines allow you to invoke automated testing
and detection of anomalies, and either halt the pipeline at a certain step before production
deployment, or automatically roll back a change. An integral part of this is the adoption of the
culture of continuous integration and continuous delivery/deployment (CI/CD), where a commit
or code change passes through various automated stage gates from build and test stages to
deployment on production environments.
Although conventional wisdom suggests that you keep people in the loop for the most difficult
operational procedures, we suggest that you automate the most difficult procedures for that very
reason.
Implementation steps
You can automate deployments to remove manual operations by following these steps:
• Set up a code repository to store your code securely: Use AWS CodeCommit to create a secure
Git-based repository.
• Configure a continuous integration service to compile your source code, run tests, and create
deployment artifacts: To set up a build project for this purpose, see Getting started with AWS
CodeBuild using the console.
• Set up a deployment service that automates application deployments and handles the
complexity of application updates without reliance on error-prone manual deployments:
Failure management
Questions
• REL 9. How do you back up data?
• REL 10. How do you use fault isolation to protect your workload?
• REL 11. How do you design your workload to withstand component failures?
• REL 12. How do you test reliability?
• REL 13. How do you plan for disaster recovery (DR)?
Back up data, applications, and configuration to meet your requirements for recovery time
objectives (RTO) and recovery point objectives (RPO).
Best practices
• REL09-BP01 Identify and back up all data that needs to be backed up, or reproduce the data
from sources
• REL09-BP02 Secure and encrypt backups
• REL09-BP03 Perform data backup automatically
• REL09-BP04 Perform periodic recovery of the data to verify backup integrity and processes
REL09-BP01 Identify and back up all data that needs to be backed up, or reproduce the data
from sources
Understand and use the backup capabilities of the data services and resources used by the
workload. Most services provide capabilities to back up workload data.
Desired outcome: Data sources have been identified and classified based on criticality, and a
strategy for data recovery has been established based on the RPO. This strategy involves either
backing up these data sources or having the ability to reproduce data from other sources. In the
case of data loss, the strategy implemented allows recovery or reproduction of data within the
defined RPO and RTO.
backup. As another example, if you work with Amazon EMR, it might not be necessary to back up
your HDFS data store, as long as you can reproduce the data into Amazon EMR from Amazon S3.
When selecting a backup strategy, consider the time it takes to recover data. The time needed to
recover data depends on the type of backup (in the case of a backup strategy), or the complexity of
the data reproduction mechanism. This time should fall within the RTO for the workload.
Implementation steps
1. Identify all data sources for the workload. Data can be stored on a number of resources such
as databases, volumes, filesystems, logging systems, and object storage. Refer to the Resources
section to find Related documents on different AWS services where data is stored, and the
backup capability these services provide.
2. Classify data sources based on criticality. Different data sets will have different levels of
criticality for a workload, and therefore different requirements for resiliency. For example, some
data might be critical and require an RPO near zero, while other data might be less critical and
can tolerate a higher RPO and some data loss. Similarly, different data sets might have different
RTO requirements as well.
3. Use AWS or third-party services to create backups of the data. AWS Backup is a managed
service that allows creating backups of various data sources on AWS. AWS Elastic Disaster
Recovery handles automated sub-second data replication to an AWS Region. Most AWS services
also have native capabilities to create backups. The AWS Marketplace has many solutions that
provide these capabilities as well. Refer to the Resources listed below for information on how to
create backups of data from various AWS services.
4. For data that is not backed up, establish a data reproduction mechanism. You might choose
not to back up data that can be reproduced from other sources for various reasons. There might
be a situation where it is cheaper to reproduce data from sources when needed rather than
creating a backup as there may be a cost associated with storing backups. Another example is
where restoring from a backup takes longer than reproducing the data from sources, resulting
in a breach in RTO. In such situations, consider tradeoffs and establish a well-defined process
for how data can be reproduced from these sources when data recovery is necessary. For
example, if you have loaded data from Amazon S3 to a data warehouse (like Amazon Redshift),
or MapReduce cluster (like Amazon EMR) to do analysis on that data, this may be an example
of data that can be reproduced from other sources. As long as the results of these analyses are
either stored somewhere or reproducible, you would not suffer a data loss from a failure in the
data warehouse or MapReduce cluster. Other examples that can be reproduced from sources
include caches (like Amazon ElastiCache) or RDS read replicas.
Related videos:
• AWS re:Invent 2021 - Backup, disaster recovery, and ransomware protection with AWS
• AWS Backup Demo: Cross-Account and Cross-Region Backup
• AWS re:Invent 2019: Deep dive on AWS Backup, ft. Rackspace (STG341)
Related examples:
Control and detect access to backups using authentication and authorization. Prevent and detect if
data integrity of backups is compromised using encryption.
Common anti-patterns:
• Having the same access to the backups and restoration automation as you do to the data.
• Not encrypting your backups.
Benefits of establishing this best practice: Securing your backups prevents tampering with the
data, and encryption of the data prevents access to that data if it is accidentally exposed.
Implementation guidance
Control and detect access to backups using authentication and authorization, such as AWS Identity
and Access Management (IAM). Prevent and detect if data integrity of backups is compromised
using encryption.
Amazon S3 supports several methods of encryption of your data at rest. Using server-side
encryption, Amazon S3 accepts your objects as unencrypted data, and then encrypts them as they
are stored. Using client-side encryption, your workload application is responsible for encrypting the
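For example, default encryption at rest can be enforced on a bucket that stores backups with the
AWS CLI (the bucket name and key alias are hypothetical):

  aws s3api put-bucket-encryption \
    --bucket my-backup-bucket \
    --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"aws:kms","KMSMasterKeyID":"alias/backup-key"}}]}'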
Related examples:
Desired outcome: An automated process that creates backups of data sources at an established
cadence.
Common anti-patterns:
Benefits of establishing this best practice: Automating backups verifies that they are taken
regularly based on your RPO, and alerts you if they are not taken.
4. For data sources not supported by an automated backup solution or managed service such as
on-premises data sources or message queues, consider using a trusted third-party solution to
create automated backups. Alternatively, you can create automation to do this using the AWS
CLI or SDKs. You can use AWS Lambda Functions or AWS Step Functions to define the logic
involved in creating a data backup, and use Amazon EventBridge to invoke it at a frequency
based on your RPO.
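A minimal sketch of the scheduling piece with the AWS CLI (the rule name and Lambda function ARN
are hypothetical; you would also grant EventBridge permission to invoke the function, for example
with aws lambda add-permission):

  # Create a schedule aligned to the RPO (here, once per hour)
  aws events put-rule \
    --name hourly-backup \
    --schedule-expression "rate(1 hour)"

  # Point the rule at the function that performs the backup
  aws events put-targets \
    --rule hourly-backup \
    --targets 'Id=backup-function,Arn=arn:aws:lambda:us-east-1:123456789012:function:create-backup'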
Resources
Related documents:
Related videos:
• AWS re:Invent 2019: Deep dive on AWS Backup, ft. Rackspace (STG341)
Related examples:
Using AWS, you can stand up a testing environment and restore your backups to assess RTO and
RPO capabilities, and run tests on data content and integrity.
Additionally, Amazon RDS and Amazon DynamoDB allow point-in-time recovery (PITR). Using
continuous backup, you can restore your dataset to the state it was in at a specified date and time.
Such tests can verify whether all the data is available, not corrupted, and accessible, and whether
any data loss falls within the RPO for the workload. Such tests can also help ascertain if recovery
mechanisms are fast enough to accommodate the workload's RTO.
AWS Elastic Disaster Recovery offers continual point-in-time recovery snapshots of Amazon EBS
volumes. As source servers are replicated, point-in-time states are chronicled over time based on
the configured policy. Elastic Disaster Recovery helps you verify the integrity of these snapshots by
launching instances for test and drill purposes without redirecting the traffic.
Implementation steps
1. Identify data sources that are currently being backed up and where these backups are being
stored. For implementation guidance, see REL09-BP01 Identify and back up all data that needs
to be backed up, or reproduce the data from sources.
2. Establish criteria for data validation for each data source. Different types of data will have
different properties which might require different validation mechanisms. Consider how this
data might be validated before you are confident to use it in production. Some common ways to
validate data are using data and backup properties such as data type, format, checksum, size, or
a combination of these with custom validation logic. For example, this might be a comparison of
the checksum values between the restored resource and the data source at the time the backup
was created.
3. Establish RTO and RPO for restoring the data based on data criticality. For implementation
guidance, see REL13-BP01 Define recovery objectives for downtime and data loss.
4. Assess your recovery capability. Review your backup and restore strategy to understand if
it can meet your RTO and RPO, and adjust the strategy as necessary. Using AWS Resilience
Hub, you can run an assessment of your workload. The assessment evaluates your application
configuration against the resiliency policy and reports if your RTO and RPO targets can be met.
5. Do a test restore using currently established processes used in production for data restoration.
These processes depend on how the original data source was backed up, the format and storage
location of the backup itself, or if the data is reproduced from other sources. For example, if
you are using a managed service such as AWS Backup, this might be as simple as restoring the
Level of effort for the Implementation Plan: Moderate to high depending on the complexity of
the validation criteria.
Resources
Related documents:
Related examples:
independent data centers) can be treated as a single logical deployment target for your workload,
including the ability to synchronously replicate data (for example, between databases). This allows
you to use Availability Zones in an active/active or active/standby configuration.
Availability Zones are independent, and therefore workload availability is increased when the
workload is architected to use multiple zones. Some AWS services (including the Amazon EC2
instance data plane) are deployed as strictly zonal services where they have shared fate with the
Availability Zone they are in. Amazon EC2 instances in the other AZs will however be unaffected
and continue to function. Similarly, if a failure in an Availability Zone causes an Amazon Aurora
database to fail, a read-replica Aurora instance in an unaffected AZ can be automatically promoted
to primary. Regional AWS services, such as Amazon DynamoDB on the other hand internally use
multiple Availability Zones in an active/active configuration to achieve the availability design goals
for that service, without you needing to configure AZ placement.
Figure 9: Multi-tier architecture deployed across three Availability Zones. Note that Amazon S3 and
Amazon DynamoDB are always Multi-AZ automatically. The ELB also is deployed to all three zones.
While AWS control planes typically provide the ability to manage resources within the entire
Region (multiple Availability Zones), certain control planes (including Amazon EC2 and Amazon
EBS) have the ability to filter results to a single Availability Zone. When this is done, the request
is processed only in the specified Availability Zone, reducing exposure to disruption in other
Availability Zones. This AWS CLI example illustrates getting Amazon EC2 instance information from
only the us-east-2c Availability Zone:
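(The example below is a representative form of that command.)

  aws ec2 describe-instances \
    --filters Name=availability-zone,Values=us-east-2c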
(Amazon S3) Replication, Amazon RDS Read Replicas (including Aurora Read Replicas), and Amazon
DynamoDB Global Tables. With continuous replication, versions of your data are available for near
immediate use in each of your active Regions.
Using AWS CloudFormation, you can define your infrastructure and deploy it consistently
across AWS accounts and across AWS Regions. And AWS CloudFormation StackSets extends this
functionality by allowing you to create, update, or delete AWS CloudFormation stacks across
multiple accounts and regions with a single operation. For Amazon EC2 instance deployments, an
AMI (Amazon Machine Image) is used to supply information such as hardware configuration and
installed software. You can implement an Amazon EC2 Image Builder pipeline that creates the
AMIs you need and copies them to your active Regions. This ensures that these golden AMIs have
everything you need to deploy and scale out your workload in each new Region.
To route traffic, both Amazon Route 53 and AWS Global Accelerator permit the definition of
policies that determine which users go to which active regional endpoint. With Global Accelerator
you set a traffic dial to control the percentage of traffic that is directed to each application
endpoint. Route 53 supports this percentage approach, and also multiple other available policies
including geoproximity and latency based ones. Global Accelerator automatically leverages the
extensive network of AWS edge servers, to onboard traffic to the AWS network backbone as soon
as possible, resulting in lower request latencies.
All of these capabilities operate so as to preserve each Region’s autonomy. There are very few
exceptions to this approach, including our services that provide global edge delivery (such as
Amazon CloudFront and Amazon Route 53), along with the control plane for the AWS Identity and
Access Management (IAM) service. Most services operate entirely within a single Region.
For workloads that run in an on-premises data center, architect a hybrid experience when possible.
AWS Direct Connect provides a dedicated network connection from your premises to AWS allowing
you to run in both.
Another option is to run AWS infrastructure and services on premises using AWS Outposts. AWS
Outposts is a fully managed service that extends AWS infrastructure, AWS services, APIs, and tools
to your data center. The same hardware infrastructure used in the AWS Cloud is installed in your
data center. AWS Outposts are then connected to the nearest AWS Region. You can then use AWS
Outposts to support your workloads that have low latency or local data processing requirements.
• Determine if AWS Local Zones helps you provide service to your users. If you have low-latency
requirements, see if AWS Local Zones is located near your users. If yes, then use it to deploy
workloads closer to those users.
Resources
Related documents:
Related videos:
• AWS re:Invent 2018: Architecture Patterns for Multi-Region Active-Active Applications (ARC209-
R2)
• AWS re:Invent 2019: Innovation and operation of the AWS global network infrastructure
(NET339)
Desired outcome: For high availability, always (when possible) deploy your workload components
to multiple Availability Zones (AZs). For workloads with extreme resilience requirements, carefully
evaluate the options for a multi-Region architecture.
Implementation guidance
For a disaster event based on disruption or partial loss of one Availability Zone, implementing
a highly available workload in multiple Availability Zones within a single AWS Region helps
mitigate against natural and technical disasters. Each AWS Region is comprised of multiple
Availability Zones, each isolated from faults in the other zones and separated by a meaningful
distance. However, for a disaster event that includes the risk of losing multiple Availability Zone
components, which are a significant distance away from each other, you should implement
disaster recovery options to mitigate against failures of a Region-wide scope. For workloads that
require extreme resilience (critical infrastructure, health-related applications, financial system
infrastructure, etc.), a multi-Region strategy may be required.
Implementation Steps
1. Evaluate your workload and determine whether the resilience needs can be met by a multi-
AZ approach (single AWS Region), or if they require a multi-Region approach. Implementing a
multi-Region architecture to satisfy these requirements will introduce additional complexity,
therefore carefully consider your use case and its requirements. Resilience requirements can
almost always be met using a single AWS Region. Consider the following possible requirements
when determining whether you need to use multiple Regions:
a. Disaster recovery (DR): For a disaster event based on disruption or partial loss of one
Availability Zone, implementing a highly available workload in multiple Availability Zones
within a single AWS Region helps mitigate against natural and technical disasters. For a
disaster event that includes the risk of losing multiple Availability Zone components, which
are a significant distance away from each other, you should implement disaster recovery
across multiple Regions to mitigate against natural disasters or technical failures of a Region-
wide scope.
b. High availability (HA): A multi-Region architecture (using multiple AZs in each Region) can be
used to achieve greater than four 9’s (> 99.99%) availability.
c. Stack localization: When deploying a workload to a global audience, you can deploy localized
stacks in different AWS Regions to serve audiences in those Regions. Localization can include
language, currency, and types of data stored.
d. Proximity to users: When deploying a workload to a global audience, you can reduce latency
by deploying stacks in AWS Regions close to where the end users are.
e. Data residency: Some workloads are subject to data residency requirements, where data
from certain users must remain within a specific country’s borders. Based on the regulation in
i. Endpoints for standard accelerators in AWS Global Accelerator - AWS Global Accelerator
(amazon.com)
d. For applications that leverage Amazon EventBridge, consider cross-Region buses to forward
events to other Regions you select.
i. Sending and receiving Amazon EventBridge events between AWS Regions
e. For Amazon Aurora databases, consider Aurora global databases, which span multiple AWS
regions. Existing clusters can be modified to add new Regions as well.
i. Getting started with Amazon Aurora global databases
f. If your workload includes AWS Key Management Service (AWS KMS) encryption keys, consider
whether multi-Region keys are appropriate for your application.
i. Multi-Region keys in AWS KMS
g. For other AWS service features, see this blog series on Creating a Multi-Region Application
with AWS Services series
Resources
Related documents:
Related videos:
• AWS re:Invent 2018: Architecture Patterns for Multi-Region Active-Active Applications (ARC209-
R2)
• Auth0: Multi-Region High-Availability Architecture that Scales to 1.5B+ Logins a Month with
automated failover
For stateful server-based workloads deployed to an on-premises data center, you can use AWS
Elastic Disaster Recovery to protect your workloads in AWS. If you are already hosted in AWS, you
can use Elastic Disaster Recovery to protect your workload to an alternative Availability Zone or
Region. Elastic Disaster Recovery uses continual block-level replication to a lightweight staging
area to provide fast, reliable recovery of on-premises and cloud-based applications.
Implementation steps
1. Implement self-healing. Deploy your instances or containers using automatic scaling when
possible. If you cannot use automatic scaling, use automatic recovery for EC2 instances or
implement self-healing automation based on Amazon EC2 or ECS container lifecycle events.
• Use Amazon EC2 Auto Scaling groups for instances and container workloads that have no
requirements for a single instance IP address, private IP address, Elastic IP address, and
instance metadata.
• The launch template user data can be used to implement automation that can self-heal
most workloads.
• Use automatic recovery of Amazon EC2 instances for workloads that require a single instance
IP address, private IP address, Elastic IP address, and instance metadata.
• Automatic recovery will send recovery status alerts to an SNS topic when an instance failure is
detected.
• Use Amazon EC2 instance lifecycle events or Amazon ECS events to automate self-healing
where automatic scaling or EC2 recovery cannot be used.
• Use the events to invoke automation that will heal your component according to the
process logic you require.
• Protect stateful workloads that are limited to a single location using AWS Elastic Disaster
Recovery.
Resources
Related documents:
and Regions to provide fault isolation, but the concept of fault isolation can be extended to your
workload’s architecture as well.
The overall workload is partitioned into cells by a partition key. This key needs to align with the grain of
the service, or the natural way that a service's workload can be subdivided with minimal cross-cell
interactions. Examples of partition keys are customer ID, resource ID, or any other parameter easily
accessible in most API calls. A cell routing layer distributes requests to individual cells based on the
partition key and presents a single endpoint to clients.
Implementation steps
When designing a cell-based architecture, there are several design considerations to consider:
1. Partition key: Special consideration should be taken while choosing the partition key.
• It should align with the grain of the service, or the natural way that a service's workload can
be subdivided with minimal cross-cell interactions. Examples are customer ID or resource
ID.
• The partition key must be available in all requests, either directly or in a way that could be
easily inferred deterministically by other parameters.
6. Code deployment: A staggered code deployment strategy should be preferred over deploying
code changes to all cells at the same time.
• This helps minimize potential failure to multiple cells due to a bad deployment or human
error. For more detail, see Automating safe, hands-off deployment.
Resources
Related documents:
Related videos:
• AWS re:Invent 2018: Close Loops and Opening Minds: How to Take Control of Systems, Big and
Small
• AWS re:Invent 2018: How AWS Minimizes the Blast Radius of Failures (ARC338)
• Shuffle-sharding: AWS re:Invent 2019: Introducing The Amazon Builders’ Library (DOP328)
• AWS Summit ANZ 2021 - Everything fails, all the time: Designing for resilience
Related examples:
Benefits of establishing this best practice: Having appropriate monitoring at all layers allows you
to reduce recovery time by reducing time to detection.
Implementation guidance
Identify all workloads that will be reviewed for monitoring. Once you have identified all
components of the workload that need to be monitored, determine the monitoring interval. The
monitoring interval will have a direct impact on how fast recovery can be
initiated based on the time it takes to detect a failure. The mean time to detection (MTTD) is the
amount of time between a failure occurring and when repair operations begin. The list of services
should be extensive and complete.
Monitoring must cover all layers of the application stack including application, platform,
infrastructure, and network.
Your monitoring strategy should consider the impact of gray failures. For more detail on gray
failures, see Gray failures in the Advanced Multi-AZ Resilience Patterns whitepaper.
Implementation steps
• Your monitoring interval is dependent on how quickly you must recover. Your recovery time
is driven by the time it takes to recover, so you must determine the frequency of collection by
accounting for this time and your recovery time objective (RTO).
• Configure detailed monitoring for components and managed services.
• Determine if detailed monitoring for EC2 instances and Auto Scaling is necessary. Detailed
monitoring provides one minute interval metrics, and default monitoring provides five minute
interval metrics.
• Determine if enhanced monitoring for RDS is necessary. Enhanced monitoring uses an agent
on RDS instances to get useful information about different processes or threads.
• Determine the monitoring requirements of critical serverless components for Lambda, API
Gateway, Amazon EKS, Amazon ECS, and all types of load balancers.
• Determine the monitoring requirements of storage components for Amazon S3, Amazon FSx,
Amazon EFS, and Amazon EBS.
Related videos:
Related examples:
• Well-Architected Lab: Level 300: Implementing Health Checks and Managing Dependencies to
Improve Reliability
• One Observability Workshop: Explore X-Ray
Related tools:
• CloudWatch
• AWS X-Ray
If a resource failure occurs, healthy resources should continue to serve requests. For location
impairments (such as Availability Zone or AWS Region), ensure that you have systems in place to
fail over to healthy resources in unimpaired locations.
When designing a service, distribute load across resources, Availability Zones, or Regions.
Therefore, failure of an individual resource or impairment can be mitigated by shifting traffic to
the remaining healthy resources.
Implementation guidance
AWS services, such as Elastic Load Balancing and Amazon EC2 Auto Scaling, help distribute load
across resources and Availability Zones. Therefore, failure of an individual resource (such as an EC2
instance) or impairment of an Availability Zone can be mitigated by shifting traffic to remaining
healthy resources.
For multi-Region workloads, designs are more complicated. For example, cross-Region read replicas
allow you to deploy your data to multiple AWS Regions. However, failover is still required to
promote the read replica to primary and then point your traffic to the new endpoint. Amazon
Route 53, Amazon Application Recovery Controller (ARC), Amazon CloudFront, and AWS Global
Accelerator can help route traffic across AWS Regions.
AWS services, such as Amazon S3, Lambda, API Gateway, Amazon SQS, Amazon SNS, Amazon SES,
Amazon Pinpoint, Amazon ECR, AWS Certificate Manager, EventBridge, or Amazon DynamoDB, are
automatically deployed to multiple Availability Zones by AWS. In case of failure, these AWS services
automatically route traffic to healthy locations. Data is redundantly stored in multiple Availability
Zones and remains available.
For Amazon RDS, Amazon Aurora, Amazon Redshift, Amazon EKS, or Amazon ECS, Multi-AZ is
a configuration option. AWS can direct traffic to the healthy instance if failover is initiated. This
failover action may be taken by AWS or as required by the customer.
For Amazon EC2 instances, Amazon Redshift, Amazon ECS tasks, or Amazon EKS pods, you choose
which Availability Zones to deploy to. For some designs, Elastic Load Balancing provides the
solution to detect instances in unhealthy zones and route traffic to the healthy ones. Elastic Load
Balancing can also route traffic to components in your on-premises data center.
For Multi-Region traffic failover, rerouting can leverage Amazon Route 53, Amazon Application
Recovery Controller, AWS Global Accelerator, Route 53 Private DNS for VPCs, or CloudFront to
provide a way to define internet domains and assign routing policies, including health checks, to
route traffic to healthy Regions. AWS Global Accelerator provides static IP addresses that act as a
fixed entry point to your application, then route to endpoints in AWS Regions of your choosing,
using the AWS global network instead of the internet for better performance and reliability.
Implementation steps
• Create failover designs for all appropriate applications and services. Isolate each architecture
component and create failover designs meeting RTO and RPO for each component.
Related examples:
For self-managed applications and cross-Region healing, recovery designs and automated healing
processes can be pulled from existing best practices.
The ability to restart or remove a resource is an important tool to remediate failures. A best
practice is to make services stateless where possible. This prevents loss of data or availability
on resource restart. In the cloud, you can (and generally should) replace the entire resource (for
example, a compute instance or serverless function) as part of the restart. The restart itself is a
simple and reliable way to recover from failure. Many different types of failures occur in workloads.
Failures can occur in hardware, software, communications, and operations.
reduced capacity while it's recovering a new node. Example services are Mongo, DynamoDB
Accelerator, Amazon Redshift, Amazon EMR, Cassandra, Kafka, MSK-EC2, Couchbase, ELK, and
Amazon OpenSearch Service. Many of these services can be designed with additional auto healing
features. Some cluster technologies must generate an alert upon the loss of a node, triggering an
automated or manual workflow to recreate a new node. This workflow can be automated using
AWS Systems Manager to remediate issues quickly.
Amazon EventBridge can be used to monitor and filter for events such as CloudWatch alarms
or changes in state in other AWS services. Based on event information, it can then invoke AWS
Lambda, Systems Manager Automation, or other targets to run custom remediation logic on your
workload. Amazon EC2 Auto Scaling can be configured to check for EC2 instance health. If the
instance is in any state other than running, or if the system status is impaired, Amazon EC2 Auto
Scaling considers the instance to be unhealthy and launches a replacement instance. For large-
scale replacements (such as the loss of an entire Availability Zone), static stability is preferred for
high availability.
Implementation steps
• Use Auto Scaling groups to deploy tiers in a workload. Auto Scaling can perform self-healing on
stateless applications and add or remove capacity.
• For compute instances noted previously, use load balancing and choose the appropriate type of
load balancer.
• Consider healing for Amazon RDS. With standby instances, configure for automatic failover to the
standby instance. For Amazon RDS read replicas, an automated workflow is required to promote a
read replica to primary.
• Implement automatic recovery on EC2 instances that host applications which cannot be deployed
in multiple locations and that can tolerate rebooting upon failures (see the example after this
list). Automatic recovery can be used to replace failed hardware and restart the instance when the
application is not capable of being deployed in multiple locations. The instance metadata and
associated IP addresses are kept, as well as the EBS volumes and mount points to Amazon Elastic
File System or File Systems for Lustre and Windows. Using AWS OpsWorks, you can configure
automatic healing of EC2 instances at the layer level.
• Implement automated recovery using AWS Step Functions and AWS Lambda when you cannot
use automatic scaling or automatic recovery, or when automatic recovery fails. In these cases, you
can automate healing using AWS Step Functions and AWS Lambda.
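As an example of the automatic recovery approach referenced above, a CloudWatch alarm can recover
an instance when the system status check fails (the instance ID is a placeholder, and the recover
action ARN uses the Region of the instance):

  aws cloudwatch put-metric-alarm \
    --alarm-name recover-web-01 \
    --namespace AWS/EC2 \
    --metric-name StatusCheckFailed_System \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --statistic Minimum \
    --period 60 \
    --evaluation-periods 2 \
    --threshold 1 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --alarm-actions arn:aws:automate:us-east-1:ec2:recover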
Related tools:
• CloudWatch
• AWS X-Ray
REL11-BP04 Rely on the data plane and not the control plane during recovery
Control planes provide the administrative APIs used to create, read and describe, update,
delete, and list (CRUDL) resources, while data planes handle day-to-day service traffic. When
implementing recovery or mitigation responses to potentially resiliency-impacting events, focus on
using a minimal number of control plane operations to recover, rescale, restore, heal, or failover the
service. Data plane action should supersede any activity during these degradation events.
For example, the following are all control plane actions: launching a new compute instance,
creating block storage, and describing queue services. When you launch compute instances, the
control plane has to perform multiple tasks like finding a physical host with capacity, allocating
network interfaces, preparing local block storage volumes, generating credentials, and adding
security rules. Control planes tend to involve complicated orchestration.
Desired outcome: When a resource enters an impaired state, the system is capable of
automatically or manually recovering by shifting traffic from impaired to healthy resources.
Common anti-patterns:
• Relying on extensive, multi-service, multi-API control plane actions to remediate any category of
impairment.
Benefits of establishing this best practice: Increased success rate for automated remediation can
reduce your mean time to recovery and improve availability of the workload.
Implementation steps
For each workload that needs to be restored after a degradation event, evaluate the failover
runbook, high availability design, auto healing design, or HA resource restoration plan. Identify
each action that might be considered a control plane action.
• Auto Scaling (control plane) compared to pre-scaled Amazon EC2 resources (data plane)
• Amazon EC2 instance scaling (control plane) compared to AWS Lambda scaling (data plane)
• Assess any designs using Kubernetes and the nature of the control plane actions. Adding pods
is a data plane action in Kubernetes. Actions should be limited to adding pods and not adding
nodes. Using over-provisioned nodes is the preferred method to limit control plane actions.
Consider alternate approaches that allow for data plane actions to achieve the same remediation.
• Route 53 Record change (control plane) or Amazon Application Recovery Controller (data plane)
• Route 53 Health checks for more automated updates
Consider some services in a secondary Region, if the service is mission critical, to allow for more
control plane and data plane actions in an unaffected Region.
• Amazon EC2 Auto Scaling or Amazon EKS in a primary Region compared to Amazon EC2 Auto
Scaling or Amazon EKS in a secondary Region and routing traffic to secondary Region (control
plane action)
• Promote a read replica in the secondary Region to primary, compared to attempting the same
action in the primary Region (control plane action)
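As an illustration of favoring the data plane, the following sketch changes a routing control state through the Amazon Application Recovery Controller cluster data plane rather than editing Route 53 records (a control plane action). It assumes you already operate an ARC cluster; the endpoint URLs and routing control ARN are placeholders.

# A minimal sketch, assuming an existing Application Recovery Controller cluster.
# Routing control updates are data plane calls made against the cluster's
# regional endpoints; try each endpoint until one succeeds.
import boto3

CLUSTER_ENDPOINTS = [  # placeholder endpoints copied from your cluster details
    {"Endpoint": "https://example.route53-recovery-cluster.us-west-2.amazonaws.com/v1",
     "Region": "us-west-2"},
]
ROUTING_CONTROL_ARN = "arn:aws:route53-recovery-control::123456789012:controlpanel/example/routingcontrol/example"

def turn_traffic_off_for_impaired_cell():
    for endpoint in CLUSTER_ENDPOINTS:
        try:
            client = boto3.client(
                "route53-recovery-cluster",
                region_name=endpoint["Region"],
                endpoint_url=endpoint["Endpoint"],
            )
            client.update_routing_control_state(
                RoutingControlArn=ROUTING_CONTROL_ARN,
                RoutingControlState="Off",  # shift traffic away from the impaired cell
            )
            return endpoint["Region"]
        except Exception:
            continue  # try the next cluster endpoint
    raise RuntimeError("No cluster endpoint reachable")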
Resources
• Availability Definition
• REL11-BP01 Monitor all components of the workload to detect failures
Related documents:
• Amazon CloudWatch
• AWS X-Ray
Workloads should be statically stable and only operate in a single normal mode. Bimodal behavior
is when your workload exhibits different behavior under normal and failure modes.
For example, you might try to recover from an Availability Zone failure by launching new
instances in a different Availability Zone. This can result in a bimodal response during a failure
mode. You should instead build workloads that are statically stable and operate within only one
mode. In this example, those instances should have been provisioned in the second Availability
Zone before the failure. This static stability design verifies that the workload only operates in a
single mode.
Desired outcome: Workloads do not exhibit bimodal behavior during normal and failure modes.
Common anti-patterns:
Benefits of establishing this best practice: Workloads running with statically stable designs are
capable of having predictable outcomes during normal and failure events.
Implementation guidance
Bimodal behavior occurs when your workload exhibits different behavior under normal and failure
modes (for example, relying on launching new instances if an Availability Zone fails). In contrast, a
statically stable Amazon EC2 design provisions enough instances in each Availability Zone to
handle the workload's load if one AZ were removed. Elastic Load Balancing or Amazon Route 53
health checks would then shift load away from the impaired instances. After traffic has shifted, use
AWS Auto Scaling to asynchronously replace instances from the failed zone and launch them in the
healthy zones.
Another example of bimodal behavior is allowing clients to bypass your workload cache when
failures occur. This might seem to be a solution that accommodates client needs but it can
significantly change the demands on your workload and is likely to result in failures.
Assess critical workloads to determine what workloads require this type of resilience design. For
those that are deemed critical, each application component must be reviewed. Example types of
services that require static stability evaluations are:
• Storage: Amazon S3 (Single Zone), Amazon EFS (mounts), Amazon FSx (mounts)
Implementation steps
• Build systems that are statically stable and operate in only one mode. In this case, provision
enough instances in each Availability Zone or Region to handle the workload capacity if one
Availability Zone or Region were removed. A variety of services can be used for routing to
healthy resources, such as:
• Configure database read replicas to account for the loss of a single primary instance or a read
replica. If traffic is being served by read replicas, the quantity in each Availability Zone and each
Region should equate to the overall need in case of the zone or Region failure.
• Store critical data in Amazon S3 using a storage class that is designed to be statically stable in
case of an Availability Zone failure. If the Amazon S3 One Zone-IA storage class is used, this
should not be considered statically stable, as the loss of that zone prevents access to the stored
data.
• Load balancers are sometimes configured incorrectly or by design to service a specific Availability
Zone. In this case, the statically stable design might be to spread a workload across multiple
AZs in a more complex design. The original design may be used to reduce interzone traffic for
security, latency, or cost reasons.
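A quick way to assess static stability is to check how a workload's capacity is spread across Availability Zones. The following is a minimal sketch that counts running EC2 instances per zone for a hypothetical workload tag; interpret the result against the capacity needed if any one zone were removed.

# A minimal sketch (hypothetical tag values) that reports how a workload's
# running EC2 instances are distributed across Availability Zones.
import boto3
from collections import Counter

def az_distribution(tag_key="Workload", tag_value="payments"):
    ec2 = boto3.client("ec2")
    paginator = ec2.get_paginator("describe_instances")
    counts = Counter()
    for page in paginator.paginate(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    ):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                counts[instance["Placement"]["AvailabilityZone"]] += 1
    return counts

if __name__ == "__main__":
    print(az_distribution())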
can detect patterns of problems, including those addressed by auto healing, so that you can
resolve root cause issues.
Resilient systems are designed so that degradation events are immediately communicated to
the appropriate teams. These notifications should be sent through one or many communication
channels.
Desired outcome: Alerts are immediately sent to operations teams when thresholds are breached,
such as error rates, latency, or other critical key performance indicator (KPI) metrics, so that these
issues are resolved as soon as possible and user impact is avoided or minimized.
Common anti-patterns:
Benefits of establishing this best practice: Notifications of recovery make operational and
business teams aware of service degradations so that they can react immediately to minimize both
mean time to detect (MTTD) and mean time to repair (MTTR). Notifications of recovery events also
assure that you don't ignore problems that occur infrequently.
Level of risk exposed if this best practice is not established: Medium. Failure to implement
appropriate monitoring and events notification mechanisms can result in failure to detect patterns
of problems, including those addressed by auto healing. A team will only be made aware of system
degradation when users contact customer service or by chance.
Implementation guidance
When defining a monitoring strategy, a triggered alarm is a common event. This event would
likely contain an identifier for the alarm, the alarm state (such as IN ALARM or OK), and details
of what triggered it. In many cases, an alarm event should be detected and an email notification
sent. This is an example of an action on an alarm. Alarm notification is critical in observability,
as it informs the right people that there is an issue. However, when action on events mature in
Related tools:
• CloudWatch
• AWS X-Ray
REL11-BP07 Architect your product to meet availability targets and uptime service level
agreements (SLAs)
Architect your product to meet availability targets and uptime service level agreements (SLAs). If
you publish or privately agree to availability targets or uptime SLAs, verify that your architecture
and operational processes are designed to support them.
Desired outcome: Each application has a defined target for availability and SLA for performance
metrics, which can be monitored and maintained in order to meet business outcomes.
Common anti-patterns:
Benefits of establishing this best practice: Designing applications based on key resiliency targets
helps you meet business objectives and customer expectations. These objectives help drive the
application design process that evaluates different technologies and considers various tradeoffs.
Resources
Related documents:
Common anti-patterns:
• Planning to deploy a workload without knowing the processes to diagnose issues or respond to
incidents.
• Unplanned decisions about which systems to gather logs and metrics from when investigating an
event.
• Not retaining metrics and events long enough to be able to retrieve the data.
Benefits of establishing this best practice: Capturing playbooks ensures that processes can be
consistently followed. Codifying your playbooks limits the introduction of errors from manual
activity. Automating playbooks shortens the time to respond to an event by reducing the need
for team member intervention, or by providing team members with additional information when
their intervention begins.
Implementation guidance
• Use playbooks to identify issues. Playbooks are documented processes to investigate issues.
Allow consistent and prompt responses to failure scenarios by documenting processes in
playbooks. Playbooks must contain the information and guidance necessary for an adequately
skilled person to gather applicable information, identify potential sources of failure, isolate
faults, and determine contributing factors (perform post-incident analysis).
• Implement playbooks as code. Perform your operations as code by scripting your playbooks
to ensure consistency and reduce errors caused by manual processes. Playbooks can
be composed of multiple scripts representing the different steps that might be necessary to
identify the contributing factors to an issue. Runbook activities can be invoked or performed
as part of playbook activities, or a playbook might be run in response to identified
events.
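As an illustration of a playbook step implemented as code, the following sketch gathers recent error logs for a service so an investigation starts with data in hand. The log group name and filter pattern are placeholders for your workload.

# A minimal sketch of one playbook step: collect recent ERROR log events from a
# CloudWatch Logs log group (placeholder name) for the on-call engineer.
import time
import boto3

logs = boto3.client("logs")

def gather_recent_errors(log_group="/my-workload/app", minutes=15):
    now_ms = int(time.time() * 1000)
    response = logs.filter_log_events(
        logGroupName=log_group,
        startTime=now_ms - minutes * 60 * 1000,
        endTime=now_ms,
        filterPattern="ERROR",
        limit=100,
    )
    return [event["message"] for event in response.get("events", [])]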
• Focus on assigning blame rather than understanding the root cause, creating a culture of fear
and hindering open communication
• Failure to share insights, which keeps incident analysis findings within a small group and
prevents others from benefiting from the lessons learned
• No mechanism to capture institutional knowledge, thereby losing valuable insights by not
preserving the lessons-learned in the form of updated best practices and resulting in repeat
incidents with the same or similar root cause
Benefits of establishing this best practice: Conducting post-incident analysis and sharing
the results permits other workloads to mitigate the risk if they have implemented the same
contributing factors, and allows them to implement the mitigation or automated recovery before
an incident occurs.
Implementation guidance
Good post-incident analysis provides opportunities to propose common solutions for problems
with architecture patterns that are used in other places in your systems.
Encourage a culture that focuses on learning and improvement rather than assigning blame.
Emphasize that the goal is to prevent future incidents, not to penalize individuals.
Develop well-defined procedures for conducting post-incident analyses. These procedures should
outline the steps to be taken, the information to be collected, and the key questions to be
addressed during the analysis. Investigate incidents thoroughly, going beyond immediate causes to
identify root causes and contributing factors. Use techniques like the five whys to delve deep into
the underlying issues.
Maintain a repository of lessons learned from incident analyses. This institutional knowledge can
serve as a reference for future incidents and prevention efforts. Share findings and insights from
post-incident analyses, and consider holding open-invite post-incident review meetings to discuss
lessons learned.
Related videos:
Use techniques such as unit tests and integration tests that validate required functionality.
You achieve the best outcomes when these tests are run automatically as part of build and
deployment actions. For instance, using AWS CodePipeline, developers commit changes to a source
repository where CodePipeline automatically detects the changes. Those changes are built, and
tests are run. After the tests are complete, the built code is deployed to staging servers for testing.
From the staging server, CodePipeline runs more tests, such as integration or load tests. Upon
the successful completion of those tests, CodePipeline deploys the tested and approved code to
production instances.
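The following is a minimal sketch of the kind of unit test such a pipeline stage could run automatically on every commit (using pytest, which is an assumption; any test framework works). The order-total function is a hypothetical example of required functionality.

# A minimal sketch of a unit test a build stage could run on every commit.
# order_total is a hypothetical piece of "required functionality".
def order_total(prices, tax_rate=0.1):
    subtotal = sum(prices)
    return round(subtotal * (1 + tax_rate), 2)

def test_order_total_applies_tax():
    assert order_total([10.00, 5.00]) == 16.50

def test_order_total_empty_cart_is_zero():
    assert order_total([]) == 0.0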
Additionally, experience shows that synthetic transaction testing (also known as canary testing,
but not to be confused with canary deployments) that can run and simulate customer behavior is
among the most important testing processes. Run these tests constantly against your workload
endpoints from diverse remote locations. Amazon CloudWatch Synthetics allows you to create
canaries to monitor your endpoints and APIs.
Implementation guidance
• Test functional requirements. These include unit tests and integration tests that validate required
functionality.
• Use CodePipeline with AWS CodeBuild to test code and run builds
• AWS CodePipeline Adds Support for Unit and Custom Integration Testing with AWS CodeBuild
Resources
Related documents:
Run chaos experiments regularly in environments that are in or as close to production as possible
to understand how your system responds to adverse conditions.
Desired outcome:
The resilience of the workload is regularly verified by applying chaos engineering in the form
of fault injection experiments or injection of unexpected load, in addition to resilience testing
that validates known expected behavior of your workload during an event. Combine both chaos
engineering and resilience testing to gain confidence that your workload can survive component
failure and can recover from unexpected disruptions with minimal to no impact.
Common anti-patterns:
• Designing for resiliency, but not verifying how the workload functions as a whole when faults
occur.
• Never experimenting under real-world conditions and expected load.
• Not treating your experiments as code or maintaining them through the development cycle.
• Not running chaos experiments both as part of your CI/CD pipeline, as well as outside of
deployments.
• Neglecting to use past post-incident analyses when determining which faults to experiment with.
Benefits of establishing this best practice: Injecting faults to verify the resilience of your workload
allows you to gain confidence that the recovery procedures of your resilient design will work in the
case of a real fault.
These faults include networking effects such as latency, dropped messages, and DNS failures,
which could include the inability to resolve a name, reach the DNS service, or establish connections to
dependent services.
AWS Fault Injection Service (AWS FIS) is a fully managed service for running fault injection
experiments that can be used as part of your CD pipeline, or outside of the pipeline. AWS FIS is a
good choice to use during chaos engineering game days. It supports simultaneously introducing
faults across different types of resources including Amazon EC2, Amazon Elastic Container Service
(Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon RDS. These faults
include termination of resources, forcing failovers, stressing CPU or memory, throttling, latency,
and packet loss. Since it is integrated with Amazon CloudWatch Alarms, you can set up stop
conditions as guardrails to roll back an experiment if it causes unexpected impact.
AWS Fault Injection Service integrates with AWS resources to allow you to run fault injection
experiments for your workloads.
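The following is a minimal sketch of starting an AWS FIS experiment from a pipeline or script. It assumes you have already created an experiment template with a CloudWatch alarm configured as a stop condition; the template ID is a placeholder.

# A minimal sketch of starting an AWS FIS experiment from an existing template.
import boto3

fis = boto3.client("fis")

def run_experiment(template_id="EXT1a2b3c4d5e6f7"):  # placeholder template ID
    response = fis.start_experiment(experimentTemplateId=template_id)
    experiment = response["experiment"]
    print(f"Started experiment {experiment['id']} "
          f"(state: {experiment['state']['status']})")
    return experiment["id"]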
There are also several third-party options for fault injection experiments. These include open-
source tools such as Chaos Toolkit, Chaos Mesh, and Litmus Chaos, as well as commercial options
like Gremlin. To expand the scope of faults that can be injected on AWS, AWS FIS integrates
with Chaos Mesh and Litmus Chaos, allowing you to coordinate fault injection workflows among
multiple tools. For example, you can run a stress test on a pod’s CPU using Chaos Mesh or Litmus
faults while terminating a randomly selected percentage of cluster nodes using AWS FIS fault
actions.
Chaos engineering and continuous resilience flywheel, using the scientific method by Adrian
Hornsby.
a. Define steady state as some measurable output of a workload that indicates normal behavior.
Your workload exhibits steady state if it is operating reliably and as expected. Therefore,
validate that your workload is healthy before defining steady state. Steady state does not
necessarily mean no impact to the workload when a fault occurs, as a certain percentage
in faults could be within acceptable limits. The steady state is your baseline that you will
observe during the experiment, which will highlight anomalies if your hypothesis defined in
the next step does not turn out as expected.
For example, a steady state of a payments system can be defined as the processing of 300
TPS with a success rate of 99% and round-trip time of 500 ms.
b. Form a hypothesis about how the workload will react to the fault.
with the experiment. There are several options for injecting the faults. For workloads on
AWS, AWS FIS provides many predefined fault simulations called actions. You can also define
custom actions that run in AWS FIS using AWS Systems Manager documents.
We discourage the use of custom scripts for chaos experiments, unless the scripts have
the capabilities to understand the current state of the workload, are able to emit logs, and
provide mechanisms for rollbacks and stop conditions where possible.
An effective framework or toolset which supports chaos engineering should track the current
state of an experiment, emit logs, and provide rollback mechanisms to support the controlled
running of an experiment. Start with an established service like AWS FIS that allows you
to perform experiments with a clearly defined scope and safety mechanisms that rollback
the experiment if the experiment introduces unexpected turbulence. To learn about a wider
variety of experiments using AWS FIS, also see the Resilient and Well-Architected Apps with
Chaos Engineering lab. Also, AWS Resilience Hub will analyze your workload and create
experiments that you can choose to implement and run in AWS FIS.
Note
For every experiment, clearly understand the scope and its impact. We recommend
that faults should be simulated first on a non-production environment before being
run in production.
Experiments should run in production under real-world load using canary deployments
that spin up both a control and experimental system deployment, where feasible. Running
experiments during off-peak times is a good practice to mitigate potential impact when first
experimenting in production. Also, if using actual customer traffic poses too much risk, you
can run experiments using synthetic traffic on production infrastructure against the control
and experimental deployments. When using production is not possible, run experiments in
pre-production environments that are as close to production as possible.
You must establish and monitor guardrails to ensure the experiment does not impact
production traffic or other systems beyond acceptable limits. Establish stop conditions
to stop an experiment if it reaches a threshold on a guardrail metric that you define. This
should include the metrics for steady state for the workload, as well as the metric against the
components into which you’re injecting the fault. A synthetic monitor (also known as a user
canary) is one metric you should usually include as a user proxy. Stop conditions for AWS FIS
In our two previous examples, we include the steady state metrics of less than 0.01% increase
in server-side (5xx) errors and less than one minute of database read and write errors.
The 5xx errors are a good metric because they are a consequence of the failure mode that
a client of the workload will experience directly. The database errors measurement is good
as a direct consequence of the fault, but should also be supplemented with a client impact
measurement such as failed customer requests or errors surfaced to the client. Additionally,
include a synthetic monitor (also known as a user canary) on any APIs or URIs directly
accessed by the client of your workload.
If steady state was not maintained, then investigate how the workload design can be
improved to mitigate the fault, applying the best practices of the AWS Well-Architected
Reliability pillar. Additional guidance and resources can be found in the AWS Builder’s Library,
which hosts articles about how to improve your health checks or employ retries with backoff
in your application code, among others.
After these changes have been implemented, run the experiment again (shown by the dotted
line in the chaos engineering flywheel) to determine their effectiveness. If the verify step
indicates the hypothesis holds true, then the workload will be in steady state, and the cycle
continues.
A chaos experiment is a cycle, and experiments should be run regularly as part of chaos
engineering. After a workload meets the experiment’s hypothesis, the experiment should be
automated to run continually as a regression part of your CI/CD pipeline. To learn how to do
this, see this blog on how to run AWS FIS experiments using AWS CodePipeline. This lab on
recurrent AWS FIS experiments in a CI/CD pipeline allows you to work hands-on.
Fault injection experiments are also a part of game days (see REL12-BP06 Conduct game
days regularly). Game days simulate a failure or event to verify systems, processes, and team
responses. The purpose is to actually perform the actions the team would perform as if an
exceptional event happened.
Results for fault injection experiments must be captured and persisted. Include all necessary
data (such as time, workload, and conditions) to be able to later analyze experiment results and trends.
Related tools:
Use game days to regularly exercise your procedures for responding to events and failures as close
to production as possible (including in production environments) with the people who will be
involved in actual failure scenarios. Game days enforce measures to ensure that production events
do not impact users.
Game days simulate a failure or event to test systems, processes, and team responses. The
purpose is to actually perform the actions the team would perform as if an exceptional event
happened. This will help you understand where improvements can be made and can help develop
organizational experience in dealing with events. These should be conducted regularly so that your
team builds muscle memory on how to respond.
After your design for resiliency is in place and has been tested in non-production environments,
a game day is the way to ensure that everything works as planned in production. A game day,
especially the first one, is an “all hands on deck” activity where engineers and operations are
all informed when it will happen, and what will occur. Runbooks are in place. Simulated events
are run, including possible failure events, in the production systems in the prescribed manner,
and impact is assessed. If all systems operate as designed, detection and self-healing will occur
with little to no impact. However, if negative impact is observed, the test is rolled back and the
workload issues are remedied, manually if necessary (using the runbook). Since game days often
take place in production, all precautions should be taken to ensure that there is no impact on
availability to your customers.
Common anti-patterns:
Best practices
• REL13-BP01 Define recovery objectives for downtime and data loss
The workload has a recovery time objective (RTO) and recovery point objective (RPO).
Recovery Time Objective (RTO) is the maximum acceptable delay between the interruption of
service and restoration of service. This determines what is considered an acceptable time window
when service is unavailable.
Recovery Point Objective (RPO) is the maximum acceptable amount of time since the last data
recovery point. This determines what is considered an acceptable loss of data between the last
recovery point and the interruption of service.
RTO and RPO values are important considerations when selecting an appropriate Disaster Recovery
(DR) strategy for your workload. These objectives are determined by the business, and then used by
technical teams to select and implement a DR strategy.
Desired Outcome:
Every workload has an assigned RTO and RPO, defined based on business impact. The workload
is assigned to a predefined tier, defining service availability and acceptable loss of data, with
an associated RTO and RPO. If such tiering is not possible then this can be assigned bespoke
per workload, with the intent to create tiers later. RTO and RPO are used as one of the primary
considerations for selection of a disaster recovery strategy implementation for the workload.
Additional considerations in picking a DR strategy are cost constraints, workload dependencies, and
operational requirements.
For RTO, understand impact based on duration of an outage. Is it linear, or are there nonlinear
implications? (For example, after four hours, you shut down a manufacturing line until the start of
the next shift).
Implementation guidance
For the given workload, you must understand the impact of downtime and lost data on your
business. The impact generally grows larger with greater downtime or data loss, but the shape
of this growth can differ based on the workload type. For example, you may be able to tolerate
downtime for up to an hour with little impact, but after that impact quickly rises. Impact to
business manifests in many forms including monetary cost (such as lost revenue), customer trust
(and impact to reputation), operational issues (such as missing payroll or decreased productivity),
and regulatory risk. Use the following steps to understand these impacts, and set RTO and RPO for
your workload.
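The following is a minimal sketch of how criticality tiers might be recorded so that workloads inherit their RTO and RPO from an assigned tier. The tier names and values are illustrative placeholders, not recommendations; your business stakeholders set the real numbers.

# A minimal sketch of tier-based recovery objectives (illustrative values only).
from datetime import timedelta

TIERS = {
    "critical": {"rto": timedelta(minutes=15), "rpo": timedelta(minutes=5)},
    "high":     {"rto": timedelta(hours=1),    "rpo": timedelta(minutes=15)},
    "medium":   {"rto": timedelta(hours=4),    "rpo": timedelta(hours=1)},
    "low":      {"rto": timedelta(hours=24),   "rpo": timedelta(hours=4)},
}

def objectives_for(workload_tier):
    tier = TIERS[workload_tier]
    return f"RTO={tier['rto']}, RPO={tier['rpo']}"

print(objectives_for("high"))  # RTO=1:00:00, RPO=0:15:00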
Implementation Steps
1. Determine your business stakeholders for this workload, and engage with them to implement
these steps. Recovery objectives for a workload are a business decision. Technical teams then
work with business stakeholders to use these objectives to select a DR strategy.
Note
For steps 2 and 3, you can use the section called “Implementation worksheet”.
2. Gather the necessary information to make a decision by answering the questions below.
3. Do you have categories or tiers of criticality for workload impact in your organization?
b. If no, then establish these categories. Create five or fewer categories and refine the range of
your recovery time objective for each one. Example categories include: critical, high, medium,
low. To understand how workloads map to categories, consider whether the workload is
mission critical, business important, or non-business driving.
c. Set workload RTO and RPO based on category. Always choose a category more strict (lower
RTO and RPO) than the raw values calculated entering this step. If this results in an unsuitably
large change in value, then consider creating a new category.
4. Based on these answers, assign RTO and RPO values to the workload. This can be done directly,
or by assigning the workload to a predefined tier of service.
5. Document the disaster recovery plan (DRP) for this workload, which is a part of your
organization’s business continuity plan (BCP), in a location accessible to the workload team and
stakeholders.
b. Choose recovery objectives that are achievable given the recovery capabilities of downstream
dependencies. Non-critical downstream dependencies (ones you can “work around”) can
be excluded. Or, work with critical downstream dependencies to improve their recovery
capabilities where necessary.
Additional questions
Consider these questions, and how they may apply to this workload:
4. Do you have different RTO and RPO depending on the type of outage (Region vs. AZ, etc.)?
5. Is there a specific time (seasonality, sales events, product launches) when your RTO/RPO may
change? If so, what is the different measurement and time boundary?
8. What other operational impacts may occur if workload is disrupted? For example, impact to
employee productivity if email systems are unavailable, or if Payroll systems are unable to
submit transactions.
9. How do workload RTO and RPO align with Line of Business and Organizational DR Strategy?
10.Are there internal contractual obligations for providing a service? Are there penalties for not
meeting them?
Implementation worksheet
You can use this worksheet for implementation steps 2 and 3. You may adjust this worksheet to
suit your specific needs, such as adding additional questions.
Related videos:
• AWS re:Invent 2018: Architecture Patterns for Multi-Region Active-Active Applications (ARC209-
R2)
Define a disaster recovery (DR) strategy that meets your workload's recovery objectives. Choose a
strategy such as backup and restore, standby (active/passive), or active/active.
Desired outcome: For each workload, there is a defined and implemented DR strategy that allows
the workload to achieve DR objectives. DR strategies between workloads make use of reusable
patterns (such as the strategies previously described).
Benefits of establishing this best practice:
• Using defined recovery strategies allows you to use common tooling and test procedures.
• Using defined recovery strategies improves knowledge sharing between teams and
implementation of DR on the workloads they own.
Level of risk exposed if this best practice is not established: High. Without a planned,
implemented, and tested DR strategy, you are unlikely to achieve recovery objectives in the event
of a disaster.
• Pilot light (RPO in minutes, RTO in tens of minutes): Provision a copy of your core workload
infrastructure in the recovery Region. Replicate your data into the recovery Region and create
backups of it there. Resources required to support data replication and backup, such as
databases and object storage, are always on. Other elements such as application servers or
serverless compute are not deployed, but can be created when needed with the necessary
configuration and application code.
• Warm standby (RPO in seconds, RTO in minutes): Maintain a scaled-down but fully functional
version of your workload always running in the recovery Region. Business-critical systems are
fully duplicated and are always on, but with a scaled down fleet. Data is replicated and live in the
recovery Region. When the time comes for recovery, the system is scaled up quickly to handle
the production load. The more scaled-up the warm standby is, the lower RTO and control plane
reliance will be. When fully scaled, this is known as hot standby.
• Multi-Region (multi-site) active-active (RPO near zero, RTO potentially zero): Your workload is
deployed to, and actively serving traffic from, multiple AWS Regions. This strategy requires you
to synchronize data across Regions. Possible conflicts caused by writes to the same record in two
different regional replicas must be avoided or handled, which can be complex. Data replication is
useful for data synchronization and will protect you against some types of disaster, but it will not
protect you against data corruption or destruction unless your solution also includes options for
point-in-time recovery.
Note
The difference between pilot light and warm standby can sometimes be difficult to
understand. Both include an environment in your recovery Region with copies of your
primary region assets. The distinction is that pilot light cannot process requests without
additional action taken first, while warm standby can handle traffic (at reduced capacity
levels) immediately. Pilot light will require you to turn on servers, possibly deploy
additional (non-core) infrastructure, and scale up, while warm standby only requires you
to scale up (everything is already deployed and running). Choose between these based on
your RTO and RPO needs.
When cost is a concern, and you wish to achieve similar RPO and RTO objectives to those
defined in the warm standby strategy, you could consider cloud-native solutions, like AWS
Elastic Disaster Recovery, that take the pilot light approach and offer improved RPO and
RTO targets.
Backup and restore is the least complex strategy to implement, but will require more time and
effort to restore the workload, leading to higher RTO and RPO. It is a good practice to always
make backups of your data, and copy these to another site (such as another AWS Region).
For more details on this strategy see Disaster Recovery (DR) Architecture on AWS, Part II: Backup
and Restore with Rapid Recovery.
Pilot light
With the pilot light approach, you replicate your data from your primary Region to your recovery
Region. Core resources used for the workload infrastructure are deployed in the recovery Region;
however, additional resources and any dependencies are still needed to make this a functional
stack. For example, in Figure 20, no compute instances are deployed.
Using warm standby or pilot light requires scaling up resources in the recovery Region. To verify
capacity is available when needed, consider the use of capacity reservations for EC2 instances.
If using AWS Lambda, then provisioned concurrency can provide runtime environments so that
they are prepared to respond immediately to your function's invocations.
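As a sketch of that preparation, the following call pre-warms a Lambda function in the recovery Region with provisioned concurrency. The Region, function name, alias, and concurrency value are placeholders.

# A minimal sketch of configuring provisioned concurrency in the recovery Region.
import boto3

lambda_recovery = boto3.client("lambda", region_name="us-west-2")  # placeholder recovery Region

def prewarm(function_name="checkout-api", alias="live", concurrency=100):
    lambda_recovery.put_provisioned_concurrency_config(
        FunctionName=function_name,
        Qualifier=alias,
        ProvisionedConcurrentExecutions=concurrency,
    )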
For more details on this strategy, see Disaster Recovery (DR) Architecture on AWS, Part III: Pilot
Light and Warm Standby.
Multi-site active/active
You can run your workload simultaneously in multiple Regions as part of a multi-site active/
active strategy. Multi-site active/active serves traffic from all regions to which it is deployed.
Customers may select this strategy for reasons other than DR. It can be used to increase
availability, or when deploying a workload to a global audience (to put the endpoint closer to
users and/or to deploy stacks localized to the audience in that region). As a DR strategy, if the
workload cannot be supported in one of the AWS Regions to which it is deployed, then that
Region is evacuated, and the remaining Regions are used to maintain availability. Multi-site
active/active is the most operationally complex of the DR strategies, and should only be selected
when business requirements necessitate it.
With all strategies, you must also mitigate against a data disaster. Continuous data replication
protects you against some types of disaster, but it may not protect you against data corruption
or destruction unless your strategy also includes versioning of stored data or options for point-
in-time recovery. You must also back up the replicated data in the recovery site to create point-
in-time backups in addition to the replicas.
When using multiple AZs within a single Region, your DR implementation uses multiple
elements of the above strategies. First you must create a high-availability (HA) architecture,
using multiple AZs as shown in Figure 23. This architecture makes use of a multi-site active/
active approach, as the Amazon EC2 instances and the Elastic Load Balancer have resources
deployed in multiple AZs, actively handling requests. The architecture also demonstrates hot
3. Assess the resources of your workload, and what their configuration will be in the recovery
Region prior to failover (during normal operation).
For infrastructure and AWS resources use infrastructure as code such as AWS CloudFormation
or third-party tools like Hashicorp Terraform. To deploy across multiple accounts and Regions
with a single operation you can use AWS CloudFormation StackSets. For Multi-site active/
active and Hot Standby strategies, the deployed infrastructure in your recovery Region has
the same resources as your primary Region. For Pilot Light and Warm Standby strategies, the
deployed infrastructure will require additional actions to become production ready. Using
CloudFormation parameters and conditional logic, you can control whether a deployed stack is
active or standby with a single template. When using Elastic Disaster Recovery, the service will
replicate and orchestrate the restoration of application configurations and compute resources.
All DR strategies require that data sources are backed up within the AWS Region, and then those
backups are copied to the recovery Region. AWS Backup provides a centralized view where
you can configure, schedule, and monitor backups for these resources. For Pilot Light, Warm
Standby, and Multi-site active/active, you should also replicate data from the primary Region
to data resources in the recovery Region, such as Amazon Relational Database Service (Amazon
RDS) DB instances or Amazon DynamoDB tables. These data resources are therefore live and
ready to serve requests in the recovery Region.
To learn more about how AWS services operate across Regions, see this blog series on Creating a
Multi-Region Application with AWS Services.
4. Determine and implement how you will make your recovery Region ready for failover when
needed (during a disaster event).
For multi-site active/active, failover means evacuating a Region, and relying on the remaining
active Regions. In general, those Regions are ready to accept traffic. For Pilot Light and Warm
Standby strategies, your recovery actions will need to deploy the missing resources, such as the
EC2 instances in Figure 20, plus any other missing resources.
For all of the above strategies you may need to promote read-only instances of databases to
become the primary read/write instance.
For backup and restore, restoring data from backup creates resources for that data such as EBS
volumes, RDS DB instances, and DynamoDB tables. You also need to restore the infrastructure
and deploy code. You can use AWS Backup to restore data in the recovery Region. See REL09-
BP01 Identify and back up all data that needs to be backed up, or reproduce the data from
is maintained. Therefore, the former read/write instance in the primary Region will become a
replica and receive updates from the recovery Region.
In cases where this is not automatic, you will need to re-establish the database in the primary
Region as a replica of the database in the recovery Region. In many cases this will involve
deleting the old primary database, and creating new replicas.
After a failover, if you can continue running in your recovery Region, consider making this the
new primary Region. You would still do all the above steps to make the former primary Region
into a recovery Region. Some organizations do a scheduled rotation, swapping their primary and
recovery Regions periodically (for example every three months).
All of the steps required to fail over and fail back should be maintained in a playbook that is
available to all members of the team, and is periodically reviewed.
When using Elastic Disaster Recovery, the service will assist in orchestrating and automating the
failback process. For more details, see Performing a failback.
Resources
• the section called “REL09-BP01 Identify and back up all data that needs to be backed up, or
reproduce the data from sources”
• the section called “REL11-BP04 Rely on the data plane and not the control plane during
recovery”
• the section called “REL13-BP01 Define recovery objectives for downtime and data loss”
Related documents:
Implementation guidance
A pattern to avoid is developing recovery paths that are rarely exercised. For example, you might
have a secondary data store that is used for read-only queries. When you write to a data store and
the primary fails, you might want to fail over to the secondary data store. If you don’t frequently
test this failover, you might find that your assumptions about the capabilities of the secondary
data store are incorrect. The capacity of the secondary, which might have been sufficient when you
last tested, might no longer be able to tolerate the load under this scenario. Our experience has
shown that the only error recovery that works is the path you test frequently. This is why having
a small number of recovery paths is best. You can establish recovery patterns and regularly test
them. If you have a complex or critical recovery path, you still need to regularly exercise that failure
in production to convince yourself that the recovery path works. In the example we just discussed,
you should fail over to the standby regularly, regardless of need.
Implementation steps
1. Engineer your workloads for recovery. Regularly test your recovery paths. Recovery-oriented
computing identifies the characteristics in systems that enhance recovery: isolation and
redundancy, system-wide ability to roll back changes, ability to monitor and determine health,
ability to provide diagnostics, automated recovery, modular design, and ability to restart.
Exercise the recovery path to verify that you can accomplish the recovery in the specified time
to the specified state. Use your runbooks during this recovery to document problems and find
solutions for them before the next test.
2. For Amazon EC2-based workloads, use AWS Elastic Disaster Recovery to implement and launch
drill instances for your DR strategy. AWS Elastic Disaster Recovery provides the ability to
efficiently run drills, which helps you prepare for a failover event. You can also frequently launch
your instances using Elastic Disaster Recovery for test and drill purposes without redirecting
the traffic.
Resources
Related documents:
Implementation guidance
• Ensure that your delivery pipelines deliver to both your primary and backup sites. Delivery
pipelines for deploying applications into production must distribute to all the specified disaster
recovery strategy locations, including dev and test environments.
• Permit AWS Config to track potential drift locations. Use AWS Config rules to create systems that
enforce your disaster recovery strategies and generate alerts when they detect drift.
• Remediating Noncompliant AWS Resources by AWS Config Rules
• AWS Systems Manager Automation
• Use AWS CloudFormation to deploy your infrastructure. AWS CloudFormation can detect drift
between what your CloudFormation templates specify and what is actually deployed.
• AWS CloudFormation: Detect Drift on an Entire CloudFormation Stack
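The following is a minimal sketch of kicking off drift detection for a stack and polling the result, so that drift in a disaster recovery Region can feed your alerting. The stack name is a placeholder.

# A minimal sketch of running CloudFormation drift detection and reading the result.
import time
import boto3

cfn = boto3.client("cloudformation")

def detect_drift(stack_name="dr-recovery-stack"):
    detection_id = cfn.detect_stack_drift(StackName=stack_name)["StackDriftDetectionId"]
    while True:
        status = cfn.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
            return status.get("StackDriftStatus", "UNKNOWN")  # e.g. IN_SYNC or DRIFTED
        time.sleep(5)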
Resources
Related documents:
Related videos:
• AWS re:Invent 2018: Architecture Patterns for Multi-Region Active-Active Applications (ARC209-
R2)
Use AWS or third-party tools to automate system recovery and route traffic to the DR site or
Region.
Related videos:
• AWS re:Invent 2018: Architecture Patterns for Multi-Region Active-Active Applications (ARC209-
R2)
Performance efficiency
The Performance Efficiency pillar includes the ability to use cloud resources efficiently to meet
performance requirements, and to maintain that efficiency as demand changes and technologies
evolve. You can find prescriptive guidance on implementation in the Performance Efficiency Pillar
whitepaper.
• Architecture selection
• Data management
Architecture selection
Questions
• PERF 1. How do you select appropriate cloud resources and architecture for your workload?
Implementation guidance
AWS continually releases new services and features that can improve performance and reduce
the cost of cloud workloads. Staying up-to-date with these new services and features is crucial for
maintaining performance efficiency in the cloud. Modernizing your workload architecture also helps
you accelerate productivity, drive innovation, and unlock more growth opportunities.
Implementation steps
• Inventory your workload software and architecture for related services. Decide which category of
products to learn more about.
• Explore AWS offerings to identify and learn about the relevant services and configuration
options that can help you improve performance and reduce cost and operational complexity.
• Amazon Web Services Cloud
• AWS Academy
• What’s New with AWS?
• AWS Blog
• AWS Skill Builder
• AWS Events and Webinars
• AWS Training and Certifications
• AWS YouTube Channel
• AWS Workshops
• AWS Communities
• Use Amazon Q to get relevant information and advice about services.
• Use sandbox (non-production) environments to learn and experiment with new services without
incurring extra cost.
• Continually learn about new cloud services and features.
Resources
Related documents:
Benefits of establishing this best practice: Using guidance from a cloud provider or an
appropriate partner can help you to make the right architectural choices for your workload and
give you confidence in your decisions.
Implementation guidance
AWS offers a wide range of guidance, documentation, and resources that can help you build
and manage efficient cloud workloads. AWS documentation provides code samples, tutorials,
and detailed service explanations. In addition to documentation, AWS provides training and
certification programs, solutions architects, and professional services that can help customers
explore different aspects of cloud services and implement efficient cloud architecture on AWS.
Leverage these resources to gain valuable knowledge and insights into best practices, save time,
and achieve better outcomes in the AWS Cloud.
Implementation steps
• Review AWS documentation and guidance and follow the best practices. These resources can
help you effectively choose and configure services and achieve better performance.
• AWS documentation (like user guides and whitepapers)
• AWS Blog
• AWS Training and Certifications
• AWS YouTube Channel
• Join AWS partner events (like AWS Global Summits, AWS re:Invent, user groups, and workshops)
to learn from AWS experts about best practices for using AWS services.
• Learn step-by-step with an AWS Partner Learning Plan
• AWS Events and Webinars
• AWS Workshops
• AWS Communities
• Reach out to AWS for assistance when you need additional guidance or product information.
AWS Solutions Architects and AWS Professional Services provide guidance for solution
implementation. AWS Partners provide AWS expertise to help you unlock agility and innovation
for your business.
• Use AWS Support if you need technical support to use a service effectively. Our Support plans
are designed to give you the right mix of tools and access to expertise so that you can be
Benefits of establishing this best practice: Factoring cost into your decision making allows you to
use more efficient resources and explore other investments.
Implementation guidance
Optimizing workloads for cost can improve resource utilization and avoid waste in a cloud
workload. Factoring cost into architectural decisions usually includes right-sizing workload
components and enabling elasticity, which results in improved cloud workload performance
efficiency.
Implementation steps
• Establish cost objectives like budget limits for your cloud workload.
• Identify the key components (like instances and storage) that drive cost of your workload.
You can use AWS Pricing Calculator and AWS Cost Explorer to identify key cost drivers in your
workload.
• Understand pricing models in the cloud, such as On-Demand, Reserved Instances, Savings Plans,
and Spot Instances.
• Use Well-Architected cost optimization best practices to optimize these key components for cost.
• Continually monitor and analyze cost to identify cost optimization opportunities in your
workload.
• Use AWS Budgets to get alerts for unacceptable costs.
• Use AWS Compute Optimizer or AWS Trusted Advisor to get cost optimization
recommendations.
• Use AWS Cost Anomaly Detection to get automated cost anomaly detection and root cause
analysis.
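The following is a minimal sketch of identifying last month's top cost-driving services with the Cost Explorer API, the kind of data you would compare against your cost objectives. It assumes Cost Explorer is enabled for the account.

# A minimal sketch of listing last month's top cost drivers by service.
import boto3
from datetime import date, timedelta

ce = boto3.client("ce")

def top_cost_drivers(n=5):
    end = date.today().replace(day=1)                 # first day of this month
    start = (end - timedelta(days=1)).replace(day=1)  # first day of last month
    result = ce.get_cost_and_usage(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    groups = result["ResultsByTime"][0]["Groups"]
    costs = [(g["Keys"][0], float(g["Metrics"]["UnblendedCost"]["Amount"])) for g in groups]
    return sorted(costs, key=lambda item: item[1], reverse=True)[:n]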
Resources
Related documents:
Common anti-patterns:
• You assume that all performance gains should be implemented, even if there are tradeoffs for
implementation.
• You only evaluate changes to workloads when a performance issue has reached a critical point.
Benefits of establishing this best practice: When you are evaluating potential performance-
related improvements, you must decide if the tradeoffs for the changes are acceptable with
the workload requirements. In some cases, you may have to implement additional controls to
compensate for the tradeoffs.
Implementation guidance
Identify critical areas in your architecture in terms of performance and customer impact. Determine
how you can make improvements, what trade-offs those improvements bring, and how they
impact the system and the user experience. For example, implementing caching can
dramatically improve performance, but it requires a clear strategy for how and when to update or
invalidate cached data to prevent incorrect system behavior.
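The following is a minimal sketch of that tradeoff: a read-through cache with a time-to-live plus explicit invalidation, so staleness is bounded and writes can force a refresh. The loader function is a placeholder for a call to your data source.

# A minimal sketch of a read-through cache with TTL and explicit invalidation.
import time

class TTLCache:
    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expires_at)

    def get(self, key, loader):
        value, expires_at = self._store.get(key, (None, 0))
        if time.time() < expires_at:
            return value          # fast path: served from cache
        value = loader(key)       # slow path: read through to the source
        self._store[key] = (value, time.time() + self.ttl)
        return value

    def invalidate(self, key):
        # Call this on writes so readers do not see stale data for a full TTL.
        self._store.pop(key, None)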
Implementation steps
Resources
Related documents:
Deploy your workload using policies or reference architectures. Integrate the services into your
cloud deployment, then use your performance tests to verify that you can continue to meet your
performance requirements.
Implementation steps
Resources
Related documents:
Related videos:
• This is my Architecture
• AWS re:Invent 2022 - Accelerate value for your business with SAP & AWS reference architecture
Related examples:
• AWS Samples
• AWS SDK Examples
• Define the objectives, baseline, testing scenarios, metrics (like CPU utilization, latency, or
throughput), and KPIs for your benchmark.
• Focus on user requirements in terms of user experience and factors such as response time and
accessibility.
• Identify a benchmarking tool that is suitable for your workload. You can use AWS services like
Amazon CloudWatch or a third-party tool that is compatible with your workload.
• Configure and instrument:
• Set up your environment and configure your resources.
• Implement monitoring and logging to capture testing results.
• Benchmark and monitor:
• Perform your benchmark tests and monitor the metrics during the test.
• Analyze and document:
• Document your benchmarking process and findings.
• Analyze the results to identify bottlenecks, trends, and areas of improvement.
• Use test results to make architectural decisions and adjust your workload. This may include
changing services or adopting new features.
• Optimize and repeat:
• Adjust resource configurations and allocations based on your benchmarks.
• Retest your workload after the adjustment to validate your improvements.
• Document your learnings, and repeat the process to identify other areas of improvement.
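The following is a minimal sketch of the benchmark-and-monitor step above: run a request function repeatedly and report latency percentiles to compare against your KPIs. The make_request function is a placeholder for a call against your workload.

# A minimal sketch of a latency benchmark reporting p50/p90/p99 in milliseconds.
import time
import statistics

def benchmark(make_request, iterations=200):
    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        make_request()
        samples_ms.append((time.perf_counter() - start) * 1000)
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    return {
        "p50_ms": round(cuts[49], 1),
        "p90_ms": round(cuts[89], 1),
        "p99_ms": round(cuts[98], 1),
    }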
Resources
Related documents:
Implementation guidance
Use internal experience and knowledge of the cloud or external resources such as published use
cases or whitepapers to choose resources and services in your architecture. You should have a well-
defined process that encourages experimentation and benchmarking with the services that could
be used in your workload.
Backlogs for critical workloads should consist of not just user stories which deliver functionality
relevant to business and users, but also technical stories which form an architecture runway for
the workload. This runway is informed by new advancements in technology and new services and
adopts them based on data and proper justification. This verifies that the architecture remains
future-proof and does not stagnate.
Implementation steps
• Create an architecture runway or a technology backlog which is prioritized along with the
functional backlog.
• Evaluate and assess different cloud services (for more detail, see PERF01-BP01 Learn about and
understand available cloud services and features).
• Explore different architectural patterns, like microservices or serverless, that meet your
performance requirements (for more detail, see PERF01-BP02 Use guidance from your cloud
provider or an appropriate partner to learn about architecture patterns and best practices).
• Consult other teams, architecture diagrams, and resources, such as AWS Solution Architects, AWS
Architecture Center, and AWS Partner Network, to help you choose the right architecture for
your workload.
• Define performance metrics like throughput and response time that can help you evaluate the
performance of your workload.
• Experiment and use defined metrics to validate the performance of the selected architecture.
• Continually monitor and make adjustments as needed to maintain the optimal performance of
your architecture.
components and allow different features to improve performance. Selecting the wrong compute
choice for an architecture can lead to lower performance efficiency.
Best practices
• PERF02-BP01 Select the best compute options for your workload
• PERF02-BP02 Understand the available compute configuration and features
• PERF02-BP03 Collect compute-related metrics
• PERF02-BP04 Configure and right-size compute resources
• PERF02-BP05 Scale your compute resources dynamically
• PERF02-BP06 Use optimized hardware-based compute accelerators
Selecting the most appropriate compute option for your workload allows you to improve
performance, reduce unnecessary infrastructure costs, and lower the operational efforts required
to maintain your workload.
Common anti-patterns:
• You use the same compute option that was used on premises.
• You lack awareness of the cloud compute options, features, and solutions, and how those
solutions might improve your compute performance.
• You over-provision an existing compute option to meet scaling or performance requirements
when an alternative compute option would align to your workload characteristics more precisely.
Benefits of establishing this best practice: By identifying the compute requirements and
evaluating against the options available, you can make your workload more resource efficient.
Implementation guidance
To optimize your cloud workloads for performance efficiency, it is important to select the most
appropriate compute options for your use case and performance requirements. AWS provides a
variety of compute options that cater to different workloads in the cloud. For instance, you can
use Amazon EC2 to launch and manage virtual servers, AWS Lambda to run code without having
to provision or manage servers, Amazon ECS or Amazon EKS to run and manage containers, or
• Evaluate cost (like hourly charge or data transfer) and management overhead (like patching and
scaling) associated with each compute option.
• Perform experiments and benchmarking in a non-production environment to identify which
compute option can best address your workload requirements.
• Once you have experimented and identified your new compute solution, plan your migration and
validate your performance metrics.
• Use AWS monitoring tools like Amazon CloudWatch and optimization services like AWS Compute
Optimizer to continually optimize your compute resources based on real-world usage patterns.
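The following is a minimal sketch of pulling AWS Compute Optimizer findings so right-sizing opportunities are reviewed regularly rather than ad hoc. It assumes Compute Optimizer is opted in for the account.

# A minimal sketch of summarizing Compute Optimizer EC2 recommendations.
import boto3

co = boto3.client("compute-optimizer")

def rightsizing_findings():
    findings = []
    response = co.get_ec2_instance_recommendations()
    for rec in response.get("instanceRecommendations", []):
        options = rec.get("recommendationOptions", [])
        top_option = options[0] if options else {}
        findings.append({
            "instance": rec["instanceArn"],
            "finding": rec["finding"],  # e.g. Overprovisioned, Optimized
            "suggested_type": top_option.get("instanceType"),
        })
    return findings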
Resources
Related documents:
• You do not evaluate compute options or available instance families against workload
characteristics.
• You over-provision compute resources to meet peak-demand requirements.
Benefits of establishing this best practice: Be familiar with AWS compute features and
configurations so that you can use a compute solution optimized to meet your workload
characteristics and needs.
Implementation guidance
Each compute solution has unique configurations and features available to support different
workload characteristics and requirements. Learn how these options complement your workload,
and determine which configuration options are best for your application. Examples of these
options include instance family, sizes, features (GPU, I/O), bursting, time-outs, function sizes,
container instances, and concurrency. If your workload has been using the same compute option
for more than four weeks and you anticipate that the characteristics will remain the same in the
future, you can use AWS Compute Optimizer to find out if your current compute option is suitable
for the workload from a CPU and memory perspective.
Implementation steps
• Review AWS documentation and best practices to learn about recommended configuration
options that can help improve compute performance. Here are some key configuration options
to consider:
Resources
Related documents:
Related videos:
• AWS re:Invent 2023 – AWS Graviton: The best price performance for your AWS workloads
• AWS re:Invent 2023 – New Amazon EC2 generative AI capabilities in AWS Management Console
• AWS re:Invent 2023 – What's new with Amazon EC2
• AWS re:Invent 2023 – Smart savings: Amazon EC2 cost-optimization strategies
• AWS re:Invent 2021 – Powering next-gen Amazon EC2: Deep dive on the Nitro System
• AWS re:Invent 2019 – Amazon EC2 foundations
• AWS re:Invent 2022 – Optimizing Amazon EKS for performance and cost on AWS
Related examples:
into utilization levels or performance bottlenecks. Use these metrics as part of a data-driven
approach to actively tune and optimize your workload's resources. In an ideal case, you should
collect all metrics related to your compute resources in a single platform with retention policies
implemented to support cost and operational goals.
Implementation steps
• Identify which performance-related metrics are relevant to your workload. You should collect
metrics around resource utilization and the way your cloud workload is operating (like response
time and throughput).
• Choose and set up the right logging and monitoring solution for your workload.
• Define the required filter and aggregation for the metrics based on your workload requirements.
• Quantify custom application metrics with Amazon CloudWatch Logs and metric filters
• If required, create alarms and notifications for your metrics to help you proactively respond to
performance-related issues.
• Create alarms for custom metrics using Amazon CloudWatch anomaly detection
• Create metrics and alarms for specific web pages with Amazon CloudWatch RUM
• OpenTelemetry Collector
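As a minimal sketch of the custom-metric and alarm steps in the list above (using Boto3), the following publishes an application latency metric and creates an alarm on it; the namespace, metric name, dimensions, and threshold are hypothetical examples.

```python
# Minimal sketch: publish a custom application metric and alarm on it with Boto3.
# The namespace, metric name, dimensions, and threshold are hypothetical examples.
import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish a latency measurement recorded by the application.
cloudwatch.put_metric_data(
    Namespace="MyApp",                      # hypothetical namespace
    MetricData=[{
        "MetricName": "CheckoutLatency",
        "Dimensions": [{"Name": "Service", "Value": "checkout"}],
        "Value": 412.0,
        "Unit": "Milliseconds",
    }],
)

# Alarm when average latency breaches the example threshold for three periods.
cloudwatch.put_metric_alarm(
    AlarmName="checkout-latency-high",
    Namespace="MyApp",
    MetricName="CheckoutLatency",
    Dimensions=[{"Name": "Service", "Value": "checkout"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    Threshold=500.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    # AlarmActions=["arn:aws:sns:..."],  # add an SNS topic ARN for notifications
)
```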
Configure and right-size compute resources to match your workload’s performance requirements
and avoid under- or over-utilized resources.
Common anti-patterns:
Benefits of establishing this best practice: Right-sizing compute resources ensures optimal
operation in the cloud by avoiding over-provisioning and under-provisioning resources. Properly
sizing compute resources typically results in better performance and enhanced customer
experience, while also lowering cost.
Implementation guidance
Right-sizing allows organizations to operate their cloud infrastructure in an efficient and cost-
effective manner while addressing their business needs. Over-provisioning cloud resources can lead
to extra costs, while under-provisioning can result in poor performance and a negative customer
experience. AWS provides tools such as AWS Compute Optimizer and AWS Trusted Advisor that use
historical data to provide recommendations to right-size your compute resources.
Implementation steps
Related examples:
Use the elasticity of the cloud to scale your compute resources up or down dynamically to match
your needs and avoid over- or under-provisioning capacity for your workload.
Common anti-patterns:
Benefits of establishing this best practice: Configuring and testing the elasticity of compute
resources can help you save money, maintain performance benchmarks, and improve reliability as
traffic changes.
Implementation guidance
AWS provides the flexibility to scale your resources up or down dynamically through a variety of
scaling mechanisms in order to meet changes in demand. Combined with compute-related metrics,
dynamic scaling allows a workload to automatically respond to changes and use the optimal set of
compute resources to achieve its goal.
You can use a number of different approaches to match supply of resources with demand.
• Target-tracking approach: Monitor your scaling metric and automatically increase or decrease
capacity as you need it (a minimal sketch follows this list).
• Predictive scaling: Scale in anticipation of daily and weekly trends.
• Schedule-based approach: Set your own scaling schedule according to predictable load changes.
• Verify that workload deployments can handle both scaling events (up and down). As an example,
you can use Activity history to verify a scaling activity for an Auto Scaling group.
• Evaluate your workload for predictable patterns and proactively scale as you anticipate predicted
and planned changes in demand. With predictive scaling, you can eliminate the need to
overprovision capacity. For more detail, see Predictive Scaling with Amazon EC2 Auto Scaling.
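As a minimal sketch of the target-tracking approach in the list above (using Boto3), the following attaches a CPU-based target-tracking policy to an Auto Scaling group; the group name and target value are hypothetical examples.

```python
# Minimal sketch: attach a target-tracking scaling policy to an Auto Scaling group
# with Boto3. The group name and target value are hypothetical examples.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",          # hypothetical Auto Scaling group
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 50.0,                 # keep average CPU near 50%
    },
)
```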
Resources
Related documents:
Related videos:
• AWS re:Invent 2023 – AWS Graviton: The best price performance for your AWS workloads
• AWS re:Invent 2023 – New Amazon EC2 generative AI capabilities in AWS Management Console
• AWS re:Invent 2023 – What’s new with Amazon EC2
• AWS re:Invent 2023 – Smart savings: Amazon EC2 cost-optimization strategies
• AWS re:Invent 2021 – Powering next-gen Amazon EC2: Deep dive on the Nitro System
• AWS re:Invent 2019 – Amazon EC2 foundations
Related examples:
• Optimize the code, network operation, and settings of hardware accelerators to make sure that
underlying hardware is fully utilized.
• Optimize GPU settings
• Optimizing I/O for GPU performance tuning of deep learning training in Amazon SageMaker
Resources
Related documents:
• Accelerated Computing
• How do I choose the appropriate Amazon EC2 instance type for my workload?
• Choose the best AI accelerator and model compilation for computer vision inference with
Amazon SageMaker
Related videos:
• AWS re:Invent 2021 - How to select Amazon Elastic Compute Cloud GPU instances for deep
learning
• AWS re:Invent 2022 - [NEW LAUNCH!] Introducing AWS Inferentia2-based Amazon EC2 Inf2
instances
• AWS re:Invent 2022 - Accelerate deep learning and innovate faster with AWS Trainium
• AWS re:Invent 2022 - Deep learning on AWS with NVIDIA: From training to deployment
Common anti-patterns:
• You stick to one data store because there is internal experience and knowledge of one particular
type of database solution.
• You assume that all workloads have similar data storage and access requirements.
• You have not implemented a data catalog to inventory your data assets.
Benefits of establishing this best practice: Understanding data characteristics and requirements
allows you to determine the most efficient and performant storage technology appropriate for
your workload needs.
Implementation guidance
When selecting and implementing data storage, make sure that the querying, scaling, and storage
characteristics support the workload data requirements. AWS provides numerous data storage
and database technologies including block storage, object storage, streaming storage, file system,
relational, key-value, document, in-memory, graph, time series, and ledger databases. Each data
management solution has options and configurations available to you to support your use-cases
and data models. By understanding data characteristics and requirements, you can break away
from monolithic storage technology and restrictive, one-size-fits-all approaches to focus on
managing data appropriately.
Implementation steps
• Conduct an inventory of the various data types that exist in your workload.
• Understand and document data characteristics and requirements, including:
• Data type (unstructured, semi-structured, relational)
• Data volume and growth
• Data durability: persistent, ephemeral, transient
• ACID (atomicity, consistency, isolation, durability) requirements
• Data access patterns (read-heavy or write-heavy)
• Latency
• Throughput
• IOPS (input/output operations per second)
How will the storage requirements change over time? How does this impact scalability?
• Serverless databases such as DynamoDB and Amazon Quantum Ledger Database (Amazon QLDB) will scale dynamically.
• Relational databases have upper bounds on provisioned storage, and often must be horizontally partitioned using mechanisms such as sharding once they reach these limits.
What is the proportion of read queries in relation to write queries? Would caching be likely to improve performance?
• Read-heavy workloads can benefit from a caching layer, like ElastiCache or DAX if the database is DynamoDB.
• Reads can also be offloaded to read replicas with relational databases such as Amazon RDS.
What is the operational expectation for the database? Is moving to managed services a primary concern?
• Leveraging Amazon RDS instead of Amazon EC2, and DynamoDB or Amazon DocumentDB instead of self-hosting a NoSQL database, can reduce operational overhead.
Resources
Related documents:
Understand and evaluate the various features and configuration options available for your data
stores to optimize storage space and performance for your workload.
Common anti-patterns:
• You only use one storage type, such as Amazon EBS, for all workloads.
• You use provisioned IOPS for all workloads without real-world testing against all storage tiers.
• You are not aware of the configuration options of your chosen data management solution.
• You rely solely on increasing instance size without looking at other available configuration
options.
• You are not testing the scaling characteristics of your data store.
Benefits of establishing this best practice: By exploring and experimenting with the data store
configurations, you may be able to reduce the cost of infrastructure, improve performance, and
lower the effort required to maintain your workloads.
Implementation guidance
A workload could have one or more data stores used based on data storage and access
requirements. To optimize your performance efficiency and cost, you must evaluate data access
patterns to determine the appropriate data store configurations. While you explore data store
options, take into consideration various aspects such as the storage options, memory, compute,
read replica, consistency requirements, connection pooling, and caching options. Experiment with
these various configuration options to improve performance efficiency metrics.
Implementation steps
• Understand the current configurations (like instance type, storage size, or database engine
version) of your data store.
• Review AWS documentation and best practices to learn about recommended configuration
options that can help improve the performance of your data store. Key data store options to
consider are the following:
Scaling writes (like partition key sharding or introducing a queue)
• For relational databases, you can increase the size of the instance to accommodate an increased workload or increase the provisioned IOPS to allow for an increased throughput to the underlying storage.
• You can also introduce a queue in front of your database rather than writing directly to the database. This pattern allows you to decouple the ingestion from the database and control the flow rate so the database does not get overwhelmed.
• Batching your write requests rather than creating many short-lived transactions can help improve throughput in high-write volume relational databases.
• Serverless databases like DynamoDB can scale the write throughput automatically or by adjusting the provisioned write capacity units (WCU), depending on the capacity mode.
• You can still run into issues with hot partitions when you reach the throughput limits for a given partition key. This can be mitigated by choosing a more evenly distributed partition key or by write-sharding the partition key.
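As a minimal sketch of the write-sharding technique described above (using Boto3), the following appends a random suffix to a hot partition key so that writes spread across multiple DynamoDB partitions. The table name, attribute names, and shard count are hypothetical, and readers must query all suffixes and merge the results.

```python
# Minimal sketch of write sharding for DynamoDB: append a random suffix to a hot
# partition key so writes spread across several partitions. Table and attribute
# names are hypothetical placeholders.
import random
import boto3

NUM_SHARDS = 10
table = boto3.resource("dynamodb").Table("events")   # hypothetical table

def put_event(device_id: str, timestamp: str, payload: dict) -> None:
    shard = random.randint(0, NUM_SHARDS - 1)
    table.put_item(Item={
        "pk": f"{device_id}#{shard}",   # sharded partition key
        "sk": timestamp,                # sort key
        **payload,
    })
```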
Related videos:
• AWS re:Invent 2023: Improve Amazon Elastic Block Store efficiency and be more cost-efficient
• AWS re:Invent 2023: Optimize storage price and performance with Amazon Simple Storage
Service
• AWS re:Invent 2023: Building and optimizing a data lake on Amazon Simple Storage Service
• AWS re:Invent 2023: What's new with AWS file storage
• AWS re:Invent 2023: Dive deep into Amazon DynamoDB
Related examples:
disk storage, disk I/O, cache hit ratio, and network inbound and outbound metrics, while the data
store metrics might include transactions per second, top queries, average query rates, response
times, index usage, table locks, query timeouts, and number of connections open. This data is
crucial to understand how the workload is performing and how the data management solution is
used. Use these metrics as part of a data-driven approach to tune and optimize your workload's
resources.
Use tools, libraries, and systems that record performance measurements related to database
performance.
Implementation steps
• Identify the key performance metrics for your data store to track.
• Use an approved logging and monitoring solution to collect these metrics. Amazon CloudWatch
can collect metrics across the resources in your architecture. You can also collect and publish
custom metrics to surface business or derived metrics. Use CloudWatch or third-party solutions
to set alarms that indicate when thresholds are breached.
• Check if data store monitoring can benefit from a machine learning solution that detects
performance anomalies.
• Amazon DevOps Guru for Amazon RDS provides visibility into performance issues and makes
recommendations for corrective actions.
Implement strategies to optimize data and improve data queries to enable more scalability and
efficient performance for your workload.
Common anti-patterns:
Benefits of establishing this best practice: Optimizing data and query performance results in
more efficiency, lower cost, and improved user experience.
Implementation guidance
Data optimization and query tuning are critical aspects of performance efficiency in a data store,
as they impact the performance and responsiveness of the entire cloud workload. Unoptimized
queries can result in greater resource usage and bottlenecks, which reduce the overall efficiency of
a data store.
Data optimization includes several techniques to ensure efficient data storage and access, which
also helps to improve query performance in a data store. Key strategies include data partitioning,
data compression, and data denormalization, which optimize data for both storage and access.
Implementation steps
• Understand and analyze the critical data queries which are performed in your data store.
• Identify the slow-running queries in your data store and use query plans to understand their
current state.
• Analyzing the query plan in Amazon Redshift
Related examples:
Implement access patterns that can benefit from caching data for fast retrieval of frequently
accessed data.
Common anti-patterns:
Benefits of establishing this best practice: Storing data in a cache can improve read latency, read
throughput, user experience, and overall efficiency, as well as reduce costs.
Implementation guidance
A cache is a software or hardware component aimed at storing data so that future requests for the
same data can be served faster or more efficiently. The data stored in a cache can be reconstructed
if lost by repeating an earlier calculation or fetching it from another data store.
Data caching can be one of the most effective strategies to improve your overall application
performance and reduce burden on your underlying primary data sources. Data can be cached
at multiple levels in the application, such as within the application making remote calls, known
as client-side caching, or by using a fast secondary service for storing the data, known as remote
caching.
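As a minimal sketch of the remote caching (cache-aside) pattern described above, the following uses the redis-py client against an ElastiCache (Redis OSS) endpoint; the endpoint, key format, TTL, and the primary-store lookup are hypothetical placeholders.

```python
# Minimal sketch of the cache-aside pattern against an ElastiCache (Redis OSS)
# endpoint using redis-py. The endpoint, key format, TTL, and the primary-store
# lookup below are hypothetical placeholders.
import json
import redis

cache = redis.Redis(host="my-cache.example.cache.amazonaws.com", port=6379)

def load_product_from_db(product_id: str) -> dict:
    # Placeholder for a query against the primary data store (for example, Amazon RDS).
    return {"id": product_id, "name": "example"}

def get_product(product_id: str) -> dict:
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:                      # cache hit: skip the primary data store
        return json.loads(cached)

    product = load_product_from_db(product_id)  # cache miss: read from the primary store
    cache.set(key, json.dumps(product), ex=300) # store with a 5-minute TTL
    return product
```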
Client-side caching
• Enable features such as automatic connection retries, exponential backoff, client-side timeouts,
and connection pooling in the client, if available, as they can improve performance and
reliability.
• Best practices: Redis clients and Amazon ElastiCache (Redis OSS)
• Monitor cache hit rate with a goal of 80% or higher. Lower values may indicate insufficient cache
size or an access pattern that does not benefit from caching.
• Which metrics should I monitor?
• Best practices for monitoring Redis workloads on Amazon ElastiCache
• Monitoring best practices with Amazon ElastiCache (Redis OSS) using Amazon CloudWatch
• Implement data replication to offload reads to multiple instances and improve data read
performance and availability.
Resources
Related documents:
Related videos:
Related examples:
• You use on-premises concepts and strategies for networking solutions in the cloud.
Benefits of establishing this best practice: Understanding how networking impacts workload
performance helps you identify potential bottlenecks, improve user experience, increase reliability,
and lower operational maintenance as the workload changes.
Implementation guidance
The network is responsible for the connectivity between application components, cloud services,
edge networks, and on-premises data, and therefore it can heavily impact workload performance.
In addition to workload performance, user experience can be also impacted by network latency,
bandwidth, protocols, location, network congestion, jitter, throughput, and routing rules.
Have a documented list of networking requirements from the workload including latency, packet
size, routing rules, protocols, and supporting traffic patterns. Review the available networking
solutions and identify which service meets your workload networking characteristics. Cloud-based
networks can be quickly rebuilt, so evolving your network architecture over time is necessary to
improve performance efficiency.
Implementation steps
• Define and document networking performance requirements, including metrics such as network
latency, bandwidth, protocols, locations, traffic patterns (spikes and frequency), throughput,
encryption, inspection, and routing rules.
• Learn about key AWS networking services like VPCs, AWS Direct Connect, Elastic Load Balancing
(ELB), and Amazon Route 53.
• Capture the following key networking characteristics:
Related videos:
Related examples:
Evaluate networking features in the cloud that may increase performance. Measure the impact of
these features through testing, metrics, and analysis. For example, take advantage of network-level
features that are available to reduce latency, network distance, or jitter.
• Use an existing configuration management database (CMDB) tool or a service such as AWS
Config to create an inventory of your workload and how it’s configured.
• If this is an existing workload, identify and document the benchmark for your performance
metrics, focusing on the bottlenecks and areas to improve. Performance-related networking
metrics will differ per workload based on business requirements and workload characteristics.
As a start, these metrics might be important to review for your workload: bandwidth, latency,
packet loss, jitter, and retransmits.
• If this is a new workload, perform load tests to identify performance bottlenecks.
• For the performance bottlenecks you identify, review the configuration options for your
solutions to identify performance improvement opportunities. Check out the following key
networking options and features:
Related videos:
• AWS re:Invent 2023 – Ready for what's next? Designing networks for growth and flexibility
• AWS re:Invent 2023 – Advanced VPC designs and new capabilities
• AWS re:Invent 2023 – A developer's guide to cloud networking
• AWS re:Invent 2022 – Dive deep on AWS networking infrastructure
• AWS re:Invent 2019 – Connectivity to AWS and hybrid AWS network architectures
• AWS re:Invent 2018 – Optimizing Network Performance for Amazon EC2 Instances
• AWS Global Accelerator
Related examples:
When hybrid connectivity is required to connect on-premises and cloud resources, provision
adequate bandwidth to meet your performance requirements. Estimate the bandwidth and latency
requirements for your hybrid workload. These numbers will drive your sizing requirements.
Common anti-patterns:
• AWS Direct Connect provides dedicated connectivity to the AWS environment, from 50 Mbps
up to 100 Gbps, using either dedicated connections or hosted connections. This gives you
managed and controlled latency and provisioned bandwidth so your workload can connect
efficiently to other environments. Using AWS Direct Connect partners, you can have end-to-
end connectivity from multiple environments, providing an extended network with consistent
performance. AWS supports scaling Direct Connect connection bandwidth using native 100
Gbps connections, link aggregation groups (LAG), or BGP equal-cost multipath (ECMP).
• The AWS Site-to-Site VPN provides a managed VPN service supporting internet protocol
security (IPsec). When a VPN connection is created, each VPN connection includes two tunnels
for high availability.
• If you decide to use AWS Direct Connect, select the appropriate bandwidth for your
connectivity.
• If you are using an AWS Site-to-Site VPN across multiple locations to connect to an AWS
Region, use an accelerated Site-to-Site VPN connection for the opportunity to improve
network performance.
• If your network design consists of IPSec VPN connection over AWS Direct Connect, consider
using Private IP VPN to improve security and achieve segmentation. AWS Site-to-Site Private
IP VPN is deployed on top of a transit virtual interface (VIF).
• AWS Direct Connect SiteLink allows creating low-latency and redundant connections between
your data centers worldwide by sending data over the fastest path between AWS Direct
Connect locations, bypassing AWS Regions.
• Validate your connectivity setup before deploying to production. Perform security and
performance testing to ensure it meets your bandwidth, reliability, latency, and compliance
requirements.
• Regularly monitor your connectivity performance and usage and optimize if required.
Distribute traffic across multiple resources or services to allow your workload to take advantage
of the elasticity that the cloud provides. You can also use load balancing to offload encryption
termination, improve performance and reliability, and manage and route traffic effectively.
Common anti-patterns:
• You don’t consider your workload requirements when choosing the load balancer type.
• You don’t leverage the load balancer features for performance optimization.
• The workload is exposed directly to the internet without a load balancer.
• You route all internet traffic through existing load balancers.
• You use generic TCP load balancing and make each compute node handle SSL encryption.
Benefits of establishing this best practice: A load balancer handles the varying load of your
application traffic in a single Availability Zone or across multiple Availability Zones and enables
high availability, automatic scaling, and better utilization for your workload.
Implementation guidance
Load balancers act as the entry point for your workload, from which point they distribute the
traffic to your backend targets, such as compute instances or containers, to improve utilization.
Choosing the right load balancer type is the first step to optimize your architecture. Start by listing
your workload characteristics, such as protocol (like TCP, HTTP, TLS, or WebSockets), target type
(like instances, containers, or serverless), application requirements (like long running connections,
user authentication, or stickiness), and placement (like Region, Local Zone, Outpost, or zonal
isolation).
AWS provides multiple models for your applications to use load balancing. Application Load
Balancer is best suited for load balancing of HTTP and HTTPS traffic and provides advanced
request routing targeted at the delivery of modern application architectures, including
microservices and containers.
Implementation steps
• Define your load balancing requirements including traffic volume, availability and application
scalability.
• Use Network Load Balancer for non-HTTP workloads that run on TCP or UDP.
• Use a combination of both (ALB as a target of NLB) if you want to leverage features of both
products. For example, you can do this if you want to use the static IPs of NLB together with
HTTP header based routing from ALB, or if you want to expose your HTTP workload to an AWS
PrivateLink.
• Configure HTTPS/TLS listeners with both Application Load Balancer and Network Load
Balancer integrated with AWS Certificate Manager.
• Note that some workloads may require end-to-end encryption for compliance reasons. In this
case, it is a requirement to allow encryption at the targets.
• Least outstanding requests: Use to achieve a better load distribution to your backend targets
for cases when the requests for your application vary in complexity or your targets vary in
processing capability.
• Round robin: Use when the requests and targets are similar, or if you need to distribute
requests equally among targets.
• Turn off cross-zone load balancing (zonal isolation) for latency improvements and isolated zonal
failure domains. Cross-zone load balancing is turned off by default on Network Load Balancer,
and on Application Load Balancer you can turn it off per target group (a minimal sketch follows
this list).
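As a minimal sketch of the HTTPS listener and cross-zone settings from the steps above (using Boto3), the following creates an HTTPS listener backed by an AWS Certificate Manager certificate and turns cross-zone load balancing off at the target group level; all ARNs are hypothetical placeholders.

```python
# Minimal sketch: create an HTTPS listener backed by an ACM certificate and turn
# cross-zone load balancing off for a target group with Boto3. All ARNs are
# hypothetical placeholders for your own resources.
import boto3

elbv2 = boto3.client("elbv2")

elbv2.create_listener(
    LoadBalancerArn="arn:aws:elasticloadbalancing:region:acct:loadbalancer/app/my-alb/abc",  # placeholder
    Protocol="HTTPS",
    Port=443,
    Certificates=[{"CertificateArn": "arn:aws:acm:region:acct:certificate/example"}],        # placeholder
    DefaultActions=[{
        "Type": "forward",
        "TargetGroupArn": "arn:aws:elasticloadbalancing:region:acct:targetgroup/my-tg/abc",  # placeholder
    }],
)

# Zonal isolation: disable cross-zone load balancing at the target group level.
elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:region:acct:targetgroup/my-tg/abc",          # placeholder
    Attributes=[{"Key": "load_balancing.cross_zone.enabled", "Value": "false"}],
)
```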
Related videos:
• AWS re:Invent 2018: Elastic Load Balancing: Deep Dive and Best Practices
• AWS re:Invent 2021 - How to choose the right load balancer for your AWS workloads
• AWS re:Invent 2019: Get the most from Elastic Load Balancing for different workloads
Related examples:
Make decisions about protocols for communication between systems and networks based on the
impact to the workload’s performance.
Latency and bandwidth together determine achievable throughput. If your file transfer
is using Transmission Control Protocol (TCP), higher latencies will most likely reduce overall
throughput. There are approaches to fix this with TCP tuning and optimized transfer protocols, but
one solution is to use User Datagram Protocol (UDP).
Common anti-patterns:
Benefits of establishing this best practice: Verifying that an appropriate protocol is used for
communication between users and workload components helps improve overall user experience
for your applications. For instance, connection-less UDP allows for high speed, but it doesn't offer
retransmission or high reliability. TCP is a full featured protocol, but it requires greater overhead
for processing the packets.
Implementation guidance
If you have the ability to choose different protocols for your application and you have expertise in
this area, optimize your application and end-user experience by using a different protocol. Note
that this approach comes with significant difficulty and should only be attempted if you have
optimized your application in other ways first.
lower latency between your client devices and your workload on AWS. With AWS Transfer
Family, you can use TCP-based protocols such as Secure Shell File Transfer Protocol (SFTP) and
File Transfer Protocol over SSL (FTPS) to securely scale and manage your file transfers to AWS
storage services.
• Use network latency to determine if TCP is appropriate for communication between workload
components. If the network latency between your client application and server is high, then
the TCP three-way handshake can take some time, thereby impacting the responsiveness
of your application. Metrics such as time to first byte (TTFB) and round-trip time (RTT) can be
used to measure network latency. If your workload serves dynamic content to users, consider
using Amazon CloudFront, which establishes a persistent connection to each origin for dynamic
content to remove the connection setup time that would otherwise slow down each client
request.
• Using TLS with TCP or UDP can result in increased latency and reduced throughput for your
workload due to the impact of encryption and decryption. For such workloads, consider SSL/
TLS offloading on Elastic Load Balancing to improve workload performance by allowing the
load balancer to handle SSL/TLS encryption and decryption process instead of having backend
instances do it. This can help reduce the CPU utilization on the backend instances, which can
improve performance and increase capacity.
• Use the Network Load Balancer (NLB) to deploy services that rely on the UDP protocol, such
as authentication and authorization, logging, DNS, IoT, and streaming media, to improve the
performance and reliability of your workload. The NLB distributes incoming UDP traffic across
multiple targets, allowing you to scale your workload horizontally, increase capacity, and reduce
the overhead of a single target.
• For your High Performance Computing (HPC) workloads, consider using the Elastic Network
Adapter (ENA) Express functionality that uses the SRD protocol to improve network performance
by providing a higher single flow bandwidth (25 Gbps) and lower tail latency (99.9 percentile) for
network traffic between EC2 instances.
• Use the Application Load Balancer (ALB) to route and load balance your gRPC (Remote Procedure
Calls) traffic between workload components or between gRPC clients and services. gRPC uses the
TCP-based HTTP/2 protocol for transport and it provides performance benefits such as lighter
network footprint, compression, efficient binary serialization, support for numerous languages,
and bi-directional streaming.
Resources
Related documents:
Implementation guidance
Resources, such as Amazon EC2 instances, are placed into Availability Zones within AWS Regions,
AWS Local Zones, AWS Outposts, or AWS Wavelength zones. Selection of this location influences
network latency and throughput from a given user location. Edge services like Amazon CloudFront
and AWS Global Accelerator can also be used to improve network performance by either caching
content at edge locations or providing users with an optimal path to the workload through the
AWS global network.
Amazon EC2 provides placement groups for networking. A placement group is a logical grouping
of instances to decrease latency. Using placement groups with supported instance types and an
Elastic Network Adapter (ENA) enables workloads to participate in a low-latency, reduced jitter 25
Gbps network. Placement groups are recommended for workloads that benefit from low network
latency, high network throughput, or both.
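As a minimal sketch of using placement groups (with Boto3), the following creates a cluster placement group and launches instances into it; the AMI ID and instance type are hypothetical, and you should choose an ENA-capable instance type that supports cluster placement.

```python
# Minimal sketch: create a cluster placement group and launch instances into it with
# Boto3. The AMI ID and instance type are hypothetical placeholders; use a current
# AMI and an ENA-capable instance type that supports cluster placement.
import boto3

ec2 = boto3.client("ec2")

ec2.create_placement_group(GroupName="low-latency-pg", Strategy="cluster")

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",     # placeholder AMI
    InstanceType="c5n.9xlarge",          # example ENA-capable instance type
    MinCount=2,
    MaxCount=2,
    Placement={"GroupName": "low-latency-pg"},
)
```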
Latency-sensitive services are delivered at edge locations using AWS global network, such as
Amazon CloudFront. These edge locations commonly provide services like content delivery
network (CDN) and domain name system (DNS). By having these services at the edge, workloads
can respond with low latency to requests for content or DNS resolution. These services also provide
geographic services, such as geotargeting of content (providing different content based on the
end users’ location) or latency-based routing to direct end users to the nearest Region (minimum
latency).
Use edge services to reduce latency and to enable content caching. Configure cache control
correctly for both DNS and HTTP/HTTPS to gain the most benefit from these approaches.
Implementation steps
• Capture information about the IP traffic going to and from network interfaces.
• Logging IP traffic using VPC Flow Logs
• How the client IP address is preserved in AWS Global Accelerator
• Analyze network access patterns in your workload to identify how users use your application.
• Use monitoring tools, such as Amazon CloudWatch and AWS CloudTrail, to gather data on
network activities.
• Analyze the data to identify the network access pattern.
• Select Regions for your workload deployment based on the following key elements:
Amazon CloudFront Functions: Use for simple use cases like HTTP(s) requests or response
manipulations that can be initiated by short-lived functions.
• Some applications require fixed entry points or higher performance by reducing first byte latency
and jitter, and increasing throughput. These applications can benefit from networking services
that provide static anycast IP addresses and TCP termination at edge locations. AWS Global
Accelerator can improve performance for your applications by up to 60% and provide quick
failover for multi-region architectures. AWS Global Accelerator provides you with static anycast
IP addresses that serve as a fixed entry point for your applications hosted in one or more AWS
Regions. These IP addresses permit traffic to ingress onto the AWS global network as close to
your users as possible. AWS Global Accelerator reduces the initial connection setup time by
establishing a TCP connection between the client and the AWS edge location closest to the
client. Review the use of AWS Global Accelerator to improve the performance of your TCP/UDP
workloads and provide quick failover for multi-Region architectures.
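As a minimal sketch of setting up AWS Global Accelerator (with Boto3), the following creates an accelerator and a TCP listener; the name and ports are example values, and the Global Accelerator API is called in the us-west-2 Region.

```python
# Minimal sketch: create an accelerator and a TCP listener with Boto3. The Global
# Accelerator API is served from the us-west-2 Region; the name and ports are
# example values only.
import boto3

ga = boto3.client("globalaccelerator", region_name="us-west-2")

accelerator = ga.create_accelerator(
    Name="my-app-accelerator",     # hypothetical name
    IpAddressType="IPV4",
    Enabled=True,
)

ga.create_listener(
    AcceleratorArn=accelerator["Accelerator"]["AcceleratorArn"],
    Protocol="TCP",
    PortRanges=[{"FromPort": 443, "ToPort": 443}],
)
```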
Resources
• SUS01-BP01 Choose Region based on both business requirements and sustainability goals
Related documents:
Use collected and analyzed data to make informed decisions about optimizing your network
configuration.
Common anti-patterns:
Benefits of establishing this best practice: Collecting necessary metrics of your AWS network
and implementing network monitoring tools allows you to understand network performance and
optimize network configurations.
Implementation guidance
Monitoring traffic to and from VPCs, subnets, or network interfaces is crucial to understand how to
utilize AWS network resources and optimize network configurations. By using the following AWS
networking tools, you can further inspect information about the traffic usage, network access and
logs.
Implementation steps
• Identify the key performance metrics such as latency or packet loss to collect. AWS provides
several tools that can help you to collect these metrics. By using the following tools, you can
further inspect information about the traffic usage, network access, and logs:
Amazon VPC IP Address Manager (IPAM): Use IPAM to plan, track, and monitor IP addresses for
your AWS and on-premises workloads. This is a best practice to optimize IP address usage and
allocation.
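As a minimal sketch of capturing traffic information with VPC Flow Logs (using Boto3), the following enables flow logs for a VPC with delivery to CloudWatch Logs; the VPC ID, log group, and IAM role are hypothetical placeholders.

```python
# Minimal sketch: enable VPC Flow Logs for a VPC, delivered to CloudWatch Logs, with
# Boto3. The VPC ID, log group name, and IAM role ARN are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2")

ec2.create_flow_logs(
    ResourceType="VPC",
    ResourceIds=["vpc-0123456789abcdef0"],                     # placeholder VPC ID
    TrafficType="ALL",                                         # ACCEPT, REJECT, or ALL
    LogDestinationType="cloud-watch-logs",
    LogGroupName="/vpc/flow-logs/my-workload",                 # placeholder log group
    DeliverLogsPermissionArn="arn:aws:iam::111122223333:role/flow-logs-role",  # placeholder role
)
```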
• Optimize performance and reduce costs for network analytics with VPC Flow Logs in Apache
Parquet format
• Monitoring your global and core networks with Amazon CloudWatch metrics
• Continuously monitor network traffic and resources
Related videos:
Related examples:
When architecting workloads, there are principles and practices that you can adopt to help you
run efficient, high-performing cloud workloads. To adopt a culture that fosters performance
efficiency of cloud workloads, consider these key principles and practices:
might use page load time as an indication of overall performance. This metric would be one of
multiple data points that measures user experience. In addition to identifying the page load time
thresholds, you should document the expected outcome or business risk if ideal performance is not
met. A long page load time affects your end users directly, decreases their user experience rating,
and can lead to a loss of customers. When you define your KPI thresholds, combine both industry
benchmarks and your end user expectations. For example, if the current industry benchmark is a
webpage loading within a two-second time period, but your end users expect a webpage to load
within a one-second time period, then you should take both of these data points into consideration
when establishing the KPI.
Your team must evaluate your workload KPIs using real-time granular data and historical data for
reference and create dashboards that perform metric math on your KPI data to derive operational
and utilization insights. KPIs should be documented and include thresholds that support business
goals and strategies, and should be mapped to metrics being monitored. KPIs should be revisited
when business goals, strategies, or end user requirements change.
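As a minimal sketch of deriving a KPI with CloudWatch metric math (using Boto3), the following computes an error-rate percentage from two underlying load balancer metrics; the namespace, metric names, and dimension value are hypothetical examples.

```python
# Minimal sketch: use CloudWatch metric math to derive a KPI (error rate as a
# percentage of requests) from two underlying metrics. The namespace, metric names,
# and load balancer dimension value are hypothetical examples.
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")

response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {"Id": "errors", "MetricStat": {
            "Metric": {"Namespace": "AWS/ApplicationELB",
                       "MetricName": "HTTPCode_Target_5XX_Count",
                       "Dimensions": [{"Name": "LoadBalancer", "Value": "app/my-alb/123"}]},
            "Period": 300, "Stat": "Sum"}, "ReturnData": False},
        {"Id": "requests", "MetricStat": {
            "Metric": {"Namespace": "AWS/ApplicationELB",
                       "MetricName": "RequestCount",
                       "Dimensions": [{"Name": "LoadBalancer", "Value": "app/my-alb/123"}]},
            "Period": 300, "Stat": "Sum"}, "ReturnData": False},
        # Metric math expression: the derived KPI is returned as a single series.
        {"Id": "error_rate", "Expression": "100 * errors / requests",
         "Label": "ErrorRatePercent"},
    ],
    StartTime=datetime.datetime.utcnow() - datetime.timedelta(hours=3),
    EndTime=datetime.datetime.utcnow(),
)
print(response["MetricDataResults"])
```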
Implementation steps
• Identify stakeholders: Identify and document key business stakeholders, including development
and operation teams.
• Define objectives: Work with these stakeholders to define and document objectives of your
workload. Consider the critical performance aspects of your workloads, such as throughput,
response time, and cost, as well as business goals, such as user satisfaction.
• Review industry best practices: Review industry best practices to identify relevant KPIs aligned
with your workload objectives.
• Identify metrics: Identify metrics that are aligned with your workload objectives and can help
you measure performance and business goals. Establish KPIs based on these metrics. Example
metrics are measurements like average response time or number of concurrent users.
• Define and document KPIs: Use industry best practices and your workload objectives to set
targets for your workload KPI. Use this information to set KPI thresholds for severity or alarm
level. Identify and document the risk and impact if a KPI is not met.
• Implement monitoring: Use monitoring tools such as Amazon CloudWatch or AWS Config to
collect metrics and measure KPIs.
• Visually communicate KPIs: Use dashboard tools like Amazon QuickSight to visualize and
communicate KPIs with stakeholders.
PERF05-BP02 Use monitoring solutions to understand the areas where performance is most
critical
Understand and identify areas where increasing the performance of your workload will have a
positive impact on efficiency or customer experience. For example, a website that has a large
amount of customer interaction can benefit from using edge services to move content delivery
closer to customers.
Common anti-patterns:
• You assume that standard compute metrics such as CPU utilization or memory pressure are
enough to catch performance issues.
• You only use the default metrics recorded by your selected monitoring software.
Benefits of establishing this best practice: Understanding critical areas of performance helps
workload owners monitor KPIs and prioritize high-impact improvements.
Implementation guidance
Set up end-to-end tracing to identify traffic patterns, latency, and critical performance areas.
Monitor your data access patterns for slow queries or poorly fragmented and partitioned data.
Identify the constrained areas of the workload using load testing or monitoring.
Increase performance efficiency by understanding your architecture, traffic patterns, and data
access patterns, and identify your latency and processing times. Identify the potential bottlenecks
that might affect the customer experience as the workload grows. After investigating these areas,
look at which solution you could deploy to remove those performance concerns.
Implementation steps
• Set up end-to-end monitoring to capture all workload components and metrics. Here are
examples of monitoring solutions on AWS.
Resources
Related documents:
Related videos:
Related examples:
Define a process to evaluate new services, design patterns, resource types, and configurations as
they become available. For example, run existing performance tests on new instance offerings to
determine their potential to improve your workload.
• Revisit and refine: Regularly review your performance improvement process to identify areas for
enhancement.
Resources
Related documents:
• AWS Blog
• What's New with AWS
• AWS Skill Builder
Related videos:
Related examples:
• AWS Github
Load test your workload to verify it can handle production load and identify any performance
bottlenecks.
Common anti-patterns:
• You load test individual parts of your workload but not your entire workload.
• You load test on infrastructure that is not the same as your production environment.
• You only conduct load testing up to your expected load and not beyond, which prevents you
from foreseeing where you may have future problems.
• You perform load testing without consulting the Amazon EC2 Testing Policy and submitting
a Simulated Event Submissions Form. This results in your test failing to run, as it looks like a
denial-of-service event.
• Continually iterate: Load testing should be performed at a regular cadence, especially after a
system change or update.
Resources
Related documents:
Related videos:
• AWS Summit ANZ 2023: Accelerate with confidence through AWS Distributed Load Testing
• AWS re:Invent 2022 - Scaling on AWS for your first 10 million users
• AWS re:Invent 2021 - Optimize applications through end user insights with Amazon CloudWatch
RUM
Related examples:
Use key performance indicators (KPIs), combined with monitoring and alerting systems, to
proactively address performance-related issues.
Common anti-patterns:
• You only allow operations staff the ability to make operational changes to the workload.
• You let all alarms filter to the operations team with no proactive remediation.
• Review and refine: Regularly assess the effectiveness of the automated remediation workflow.
Adjust initiation events and remediation logic if necessary.
Resources
Related documents:
• CloudWatch Documentation
• X-Ray Documentation
• Build a Cloud Automation Practice for Operational Excellence: Best Practices from AWS Managed
Services
• Automate your Amazon Redshift performance tuning with automatic table optimization
Related videos:
• AWS re:Invent 2023 - Strategies for automated scaling, remediation, and smart self-healing
• AWS re:Invent 2022 - Automating patch management and compliance using AWS
• AWS re:Invent 2022 - How Amazon uses better metrics for improved website performance
• AWS re:Invent 2023 - Take a load off: Diagnose & resolve performance issues with Amazon RDS
• AWS re:Invent 2021 - {New Launch} Automatically detect and resolve issues with Amazon
DevOps Guru
Related examples:
instances are running the software and configurations required by your software policy and
which instances need to be updated.
• Assess the new update: Understand how to update the components of your workload. Take
advantage of agility in the cloud to quickly test how new features can improve your workload to
gain performance efficiency.
• Use automation: Use automation for the update process to reduce the level of effort to deploy
new features and limit errors caused by manual processes.
• You can use CI/CD to automatically update AMIs, container images, and other artifacts related
to your cloud application.
• You can use tools such as AWS Systems Manager Patch Manager to automate the process of
system updates, and schedule the activity using AWS Systems Manager Maintenance Windows.
• Document the process: Document your process for evaluating updates and new services. Provide
your owners the time and space needed to research, test, experiment, and validate updates and
new services. Refer back to the documented business requirements and KPIs to help prioritize
which update will make a positive business impact.
Resources
Related documents:
• AWS Blog
Related videos:
• AWS re:Inforce 2022 - Automating patch management and compliance using AWS
Related examples:
• Identify corrective actions: Use your analysis to identify corrective actions. This may include
parameter tuning, fixing bugs, and scaling resources.
• Document findings: Document your findings, including identified issues, root causes, and
corrective actions.
• Iterate and improve: Continually assess and improve the metrics review process. Use the lesson
learned from previous review to enhance the process over time.
Resources
Related documents:
• CloudWatch Documentation
• Collect metrics and logs from Amazon EC2 Instances and on-premises servers with the
CloudWatch Agent
• X-Ray Documentation
Related videos:
• AWS re:Invent 2022 - How Amazon uses better metrics for improved website performance
• AWS re:Invent 2023 - Take a load off: Diagnose & resolve performance issues with Amazon RDS
Related examples:
• CloudWatch Dashboards
Create a team (Cloud Business Office, Cloud Center of Excellence, or FinOps team) that is
responsible for establishing and maintaining cost awareness across your organization. The owner
of cost optimization can be an individual or a team (requires people from finance, technology, and
business teams) that understands the entire organization and cloud finance.
Implementation guidance
This is the introduction of a Cloud Business Office (CBO) or Cloud Center of Excellence (CCOE)
function or team that is responsible for establishing and maintaining a culture of cost awareness in
cloud computing. This function can be an existing individual, a team within your organization, or a
new team of key finance, technology, and organization stakeholders from across the organization.
The function (individual or team) prioritizes and spends the required percentage of their time on
cost management and cost optimization activities. For a small organization, the function might
spend a smaller percentage of time compared to a full-time function for a larger enterprise.
The function requires a multi-disciplinary approach, with capabilities in project management, data
science, financial analysis, and software or infrastructure development. They can improve workload
efficiency by running cost optimizations under three different ownership models:
• Centralized: Through designated teams such as FinOps team, Cloud Financial Management
(CFM) team, Cloud Business Office (CBO), or Cloud Center of Excellence (CCoE), customers can
design and implement governance mechanisms and drive best practices company-wide.
• Decentralized: Influencing technology teams to run cost optimizations.
• Hybrid: Combination of both centralized and decentralized teams can work together to run cost
optimizations.
The function may be measured against their ability to run and deliver against cost optimization
goals (for example, workload efficiency metrics).
You must secure executive sponsorship for this function, which is a key success factor. The sponsor
is regarded as a champion for cost efficient cloud consumption, and provides escalation support
for the team to ensure that cost optimization activities are treated with the level of priority defined
by the organization. Otherwise, guidance can be ignored and cost saving opportunities will not be
prioritized.
During these regular reviews, you can review workload efficiency (cost) and business outcome.
For example, a 20% cost increase for a workload may align with increased customer usage. In
this case, this 20% cost increase can be interpreted as an investment. These regular cadence calls
can help teams to identify value KPIs that provide meaning to the entire organization.
Resources
Related documents:
Related videos:
Related examples:
Involve finance and technology teams in cost and usage discussions at all stages of your cloud
journey. Teams regularly meet and discuss topics such as organizational goals and targets, current
state of cost and usage, and financial and accounting practices.
Implementation guidance
Technology teams innovate faster in the cloud due to shortened approval, procurement, and
infrastructure deployment cycles. This can be an adjustment for finance organizations previously
used to running time-consuming and resource-intensive processes for procuring and deploying
capital in data center and on-premises environments, and cost allocation only at project approval.
teams to innovate faster – the agility and ability to spin up and then tear down experiments. While
the variable nature of cloud consumption may impact predictability from a capital budgeting
and forecasting perspective, cloud provides organizations with the ability to reduce the cost of
over-provisioning, as well as reduce the opportunity cost associated with conservative under-
provisioning.
Establish a partnership between key finance and technology stakeholders to create a shared
understanding of organizational goals and develop mechanisms to succeed financially in the
variable spend model of cloud computing. Relevant teams within your organization must be
involved in cost and usage discussions at all stages of your cloud journey, including:
• Financial leads: CFOs, financial controllers, financial planners, business analysts, procurement,
sourcing, and accounts payable must understand the cloud model of consumption, purchasing
options, and the monthly invoicing process. Finance needs to partner with technology teams
to create and socialize an IT value story, helping business teams understand how technology
spend is linked to business outcomes. This way, technology expenditures are viewed not as
costs, but rather as investments. Due to the fundamental differences between the cloud (such
as the rate of change in usage, pay as you go pricing, tiered pricing, pricing models, and detailed
billing and usage information) compared to on-premises operation, it is essential that the finance
engagement models and a return on investment (ROI). Typically, third parties will contribute to
reporting and analysis of any workloads that they manage, and they will provide cost analysis of
any workloads that they design.
Implementing CFM and achieving success requires collaboration across finance, technology,
and business teams, and a shift in how cloud spend is communicated and evaluated across
the organization. Include engineering teams so that they can be part of these cost and usage
discussions at all stages, and encourage them to follow best practices and take agreed-upon
actions accordingly.
Implementation steps
• Define key members: Verify that all relevant members of your finance and technology teams
participate in the partnership. Relevant finance members will be those having interaction with
the cloud bill. This will typically be CFOs, financial controllers, financial planners, business
analysts, procurement, and sourcing. Technology members will typically be product and
application owners, technical managers and representatives from all teams that build on the
cloud. Other members may include business unit owners, such as marketing, that will influence
usage of products, and third parties such as consultants, to achieve alignment to your goals and
mechanisms, and to assist with reporting.
• Define topics for discussion: Define the topics that are common across the teams, or will need
a shared understanding. Follow cost from the time it is created until the bill is paid. Note any
members involved, and organizational processes that are required to be applied. Understand
each step or process it goes through and the associated information, such as pricing models
available, tiered pricing, discount models, budgeting, and financial requirements.
• Establish regular cadence: To create a finance and technology partnership, establish a regular
communication cadence to create and maintain alignment. The group needs to come together
regularly to review progress against their goals and metrics. A typical cadence involves reviewing the state of the
organization, reviewing any programs currently running, and reviewing overall financial and
optimization metrics. Then key workloads are reported on in greater detail.
Resources
Related documents:
Identify the business drivers that can impact your usage cost, and forecast for each of them
separately to calculate expected usage in advance. Some of the drivers might be linked to IT
and product teams within the organization. Other business drivers, such as marketing events,
promotions, geographic expansions, mergers, and acquisitions, are known by your sales, marketing,
and business leaders, and it's important to collaborate and account for all those demand drivers as
well.
You can use AWS Cost Explorer for trend-based forecasting in a defined future time range based
on your past spend. AWS Cost Explorer's forecasting engine segments your historical data based on
charge types (for example, Reserved Instances) and uses a combination of machine learning and
rule-based models to predict spend across all charge types individually.
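As a minimal sketch of trend-based forecasting with the Cost Explorer API (using Boto3), the following requests a monthly cost forecast for roughly the next quarter; the date range and metric are example values.

```python
# Minimal sketch: request a trend-based monthly cost forecast from the Cost Explorer
# API with Boto3. The date range is an example; amounts are returned in the account's
# billing currency.
import datetime
import boto3

ce = boto3.client("ce")

today = datetime.date.today()
forecast = ce.get_cost_forecast(
    TimePeriod={
        "Start": today.isoformat(),
        "End": (today + datetime.timedelta(days=90)).isoformat(),
    },
    Metric="UNBLENDED_COST",
    Granularity="MONTHLY",
)
print(forecast["Total"]["Amount"], forecast["Total"]["Unit"])
```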
Once you've established your forecast process and built models, you can use AWS Budgets to set
custom budgets at a granular level by specifying the time period, recurrence, or amount (fixed or
variable) and add filters such as service, AWS Region, and tags. The budget is usually prepared for a
single year and remains fixed, which requires strict adherence from everyone involved. In contrast,
forecasts are more flexible, which allows for readjustments throughout the year and provides
dynamic projections over a period of one, two, or three years. Both budgets and forecasts play
a crucial role when you establish financial expectations among various technology and business
stakeholders. Accurate forecasts and implementation also provide accountability to stakeholders
who are directly responsible for provisioning cost in the first place, and it can also raise their overall
cost awareness.
To stay informed on the performance of your existing budgets, you can create and schedule AWS
Budgets reports to email you and your stakeholders on a regular cadence. You can also create AWS
• Update existing forecast and budget processes: Based on adopted forecast methods such
as trend-based, business driver-based, or a combination of both forecasting methods, define
your forecast and budget processes. Budgets should be calculated, realistic, and based on your
forecasts.
• Configure alerts and notifications: Use AWS Budgets alerts and cost anomaly detection to get
alerts and notifications.
• Perform regular reviews with key stakeholders: For example, align on changes in business
direction and usage with stakeholders in IT, finance, platform teams, and other areas of the
business.
Resources
Related documents:
• Amazon Forecast
• AWS Budgets
Related videos:
Related examples:
Implementation steps
• Identify relevant organizational processes: Each organizational unit reviews their processes
and identifies processes that impact cost and usage. Any processes that result in the creation or
termination of a resource need to be included for review. Look for processes that can support
cost awareness in your business, such as incident management and training.
• Establish a self-sustaining cost-aware culture: Make sure all relevant stakeholders understand
the cause of changes and their cost impact so that they understand cloud cost. This will allow your
organization to establish a self-sustaining cost-aware culture of innovation.
• Update processes with cost awareness: Each process is modified to be made cost aware. The
process may require additional pre-checks, such as assessing the impact of cost, or post-checks
validating that the expected changes in cost and usage occurred. Supporting processes such as
training and incident management can be extended to include items for cost and usage.
To get help, reach out to CFM experts through your Account team, or explore the resources and
related documents below.
Resources
Related documents:
Related examples:
Set up cloud budgets and configure mechanisms to detect anomalies in usage. Configure related
tools for cost and usage alerts against pre-defined targets and receive notifications when any
usage exceeds those targets. Have regular meetings to analyze the cost-effectiveness of your
workloads and promote cost awareness.
commitment, providing insight into estimated savings, Savings Plans coverage, and Savings Plans
utilization. This helps organizations to understand how their Savings Plans apply to each hour of
spend without having to invest time and resources into building models to analyze their spend.
Periodically create reports containing a highlight of Savings Plans, Reserved Instances, and
Amazon EC2 rightsizing recommendations from AWS Cost Explorer to start reducing the cost
associated with steady-state workloads, idle, and underutilized resources. Identify and recoup
spend associated with cloud waste for resources that are deployed. Cloud waste occurs when
incorrectly-sized resources are created or different usage patterns are observed instead of what is
expected. Follow AWS best practices to reduce your waste or ask your account team and partner to
help you to optimize and save your cloud costs.
Generate reports regularly for better purchasing options for your resources to drive down unit
costs for your workloads. Purchasing options such as Savings Plans, Reserved Instances, or
Amazon EC2 Spot Instances offer the deepest cost savings for fault-tolerant workloads and
allow stakeholders (business owners, finance, and tech teams) to be part of these commitment
discussions.
Share the reports that contain opportunities or new release announcements that may help you to
reduce total cost of ownership (TCO) of the cloud. Adopt new services, Regions, features, solutions,
or new ways to achieve further cost reductions.
Implementation steps
• Configure AWS Budgets: Configure AWS Budgets on all accounts for your workload. Set a
budget for the overall account spend, and a budget for the workload by using tags (a minimal
sketch follows this list).
• Well-Architected Labs: Cost and Governance Usage
• Report on cost optimization: Set up a regular cycle to discuss and analyze the efficiency of
the workload. Using the metrics established, report on the metrics achieved and the cost of
achieving them. Identify and fix any negative trends, as well as positive trends that you can
promote across your organization. Reporting should involve representatives from the application
teams and owners, finance, and key decision makers with respect to cloud expenditure.
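As a minimal sketch of the AWS Budgets step above (using Boto3), the following creates a monthly cost budget scoped by a tag, with an alert at 80% of actual spend; the account ID, budget amount, tag filter, and email address are hypothetical examples.

```python
# Minimal sketch: create a monthly cost budget with an 80% actual-spend alert using
# Boto3. The account ID, budget amount, tag filter, and email address are
# hypothetical examples.
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="111122223333",                       # placeholder account ID
    Budget={
        "BudgetName": "workload-monthly-cost",
        "BudgetLimit": {"Amount": "1000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
        # Hypothetical tag-based scope using the Budgets cost filter convention.
        "CostFilters": {"TagKeyValue": ["user:workload$my-workload"]},
    },
    NotificationsWithSubscribers=[{
        "Notification": {
            "NotificationType": "ACTUAL",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": 80.0,
            "ThresholdType": "PERCENTAGE",
        },
        "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "finops@example.com"}],
    }],
)
```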
Resources
Related documents:
• Report on cost optimization: Set up a regular cycle to discuss and analyze the efficiency of
the workload. Using the metrics established, report on the metrics achieved and the cost of
achieving them. Identify and fix any negative trends, and identify positive trends to promote
across your organization. Reporting should involve representatives from the application teams
and owners, finance, and management.
• Create and activate daily granularity AWS Budgets for the cost and usage to take timely
actions to prevent any potential cost overruns: AWS Budgets allow you to configure alert
notifications, so you stay informed if any of your budget types fall out of your pre-configured
thresholds. The best way to leverage AWS Budgets is to set your expected cost and usage as your
limits, so that anything above your budgets can be considered overspend.
• Create AWS Cost Anomaly Detection for cost monitor: AWS Cost Anomaly Detection uses
advanced Machine Learning technology to identify anomalous spend and root causes, so you
can quickly take action. It allows you to configure cost monitors that define spend segments you
want to evaluate (for example, individual AWS services, member accounts, cost allocation tags,
and cost categories), and lets you set when, where, and how you receive your alert notifications.
For each monitor, attach multiple alert subscriptions for business owners and technology
teams, including a name, a cost impact threshold, and alerting frequency (individual alerts, daily
summary, weekly summary) for each subscription.
• Use AWS Cost Explorer or integrate your AWS Cost and Usage Report (CUR) data with Amazon
QuickSight dashboards to visualize your organization’s costs: AWS Cost Explorer has an easy-
to-use interface that lets you visualize, understand, and manage your AWS costs and usage over
time. The Cost Intelligence Dashboard is a customizable and accessible dashboard to help create
the foundation of your own cost management and optimization tool.
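As an illustration of the Cost Anomaly Detection step above, the following Boto3 sketch creates a service-level cost monitor and a daily summary subscription; the monitor name, email addresses, and $1,000 impact threshold are placeholder values, not recommendations.

    import boto3

    ce = boto3.client("ce")  # the Cost Explorer API also hosts Cost Anomaly Detection

    # Monitor AWS service spend for anomalies across the account.
    monitor = ce.create_anomaly_monitor(
        AnomalyMonitor={
            "MonitorName": "service-spend-monitor",
            "MonitorType": "DIMENSIONAL",
            "MonitorDimension": "SERVICE",
        }
    )

    # Alert business and technology owners when anomaly impact exceeds $1,000,
    # summarized daily. (Newer API versions also accept a ThresholdExpression.)
    ce.create_anomaly_subscription(
        AnomalySubscription={
            "SubscriptionName": "daily-anomaly-summary",
            "MonitorArnList": [monitor["MonitorArn"]],
            "Subscribers": [
                {"Address": "finops-team@example.com", "Type": "EMAIL"},
                {"Address": "workload-owners@example.com", "Type": "EMAIL"},
            ],
            "Frequency": "DAILY",
            "Threshold": 1000.0,
        }
    )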
Resources
Related documents:
• AWS Budgets
• AWS Cost Explorer
• Daily Cost and Usage Budgets
• AWS Cost Anomaly Detection
Related examples:
• Meet with your account team: Schedule a regular cadence to meet with your account team and
discuss industry trends and AWS services. Speak with your account manager, Solutions Architect,
and support team.
Resources
Related documents:
Related examples:
Implementation guidance
A cost-aware culture allows you to scale cost optimization and Cloud Financial Management
(financial operations, cloud center of excellence, cloud operations teams, and so on) through best
practices that are performed in an organic and decentralized manner across your organization.
Cost awareness allows you to create high levels of capability across your organization with minimal
effort, compared to a strict top-down, centralized approach.
Creating cost awareness, especially around the primary cost drivers in cloud computing, allows
teams to understand the expected cost impact of any change they make. Teams that access cloud
environments should be aware of pricing models and the differences between traditional
on-premises data centers and cloud computing.
• AWS events and meetups: Attend local AWS summits, and any local meetups with other
organizations from your local area.
• Subscribe to blogs: Go to the AWS blogs pages and subscribe to the What's New Blog and other
relevant blogs to follow new releases, implementations, examples, and changes shared by AWS.
Resources
Related documents:
• AWS Blog
• AWS Cost Management
• AWS News Blog
Related examples:
Quantifying business value from cost optimization allows you to understand the entire set of
benefits to your organization. Because cost optimization is a necessary investment, quantifying
business value allows you to explain the return on investment to stakeholders. Quantifying
business value can help you gain more buy-in from stakeholders on future cost optimization
investments, and provides a framework to measure the outcomes for your organization’s cost
optimization activities.
Implementation guidance
Quantifying the business value means measuring the benefits that businesses gain from the
actions and decisions they take. Business value can be tangible (like reduced expenses or increased
profits) or intangible (like improved brand reputation or increased customer satisfaction).
Quantifying business value from cost optimization means determining how much value or benefit
you’re getting from your efforts to spend more efficiently. For example, if a company spends
$100,000 to deploy a workload on AWS and later optimizes it so that the new cost is only $80,000
for the same output, the quantified business value is the $20,000 (20 percent) reduction in spend.
Related videos:
Related examples:
Establish policies and mechanisms to verify that appropriate costs are incurred while objectives are
achieved. By employing a checks-and-balances approach, you can innovate without overspending.
Best practices
• COST02-BP01 Develop policies based on your organization requirements
• COST02-BP02 Implement goals and targets
• COST02-BP03 Implement an account structure
• COST02-BP04 Implement groups and roles
• COST02-BP05 Implement cost controls
• COST02-BP06 Track project lifecycle
lower performance storage in test and development environments), which types of resources can
be used by different groups (for example, the largest size of resource in a development account
is medium) and how long these resources will be in use (whether temporary, short term, or for a
specific period of time).
Policy example
The following is a sample policy you can review to create your own cloud governance policies,
which focus on cost optimization. Make sure you adjust policy based on your organization’s
requirements and your stakeholders’ requests.
• Policy name: Define a clear policy name, such as Resource Optimization and Cost Reduction
Policy.
• Purpose: Explain why this policy should be used and what the expected outcome is. For example,
the objective of this policy is to verify that workloads are deployed and run at the minimum cost
required to meet business requirements.
• Scope: Clearly define who should use this policy and when it should be used, for example: the
DevOps X team uses this policy for X environments (production or non-production) in the us-east Regions.
Policy statement
1. Select us-east-1 or multiple us-east Regions based on your workload’s environment and business
requirements (development, user acceptance testing, pre-production, or production).
2. Schedule Amazon EC2 and Amazon RDS instances to run between six in the morning and eight
at night (Eastern Standard Time (EST)).
3. Stop all unused Amazon EC2 instances after eight hours and unused Amazon RDS instances
after 24 hours of inactivity.
4. Terminate all unused Amazon EC2 instances after 24 hours of inactivity in non-production
environments. Remind Amazon EC2 instance owner (based on tags) to review their stopped
Amazon EC2 instances in production and inform them that their Amazon EC2 instances will be
terminated within 72 hours if they are not in use.
5. Use a general purpose instance family and size, such as m5.large, and then resize the instance
based on CPU and memory utilization using AWS Compute Optimizer.
6. Prioritize using auto scaling to dynamically adjust the number of running instances based on
traffic.
7. Use Spot Instances for non-critical workloads.
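To illustrate how policy statements such as 2 through 4 could be automated, the following Boto3 sketch stops running Amazon EC2 and Amazon RDS resources that carry an assumed Environment=development tag; in practice it would be invoked on a schedule (for example, by Amazon EventBridge Scheduler at eight at night), and the tag key and value are placeholders.

    import boto3

    ec2 = boto3.client("ec2")
    rds = boto3.client("rds")

    # Find running EC2 instances tagged as development and stop them.
    paginator = ec2.get_paginator("describe_instances")
    instance_ids = []
    for page in paginator.paginate(
        Filters=[
            {"Name": "tag:Environment", "Values": ["development"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    ):
        for reservation in page["Reservations"]:
            instance_ids.extend(i["InstanceId"] for i in reservation["Instances"])

    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)

    # Stop standalone development RDS instances the same way, matching on a tag.
    for db in rds.describe_db_instances()["DBInstances"]:
        tags = rds.list_tags_for_resource(ResourceName=db["DBInstanceArn"])["TagList"]
        if ({"Key": "Environment", "Value": "development"} in tags
                and db["DBInstanceStatus"] == "available"):
            rds.stop_db_instance(DBInstanceIdentifier=db["DBInstanceIdentifier"])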
• Define locations for your workload: Define where your workload operates, including the
country and the area within the country. This information is used for mapping to AWS Regions
and Availability Zones.
• Define and group services and resources: Define the services that the workloads require. For
each service, specify the types, the size, and the number of resources required. Define groups for
the resources by function, such as application servers or database storage. Resources can belong
to multiple groups.
• Define and group the users by function: Define the users that interact with the workload,
focusing on what they do and how they use the workload, not on who they are or their position
in the organization. Group similar users or functions together. You can use the AWS managed
policies as a guide.
• Define the actions: Using the locations, resources, and users identified previously, define
the actions that are required by each to achieve the workload outcomes over its life time
(development, operation, and decommission). Identify the actions based on the groups, not the
individual elements in the groups, in each location. Start broadly with read or write, then refine
down to specific actions to each service.
• Define the review period: Workloads and organizational requirements can change over time.
Define the workload review schedule to ensure it remains aligned with organizational priorities.
• Document the policies: Verify the policies that have been defined are accessible as required
by your organization. These policies are used to implement, maintain, and audit access of your
environments.
Resources
Related documents:
you can achieve this through the establishment of capability in cost optimization, as well as new
service and feature releases.
Targets are the quantifiable benchmarks you want to reach to meet your goals, and benchmarks
compare your actual results against a target. Establish benchmarks with KPIs for the cost per
unit of compute services (such as Spot adoption, Graviton adoption, latest instance types, and
On-Demand coverage), storage services (such as Amazon EBS gp3 adoption, obsolete EBS snapshots,
and Amazon S3 Standard storage), or database service usage (such as Amazon RDS open-source engines,
Graviton adoption, and On-Demand coverage). These benchmarks and KPIs can help you verify that
you use AWS services in the most cost-effective manner.
The following table provides a list of standard AWS metrics for reference. Each organization can
have different target values for these KPIs.
When tracking performance metrics within an organization, distinguish between different types of metrics that
serve distinct purposes. These metrics primarily measure the performance and efficiency of the
technical infrastructure rather than directly the overall business impact. For instance, they might
track server response times, network latency, or system uptime. These metrics are crucial to assess
how well the infrastructure supports the organization's technical operations. However, they don't
provide direct insight into broader business objectives like customer satisfaction, revenue growth,
or market share. To gain a comprehensive understanding of business performance, complement
these efficiency metrics with strategic business metrics that directly correlate with business
outcomes.
Establish near real-time visibility over your KPIs and related savings opportunities and track your
progress over time. To get started with the definition and tracking of KPI goals, we recommend the
KPI dashboard from Cloud Intelligence Dashboards (CID). Based on the data from Cost and Usage
Report (CUR), the KPI dashboard provides a series of recommended cost optimization KPIs, with
the ability to set custom goals and track progress over time.
If you have other solutions to set and track KPI goals, make sure these methods are adopted by all
cloud financial management stakeholders in your organization.
Implementation steps
• Define expected usage levels: To begin, focus on usage levels. Engage with the application
owners, marketing, and greater business teams to understand what the expected usage levels
are for the workload. How might customer demand change over time, and what can change due
to seasonal increases or marketing campaigns?
• Define workload resourcing and costs: With usage levels defined, quantify the changes in
workload resources required to meet those usage levels. You may need to increase the size or
number of resources for a workload component, increase data transfer, or change workload
components to a different service at a specific level. Specify the costs at each of these major
points, and predict the change in cost when there is a change in usage.
• Define business goals: Take the output from the expected changes in usage and cost, combine
this with expected changes in technology, or any programs that you are running, and develop
goals for the workload. Goals must address usage and cost, as well as the relationship between
the two. Goals must be simple, high-level, and help people understand what the business
expects in terms of outcomes (such as making sure unused resources are kept below certain cost
level). You don't need to define goals for each unused resource type or define costs that can
cause losses in goals and targets. Verify that there are organizational programs (for example,
Implement a structure of accounts that maps to your organization. This assists in allocating and
managing costs throughout your organization.
Implementation guidance
AWS Organizations allows you to create multiple AWS accounts which can help you centrally
govern your environment as you scale your workloads on AWS. You can model your organizational
hierarchy by grouping AWS accounts in an organizational unit (OU) structure and creating multiple
AWS accounts under each OU. To create an account structure, you need to decide which of your
AWS accounts will be the management account first. After that, you can create new AWS accounts
or select existing accounts as member accounts based on your designed account structure by
following management account best practices and member account best practices.
It is advised to always have at least one management account with one member account linked
to it, regardless of your organization's size or usage. All workload resources should reside only
within member accounts, and no resources should be created in the management account. There is
no one size fits all answer for how many AWS accounts you should have. Assess your current and
future operational and cost models to ensure that the structure of your AWS accounts reflects
your organization’s goals. Some companies create multiple AWS accounts for business reasons, for
example:
• Administrative or fiscal and billing isolation is required between organization units, cost centers,
or specific workloads.
• AWS service limits are set to be specific to particular workloads.
• There is a requirement for isolation and separation between workloads and resources.
Within AWS Organizations, consolidated billing creates the construct between one or more
member accounts and the management account. Member accounts allow you to isolate and
distinguish your cost and usage by groups. A common practice is to have separate member
accounts for each organization unit (such as finance, marketing, and sales), or for each environment
lifecycle (such as development, testing and production), or for each workload (workload a, b, and
c), and then aggregate these linked accounts using consolidated billing.
Consolidated billing allows you to consolidate payment for multiple member AWS accounts under
a single management account, while still providing visibility for each linked account’s activity.
AWS Control Tower can quickly set up and configure multiple AWS accounts, ensuring that
governance is aligned with your organization’s requirements.
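The following is a minimal Boto3 sketch of creating an organizational unit and requesting a new member account under it; the OU name and account email are placeholders, and the calls must be made from the management account.

    import time
    import boto3

    org = boto3.client("organizations")

    # Look up the organization root, then create an OU for a business unit.
    root_id = org.list_roots()["Roots"][0]["Id"]
    ou = org.create_organizational_unit(ParentId=root_id, Name="Finance")

    # Account creation is asynchronous; request it, then poll the status.
    status = org.create_account(
        Email="finance-prod@example.com",   # placeholder root email for the new account
        AccountName="finance-production",   # placeholder account name
    )["CreateAccountStatus"]

    while status["State"] == "IN_PROGRESS":
        time.sleep(10)
        status = org.describe_create_account_status(
            CreateAccountRequestId=status["Id"]
        )["CreateAccountStatus"]

    # Move the new member account under the business unit's OU.
    if status["State"] == "SUCCEEDED":
        org.move_account(
            AccountId=status["AccountId"],
            SourceParentId=root_id,
            DestinationParentId=ou["OrganizationalUnit"]["Id"],
        )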
Implementation steps
Resources
Related documents:
Implementation guidance
User roles and groups are fundamental building blocks in the design and implementation of secure
and efficient systems. Roles and groups help organizations balance the need for control with
the requirement for flexibility and productivity, ultimately supporting organizational objectives
and user needs. As recommended in Identity and access management section of AWS Well-
Architected Framework Security Pillar, you need robust identity management and permissions in
place to provide access to the right resources for the right people under the right conditions. Users
receive only the access necessary to complete their tasks. This minimizes the risk associated with
unauthorized access or misuse.
After you develop policies, you can create logical groups and user roles within your organization.
This allows you to assign permissions, control usage, and help implement robust access control
mechanisms, preventing unauthorized access to sensitive information. Begin with high-level
groupings of people. Typically, this aligns with organizational units and job roles (for example, a
systems administrator in the IT Department, financial controller, or business analysts). The groups
categorize people that do similar tasks and need similar access. Roles define what a group is allowed
to do. It is easier to manage permissions for groups and roles than for individual users. Roles and
groups assign permissions consistently and systematically across all users, preventing errors and
inconsistencies.
When a user’s role changes, administrators can adjust access at the role or group level, rather than
reconfiguring individual user accounts. For example, a systems administrator in IT requires access to
create all resources, but an analytics team member only needs to create analytics resources.
Implementation steps
• Implement groups: Using the groups of users defined in your organizational policies, implement
the corresponding groups, if necessary. For best practices on users, groups and authentication,
see the Security Pillar of the AWS Well-Architected Framework.
• Implement roles and policies: Using the actions defined in your organizational policies, create
the required roles and access policies. For best practices on roles and policies, see the Security
Pillar of the AWS Well-Architected Framework.
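As a minimal illustration of these steps, the following Boto3 sketch creates a group for an assumed analytics team and attaches a permissions policy to it; the group name and the actions in the policy document are placeholders that should come from your own organizational policies.

    import json
    import boto3

    iam = boto3.client("iam")

    # Group for the analytics team, per the groupings defined in your policies.
    iam.create_group(GroupName="analytics-team")

    # Permissions policy allowing the group to work with analytics services only;
    # the actions listed here are illustrative.
    policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["athena:*", "glue:Get*", "s3:GetObject", "s3:ListBucket"],
                "Resource": "*",
            }
        ],
    }
    policy = iam.create_policy(
        PolicyName="analytics-team-access",
        PolicyDocument=json.dumps(policy_document),
    )
    iam.attach_group_policy(
        GroupName="analytics-team",
        PolicyArn=policy["Policy"]["Arn"],
    )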
Resources
Related documents:
you identify anomalous spend and root causes, so you can quickly take action. First, create a cost
monitor in AWS Cost Anomaly Detection, then choose your alerting preference by setting up a
dollar threshold (such as an alert on anomalies with impact greater than $1,000). Once you receive
alerts, you can analyze the root cause behind the anomaly and impact on your costs. You can also
monitor and perform your own anomaly analysis in AWS Cost Explorer.
Enforce governance policies in AWS through AWS Identity and Access Management and AWS
Organizations Service Control Policies (SCP). IAM allows you to securely manage access to AWS
services and resources. Using IAM, you can control who can create or manage AWS resources,
the type of resources that can be created, and where they can be created. This minimizes the
possibility of resources being created outside of the defined policy. Use the roles and groups
created previously and assign IAM policies to enforce the correct usage. SCP offers central control
over the maximum available permissions for all accounts in your organization, helping your
accounts stay within your access control guidelines. SCPs are available only in an organization
that has all features turned on, and you can configure the SCPs to either deny or allow actions for
member accounts by default. For more details on implementing access management, see the Well-
Architected Security Pillar whitepaper.
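The following Boto3 sketch illustrates one way such a guardrail might be expressed as an SCP: it denies launching Amazon EC2 instances that are not tagged with a cost-center key and attaches the policy to a placeholder organizational unit. The policy content and OU ID are assumptions for illustration only.

    import json
    import boto3

    org = boto3.client("organizations")

    # Deny launching EC2 instances without a cost-center tag in member accounts.
    scp = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Action": "ec2:RunInstances",
                "Resource": "arn:aws:ec2:*:*:instance/*",
                "Condition": {"Null": {"aws:RequestTag/cost-center": "true"}},
            }
        ],
    }
    policy = org.create_policy(
        Name="require-cost-center-tag",
        Description="Deny EC2 launches without a cost-center tag",
        Type="SERVICE_CONTROL_POLICY",
        Content=json.dumps(scp),
    )

    # Attach the SCP to an organizational unit (placeholder OU ID).
    org.attach_policy(
        PolicyId=policy["Policy"]["PolicySummary"]["Id"],
        TargetId="ou-examp-12345678",
    )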
Implementation steps
• Implement notifications on spend: Using your defined organization policies, create AWS
Budgets to notify you when spending is outside of your policies. Configure multiple cost budgets,
one for each account, which notify you about overall account spending. Configure additional cost
budgets within each account for smaller units within the account. These units vary depending
on your account structure. Some common examples are AWS Regions, workloads (using tags),
or AWS services. Configure an email distribution list as the recipient for notifications, and not an
individual's email account. You can configure an actual budget for when an amount is exceeded,
or use a forecasted budget for notifying on forecasted usage. You can also preconfigure AWS
Budget Actions that can enforce specific IAM or SCP policies, or stop target Amazon EC2 or
Amazon RDS instances. Budget Actions can be started automatically or require workflow
approval.
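A minimal sketch of such a notification, using Boto3: a Region-scoped cost budget with a forecast-based alert sent to an email distribution list and an Amazon SNS topic. The account ID, Region filter, limit, and subscriber addresses are placeholders.

    import boto3

    budgets = boto3.client("budgets")

    # Per-Region cost budget with a forecast-based alert, so the team is warned
    # before the overspend actually happens.
    budgets.create_budget(
        AccountId="111122223333",  # placeholder member account ID
        Budget={
            "BudgetName": "us-east-1-monthly-cost",
            "BudgetType": "COST",
            "TimeUnit": "MONTHLY",
            "BudgetLimit": {"Amount": "5000", "Unit": "USD"},
            "CostFilters": {"Region": ["us-east-1"]},
        },
        NotificationsWithSubscribers=[
            {
                "Notification": {
                    "NotificationType": "FORECASTED",
                    "ComparisonOperator": "GREATER_THAN",
                    "Threshold": 100.0,
                    "ThresholdType": "PERCENTAGE",
                },
                "Subscribers": [
                    {"SubscriptionType": "EMAIL", "Address": "cloud-spend-alerts@example.com"},
                    {"SubscriptionType": "SNS",
                     "Address": "arn:aws:sns:us-east-1:111122223333:budget-alerts"},
                ],
            }
        ],
    )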
Track, measure, and audit the lifecycle of projects, teams, and environments to avoid using and
paying for unnecessary resources.
Implementation guidance
By effectively tracking the project lifecycle, organizations can achieve better cost control through
enhanced planning, management, and resource optimization. The insights gained through tracking
are invaluable for making informed decisions that contribute to the cost-effectiveness and overall
success of the project.
Tracking the entire lifecycle of the workload helps you understand when workloads or workload
components are no longer required. Existing workloads and components may appear to be in use,
but when AWS releases new services or features, those components can often be decommissioned and
the new options adopted.
Check the previous stages of workloads. After a workload is in production, previous environments
can be decommissioned or greatly reduced in capacity until they are required again.
You can tag resources with a timeframe or reminder to pin the time that the workload was
reviewed. For example, if the development environment was last reviewed months ago, it could be
a good time to review it again to explore if new services can be adopted or if the environment is
in use. You can group and tag your applications with myApplications on AWS to manage and track
metadata such as criticality, environment, last reviewed, and cost center. You can both track your
workload's lifecycle and monitor and manage the cost, health, security posture, and performance
of your applications.
AWS provides various management and governance services you can use for entity lifecycle
tracking. You can use AWS Config or AWS Systems Manager to provide a detailed inventory of
your AWS resources and configuration. It is recommended that you integrate with your existing
project or asset management systems to keep track of active projects and products within your
organization. Combining your current system with the rich set of events and metrics provided by
Related examples:
Related Tools
• AWS Config
• AWS Systems Manager
• AWS Budgets
• AWS Organizations
• AWS CloudFormation
Establish policies and procedures to monitor and appropriately allocate your costs. This permits
you to measure and improve the cost efficiency of this workload.
Best practices
• COST03-BP01 Configure detailed information sources
• COST03-BP02 Add organization information to cost and usage
• COST03-BP03 Identify cost attribution categories
• COST03-BP04 Establish organization metrics
• COST03-BP05 Configure billing and cost management tools
• COST03-BP06 Allocate costs based on workload metrics
Set up cost management and reporting tools for enhanced analysis and transparency of cost
and usage data. Configure your workload to create log entries that facilitate the tracking and
segregation of costs and usage.
Implementation guidance
Detailed billing information, such as hourly granularity in cost management tools, allows
organizations to track their consumption in more detail and helps them to identify some of
option if you want to quickly deploy a dashboard of your cost and usage data without the ability
for customization.
If desired, you can still export CUR in legacy mode, where you can integrate other processing
services such as AWS Glue to prepare the data for analysis and perform data analysis with Amazon
Athena using SQL to query the data.
Implementation steps
• Create data exports: Create customized exports with the data you want and control the schema
of your exports. Create billing and cost management data exports using basic SQL, and visualize
your billing and cost management data by integrating with Amazon QuickSight. You can also
export your data in standard mode to analyze your data with other processing tools like Amazon
Athena.
• Configure the cost and usage report: Using the billing console, configure at least one cost
and usage report. Configure a report with hourly granularity that includes all identifiers and
resource IDs. You can also create other reports with different granularities to provide higher-level
summary information.
• Configure hourly granularity in Cost Explorer: To access cost and usage data with hourly
granularity for the past 14 days, consider enabling hourly and resource level data in the billing
console (see the query sketch after this list).
• Configure application logging: Verify that your application logs each business outcome that
it delivers so it can be tracked and measured. Ensure that the granularity of this data is at least
hourly so it matches with the cost and usage data. For more details on logging and monitoring,
see Well-Architected Operational Excellence Pillar.
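Relating to the hourly-granularity step above, the following Boto3 sketch retrieves one day of hourly unblended cost grouped by service; the dates are placeholders, and hourly granularity must already be enabled in the Cost Explorer settings.

    import boto3

    ce = boto3.client("ce")

    # Hourly unblended cost for one day, grouped by service. Hourly data covers
    # only the trailing 14 days.
    response = ce.get_cost_and_usage(
        TimePeriod={"Start": "2024-06-01T00:00:00Z", "End": "2024-06-02T00:00:00Z"},
        Granularity="HOURLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )

    for result in response["ResultsByTime"]:
        for group in result["Groups"]:
            service = group["Keys"][0]
            amount = group["Metrics"]["UnblendedCost"]["Amount"]
            print(result["TimePeriod"]["Start"], service, amount)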
Resources
Related documents:
You can use tag policies in AWS Organizations to define rules for how tags can be used on AWS
resources in your accounts. Tag policies allow you to easily adopt a standardized approach for
tagging AWS resources.
AWS Tag Editor allows you to add, delete, and manage tags of multiple resources. With Tag Editor,
you search for the resources that you want to tag, and then manage tags for the resources in your
search results.
AWS Cost Categories allows you to assign organization meaning to your costs, without requiring
tags on resources. You can map your cost and usage information to unique internal organization
structures. You define category rules to map and categorize costs using billing dimensions, such as
accounts and tags. This provides another level of management capability in addition to tagging.
You can also map specific accounts and tags to multiple projects.
Implementation steps
• Define a tagging schema: Gather all stakeholders from across your business to define a schema.
This typically includes people in technical, financial, and management roles. Define a list of tags
that all resources must have, as well as a list of tags that resources should have. Verify that the
tag names and values are consistent across your organization.
• Tag resources: Using your defined cost attribution categories, place tags on all resources in your
workloads according to the categories. Use tools such as the CLI, Tag Editor, or AWS Systems
Manager to increase efficiency.
• Implement AWS Cost Categories: You can create Cost Categories without implementing
tagging. Cost categories use the existing cost and usage dimensions. Create category rules from
your schema and implement them into cost categories.
• Automate tagging: To verify that you maintain high levels of tagging across all resources,
automate tagging so that resources are automatically tagged when they are created. Use services
such as AWS CloudFormation to verify that resources are tagged when created. You can also
create a custom solution to tag automatically using Lambda functions or use a microservice that
scans the workload periodically and removes any resources that are not tagged, which is ideal for
test and development environments.
• Monitor and report on tagging: To verify that you maintain high levels of tagging across your
organization, report and monitor the tags across your workloads. You can use AWS Cost Explorer
to view the cost of tagged and untagged resources, or use services such as Tag Editor. Regularly
review the number of untagged resources and take action to add tags until you reach the desired
level of tagging.
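As an illustration of monitoring tag coverage, the following Boto3 sketch uses the Resource Groups Tagging API to report resources (among those the API returns, which are tagged or previously tagged resources) that are missing an assumed mandatory cost-center tag.

    import boto3

    tagging = boto3.client("resourcegroupstaggingapi")

    # Report resources missing the mandatory cost-center tag so owners can be
    # asked to tag them, or automation can tag or remove them.
    untagged = []
    paginator = tagging.get_paginator("get_resources")
    for page in paginator.paginate():
        for resource in page["ResourceTagMappingList"]:
            tag_keys = {tag["Key"] for tag in resource.get("Tags", [])}
            if "cost-center" not in tag_keys:
                untagged.append(resource["ResourceARN"])

    print(f"{len(untagged)} resources are missing the cost-center tag")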
Work with your finance team and other relevant stakeholders to understand the requirements
of how costs must be allocated within your organization during your regular cadence calls.
Workload costs must be allocated throughout the entire lifecycle, including development,
testing, production, and decommissioning. Understand how the costs incurred for learning, staff
development, and idea creation are attributed in the organization. This can be helpful to correctly
allocate accounts used for this purpose to training and development budgets instead of generic IT
cost budgets.
After defining your cost attribution categories with stakeholders in your organization, use AWS
Cost Categories to group your cost and usage information into meaningful categories in the AWS
Cloud, such as cost for a specific project, or AWS accounts for departments or business units. You
can create custom categories and map your cost and usage information into these categories based
on rules you define using various dimensions such as account, tag, service, or charge type. Once
cost categories are set up, you can view your cost and usage information by these categories, which
allows your organization to make better strategic and purchasing decisions. These categories are
visible in AWS Cost Explorer, AWS Budgets, and AWS Cost and Usage Report as well.
For example, create cost categories for your business units (DevOps team), and under each
category create multiple rules (rules for each sub category) with multiple dimensions (AWS
accounts, cost allocation tags, services or charge type) based on your defined groupings. With
cost categories, you can organize your costs using a rule-based engine. The rules that you
configure organize your costs into categories. Within these rules, you can filter using multiple
dimensions for each category, such as specific AWS accounts, AWS services, or charge types. You
can then use these categories across multiple products in the AWS Billing and Cost Management
and Cost Management console. This includes AWS Cost Explorer, AWS Budgets, AWS Cost and
Usage Report, and AWS Cost Anomaly Detection.
As an example, the following diagram displays how to group your costs and usage information in
your organization by having multiple teams (cost category), multiple environments (rules), and
each environment having multiple resources or assets (dimensions).
• Define AWS Cost Categories: Create cost categories with AWS Cost Categories to organize your
cost and usage information and map it into meaningful categories. A resource can be assigned to
multiple categories, so define as many categories as needed to manage your costs within the
categorized structure.
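A minimal Boto3 sketch of defining such a cost category; the category name, account IDs, and tag values are placeholders for your own groupings.

    import boto3

    ce = boto3.client("ce")

    # Group linked accounts and tagged usage into business-unit category values.
    ce.create_cost_category_definition(
        Name="BusinessUnit",
        RuleVersion="CostCategoryExpression.v1",
        Rules=[
            {
                "Value": "DevOps",
                "Rule": {
                    "Or": [
                        {"Dimensions": {"Key": "LINKED_ACCOUNT", "Values": ["111122223333"]}},
                        {"Tags": {"Key": "team", "Values": ["devops"]}},
                    ]
                },
            },
            {
                "Value": "Finance",
                "Rule": {"Dimensions": {"Key": "LINKED_ACCOUNT", "Values": ["444455556666"]}},
            },
        ],
    )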
Resources
Related documents:
Related examples:
• Organize your cost and usage data with AWS Cost Categories
• Managing your costs with AWS Cost Categories
• Well-Architected Labs: Cost and Usage Visualization
• Well-Architected Labs: Cost Categories
Establish the organization metrics that are required for this workload. Example metrics of a
workload are customer reports produced, or web pages served to customers.
Implementation guidance
To establish strong accountability, consider your account strategy first as part of your cost
allocation strategy. Get this right, and you may not need to go any further. Otherwise, cost
awareness suffers and further pain points can emerge.
To encourage accountability of cloud spend, grant users access to tools that provide visibility
into their costs and usage. AWS recommends that you configure all workloads and teams for the
following purposes:
• Organize: Establish your cost allocation and governance baseline with your own tagging strategy
and taxonomy. Create multiple AWS accounts with tools such as AWS Control Tower or AWS
Organizations. Tag the supported AWS resources and categorize them meaningfully based on
your organization structure (business units, departments, or projects). Tag account names with
their cost centers and map them with AWS Cost Categories to group each business unit's accounts
to its cost centers, so that the business unit owner can see the consumption of multiple accounts in
one place.
• Access: Track organization-wide billing information in consolidated billing. Verify the right
stakeholders and business owners have access.
• Control: Build effective governance mechanisms with the right guardrails to prevent unexpected
scenarios when using Service Control Policies (SCP), tag policies, IAM policies and budget alerts.
For example, you can use effective control mechanisms to allow teams to create specific resources
only in preferred Regions, and to prevent resource creation without a required tag (such as
cost-center).
• Current state: Configure a dashboard that shows current levels of cost and usage. The dashboard
should be available in a highly visible place within the work environment like an operations
dashboard. You can export data and use the Cost and Usage Dashboard from the AWS Cost
Optimization Hub or any supported product to create this visibility. You may need to create
different dashboards for different personas. For example, a manager's dashboard may differ from an
engineering dashboard.
• Notifications: Provide notifications when cost or usage exceeds defined limits and anomalies
occur with AWS Budgets or AWS Cost Anomaly Detection.
• Reports: Summarize all cost and usage information. Raise awareness and accountability of your
cloud spend with detailed, attributable cost data. Create reports that are relevant to the team
consuming them and contain recommendations.
• Configure AWS Cost Anomaly Detection: Use AWS Cost Anomaly Detection for your accounts,
core services or cost categories you created to monitor your cost and usage and detect unusual
spends. You can receive alerts individually or in aggregated reports, by email or through an
Amazon SNS topic, which allows you to analyze and determine the root cause of the anomaly
and identify the factor that is driving the cost increase.
• Use cost analysis tools: Configure AWS Cost Explorer for your workload and accounts to
visualize your cost data for further analysis. Create a dashboard for the workload that tracks
overall spend, key usage metrics for the workload, and forecast of future costs based on your
historical cost data.
• Use cost-saving analysis tools: Use AWS Cost Optimization Hub to identify savings
opportunities with tailored recommendations, including deleting unused resources, rightsizing,
Savings Plans, reservations, and AWS Compute Optimizer recommendations.
• Configure advanced tools: You can optionally create visuals to facilitate interactive analysis
and sharing of cost insights. With Data Exports on AWS Cost Optimization Hub, you can create a
cost and usage dashboard powered by Amazon QuickSight for your organization that provides
additional detail and granularity. You can also implement advanced analysis capability by using
data exports with Amazon Athena for advanced queries, and create dashboards on Amazon
QuickSight. Work with AWS Partners to adopt cloud management solutions for consolidated
cloud bill monitoring and optimization.
Resources
Related documents:
Related videos:
Implementation steps
• Allocate costs to workload metrics: Using the defined metrics and configured tags, create
a metric that combines the workload output and workload cost. Use analytics services such
as Amazon Athena and Amazon QuickSight to create an efficiency dashboard for the overall
workload and any components.
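As one possible illustration of this step, the following Boto3 sketch submits an Amazon Athena query that divides hourly cost by a business metric; the database, table, and column names and the S3 results location are hypothetical and would map to your own CUR and metrics data.

    import boto3

    athena = boto3.client("athena")

    # Join hourly cost (grouped by the workload tag) against a business metric
    # table to compute cost per business outcome.
    query = """
        SELECT c.usage_hour,
               c.cost_usd / NULLIF(m.reports_produced, 0) AS cost_per_report
        FROM workload_hourly_cost c
        JOIN workload_hourly_metrics m ON c.usage_hour = m.usage_hour
        ORDER BY c.usage_hour
    """
    athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "cur_database"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )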
Resources
Related documents:
Related examples:
• Improve cost visibility of Amazon ECS and AWS Batch with AWS Split Cost Allocation Data
Implement change control and resource management from project inception to end-of-life. This
ensures that you shut down or terminate unused resources to reduce waste.
Best practices
• COST04-BP01 Track resources over their lifetime
• COST04-BP02 Implement a decommissioning process
• COST04-BP03 Decommission resources
• COST04-BP04 Decommission resources automatically
• COST04-BP05 Enforce data retention policies
Define and implement a method to track resources and their associations with systems over their
lifetime. You can use tagging to identify the workload or function of the resource.
Related videos:
Related examples:
Implementation guidance
Implement a standardized process across your organization to identify and remove unused
resources. The process should define how frequently searches are performed and the process used
to remove resources, to verify that all organization requirements are met.
Implementation steps
• Create and implement a decommissioning process: Work with the workload developers and
owners to build a decommissioning process for the workload and its resources. The process
should cover the method to verify if the workload is in use, and also if each of the workload
resources are in use. Detail the steps necessary to decommission the resource, removing them
from service while ensuring compliance with any regulatory requirements. Any associated
resources should be included, such as licenses or attached storage. Notify the workload owners
that the decommissioning process has been started.
If the resource is an object in Amazon S3 Glacier storage and if you delete an archive before
meeting the minimum storage duration, you will be charged a prorated early deletion fee.
Amazon S3 Glacier minimum storage duration depends on the storage class used. For a summary
of minimum storage duration for each storage class, see Performance across the Amazon S3
storage classes. For detail on how early deletion fees are calculated, see Amazon S3 pricing.
The following simple decommissioning process flowchart outlines the decommissioning steps.
Before decommissioning resources, verify that resources you have identified for decommissioning
are not being used by the organization.
Resources
Related documents:
• AWS CloudTrail
Related videos:
Related examples:
Design your workload to gracefully handle resource termination as you identify and decommission
non-critical resources, resources that are not required, or resources with low utilization.
Implementation guidance
Use automation to reduce or remove the associated costs of the decommissioning process.
Designing your workload to perform automated decommissioning will reduce the overall workload
costs during its lifetime. You can use Amazon EC2 Auto Scaling or Application Auto Scaling to
perform the decommissioning process. You can also implement custom code using the API or SDK
to decommission workload resources automatically.
Modern applications are built serverless-first, a strategy that prioritizes the adoption of serverless
services. AWS developed serverless services for all three layers of your stack: compute, integration,
and data stores. Using serverless architecture will allow you to save costs during low-traffic periods
by scaling up and down automatically.
Implementation steps
• Implement Amazon EC2 Auto Scaling or Application Auto Scaling: For resources that are
supported, configure them with Amazon EC2 Auto Scaling or Application Auto Scaling. These
services can help you optimize your utilization and cost efficiencies when consuming AWS
services. When demand drops, these services will automatically remove any excess resource
capacity so you avoid overspending.
• Configure CloudWatch to terminate instances: Instances can be configured to terminate
using CloudWatch alarms. Using the metrics from the decommissioning process, implement an
alarm with an Amazon Elastic Compute Cloud action. Verify the operation in a non-production
environment before rolling out.
• Implement code within the workload: You can use the AWS SDK or AWS CLI to decommission
workload resources. Implement code within the application that integrates with AWS and
terminates or removes resources that are no longer used.
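A minimal Boto3 sketch of the CloudWatch alarm approach described above, using the built-in Amazon EC2 terminate action when an instance has been effectively idle for 24 hours; the instance ID, Region, and thresholds are placeholders, and the alarm should be validated in a non-production environment first.

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Terminate an instance whose maximum CPU stays below 2% for 24 consecutive
    # one-hour periods, using the built-in EC2 alarm action.
    cloudwatch.put_metric_alarm(
        AlarmName="terminate-idle-dev-instance",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
        Statistic="Maximum",
        Period=3600,
        EvaluationPeriods=24,
        Threshold=2.0,
        ComparisonOperator="LessThanThreshold",
        AlarmActions=["arn:aws:automate:us-east-1:ec2:terminate"],
    )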
You can use Amazon Data Lifecycle Manager to automate the creation and deletion of Amazon Elastic Block Store snapshots and
Amazon EBS-backed Amazon Machine Images (AMIs), and use Amazon S3 Intelligent-Tiering or an
Amazon S3 lifecycle configuration to manage the lifecycle of your Amazon S3 objects. You can also
implement custom code using the API or SDK to create lifecycle policies and policy rules for objects
to be deleted automatically.
Implementation steps
• Use Amazon Data Lifecycle Manager: Use lifecycle policies on Amazon Data Lifecycle Manager
to automate deletion of Amazon EBS snapshots and Amazon EBS-backed AMIs.
• Set up lifecycle configuration on a bucket: Use Amazon S3 lifecycle configuration on a bucket
to define actions for Amazon S3 to take during an object's lifecycle, as well as deletion at the end
of the object's lifecycle, based on your business requirements.
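A minimal Boto3 sketch of the Amazon S3 lifecycle configuration step; the bucket name, prefix, and retention periods are placeholders to adjust to your data retention policies.

    import boto3

    s3 = boto3.client("s3")

    # Transition logs to infrequent access after 30 days, then expire them after
    # 365 days.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-log-bucket",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "expire-old-logs",
                    "Status": "Enabled",
                    "Filter": {"Prefix": "logs/"},
                    "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
                    "Expiration": {"Days": 365},
                }
            ]
        },
    )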
Resources
Related documents:
Related videos:
Related examples:
Cost-effective resources
Questions
• COST 5. How do you evaluate cost when you select services?
When selecting services for your workload, it is key that you understand your organization’s
priorities. Create a balance between cost optimization and other AWS Well-Architected Framework
pillars, such as performance and reliability. This process should be conducted systematically
and regularly to reflect changes in the organization's objectives, market conditions, and
operational dynamics. A fully cost-optimized workload is the solution that is most aligned to
your organization’s requirements, not necessarily the lowest cost. Meet with all teams in your
organization, such as product, business, technical, and finance to collect information. Evaluate the
impact of tradeoffs between competing interests or alternative approaches to help make informed
decisions when determining where to focus efforts or choosing a course of action.
For example, accelerating speed to market for new features may be emphasized over cost
optimization, or you may choose a relational database for non-relational data to simplify the
effort to migrate a system, rather than migrating to a database optimized for your data type and
updating your application.
Implementation steps
• Identify organization requirements for cost: Meet with team members from your organization,
including those in product management, application owners, development and operational
teams, management, and financial roles. Prioritize the Well-Architected pillars for this workload
and its components. The output should be a list of the pillars in order. You can also add a weight
to each pillar to indicate how much additional focus it has, or how similar the focus is between
two pillars.
• Address the technical debt and document it: During the workload review, address the technical
debt. Document a backlog item to revisit the workload in the future, with the goal of refactoring
or re-architecting to optimize it further. It's essential to clearly communicate the trade-offs that
were made to other stakeholders.
Resources
• REL11-BP07 Architect your product to meet availability targets and uptime service level
agreements (SLAs)
• OPS01-BP06 Evaluate tradeoffs
AWS Cost Explorer and the AWS Cost and Usage Reports (CUR) can analyze the cost of a proof
of concept (PoC) or running environment. You can also use AWS Pricing Calculator to estimate
workload costs.
Write a workflow to be followed by technical teams to review their workloads. Keep this workflow
simple, but also cover all the necessary steps to make sure the teams understand each component
of the workload and its pricing. Your organization can then follow and customize this workflow
based on the specific needs of each team.
1. List each service in use for your workload: This is a good starting point. Identify all of the
services currently in use and where costs originate from.
2. Understand how pricing works for those services: Understand the pricing model of each
service. Different AWS services have different pricing models based on factors like usage volume,
data transfer, and feature-specific pricing.
3. Focus on the services that have unexpected workload costs and that do not align with
your expected usage and business outcome: Identify outliers or services where the cost is
not proportional to the value or usage by using AWS Cost Explorer or AWS Cost and Usage
Reports. It's important to correlate costs with business outcomes to prioritize optimization
efforts.
4. Use AWS Cost Explorer, CloudWatch Logs, VPC Flow Logs, and Amazon S3 Storage Lens to
understand the root cause of those high costs: These tools are instrumental in the diagnosis of
high costs. Each service offers a different lens to view and analyze usage and costs. For instance,
Cost Explorer helps determine overall cost trends, CloudWatch Logs provides operational
insights, VPC Flow Logs displays IP traffic, and Amazon S3 Storage Lens is useful for storage
analytics.
5. Use AWS Budgets to set budgets for certain amounts for services or accounts: Setting budgets
is a proactive way to manage costs. Use AWS Budgets to set custom budget thresholds and
receive alerts when costs exceed those thresholds.
6. Configure Amazon CloudWatch alarms to send billing and usage alerts: Set up monitoring
and alerts for cost and usage metrics. CloudWatch alarms can notify you when certain
thresholds are breached, which improves intervention response time.
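As an illustration of step 6, the following Boto3 sketch creates a CloudWatch billing alarm; the threshold and SNS topic are placeholders, billing metrics are published only in us-east-1, and billing alerts must be enabled in the account's billing preferences.

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="monthly-estimated-charges",
        Namespace="AWS/Billing",
        MetricName="EstimatedCharges",
        Dimensions=[{"Name": "Currency", "Value": "USD"}],
        Statistic="Maximum",
        Period=21600,               # evaluated over six-hour periods
        EvaluationPeriods=1,
        Threshold=5000.0,           # placeholder monthly threshold in USD
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:111122223333:billing-alerts"],  # placeholder topic
    )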
Facilitate notable enhancements and financial savings over time through strategic review of all
workload components, irrespective of their present attributes. The effort invested in this review
process should be deliberate, with careful consideration of the potential advantages that might be
realized.
to lift and shift (also known as rehost) your databases from your on-premises environment to the
cloud as rapidly as possible and optimize later. It is worth exploring the possible savings attained
by using managed services on AWS that may remove or reduce license costs. Managed services on
AWS remove the operational and administrative burden of maintaining a service, such as patching
or upgrading the OS, and allow you to focus on innovation and business.
Since managed services operate at cloud scale, they can offer a lower cost per transaction or
service. You can make potential optimizations in order to achieve some tangible benefit, without
changing the core architecture of the application. For example, you may be looking to reduce the
amount of time you spend managing database instances by migrating to a database-as-a-service
platform like Amazon Relational Database Service (Amazon RDS) or migrating your application to a
fully managed platform like AWS Elastic Beanstalk.
Usually, managed services have attributes that you can set to ensure sufficient capacity. You
must set and monitor these attributes so that your excess capacity is kept to a minimum and
performance is maximized. You can modify the attributes of AWS Managed Services using the
AWS Management Console or AWS APIs and SDKs to align resource needs with changing demand.
For example, you can increase or decrease the number of nodes on an Amazon EMR cluster (or an
Amazon Redshift cluster) to scale out or in.
You can also pack multiple instances on an AWS resource to activate higher density usage. For
example, you can provision multiple small databases on a single Amazon Relational Database
Service (Amazon RDS) database instance. As usage grows, you can migrate one of the databases to
a dedicated Amazon RDS database instance using a snapshot and restore process.
When provisioning workloads on managed services, you must understand the requirements of
adjusting the service capacity. These requirements are typically time, effort, and any impact to
normal workload operation. The provisioned resource must allow time for any changes to occur,
so provision the required overhead to allow for this. The ongoing effort required to modify services
can be reduced to virtually zero by using APIs and SDKs that are integrated with system and
monitoring tools, such as Amazon CloudWatch.
Amazon RDS, Amazon Redshift, and Amazon ElastiCache provide a managed database service.
Amazon Athena, Amazon EMR, and Amazon OpenSearch Service provide a managed analytics
service.
AMS is a service that operates AWS infrastructure on behalf of enterprise customers and partners.
It provides a secure and compliant environment that you can deploy your workloads onto. AMS
• Consolidate data from identical SQL Server databases into a single Amazon RDS for SQL Server
database using AWS DMS
• Deliver data at scale to Amazon Managed Streaming for Apache Kafka (Amazon MSK)
• Migrate an ASP.NET web application to AWS Elastic Beanstalk
Open-source software eliminates software licensing costs, which can contribute significant costs to
workloads. Where licensed software is required, avoid licenses bound to arbitrary attributes such
as CPUs; instead, look for licenses that are bound to output or outcomes. The cost of these licenses scales
more closely to the benefit they provide.
Implementation guidance
Open source originated in the context of software development to indicate that the software
complies with certain free distribution criteria. Open source software is composed of source code
that anyone can inspect, modify, and enhance. Based on business requirements, skill of engineers,
forecasted usage, or other technology dependencies, organizations can consider using open source
software on AWS to minimize their license costs. In other words, the cost of software licenses can
be reduced through the use of open source software. This can have significant impact on workload
costs as the size of the workload scales.
Measure the benefits of licensed software against the total cost to optimize your workload. Model
any changes in licensing and how they would impact your workload costs. If a vendor changes the
cost of your database license, investigate how that impacts the overall efficiency of your workload.
Consider historical pricing announcements from your vendors for trends of licensing changes
across their products. Licensing costs may also scale independently of throughput or usage, such
as licenses that scale by hardware (CPU bound licenses). These licenses should be avoided because
costs can rapidly increase without corresponding outcomes.
For instance, operating an Amazon EC2 instance in us-east-1 with a Linux operating system allows
you to cut costs by approximately 45%, compared to running another Amazon EC2 instance that
runs on Windows.
The AWS Pricing Calculator offers a comprehensive way to compare the costs of various resources
with different license options, such as Amazon RDS instances and different database engines.
Implementation guidance
Consider the cost of services and options when selecting all components. This includes using
application level and managed services, such as Amazon Relational Database Service (Amazon
RDS), Amazon DynamoDB, Amazon Simple Notification Service (Amazon SNS), and Amazon Simple
Email Service (Amazon SES) to reduce overall organization cost.
Use serverless and containers for compute, such as AWS Lambda and Amazon Simple Storage
Service (Amazon S3) for static websites. Containerize your application if possible and use AWS
Managed Container Services such as Amazon Elastic Container Service (Amazon ECS) or Amazon
Elastic Kubernetes Service (Amazon EKS).
Minimize license costs by using open-source software, or software that does not have license fees
(for example, Amazon Linux for compute workloads, or migrating databases to Amazon Aurora).
You can use serverless or application-level services such as Lambda, Amazon Simple Queue Service
(Amazon SQS), Amazon SNS, and Amazon SES. These services remove the need for you to manage
a resource and provide the function of code execution, queuing services, and message delivery. The
other benefit is that they scale in performance and cost in line with usage, allowing efficient cost
allocation and attribution.
Using event-driven architecture is also possible with serverless services. Event-driven architectures
are push-based, so everything happens on demand as the event presents itself in the router.
This way, you’re not paying for continuous polling to check for an event. This means less
network bandwidth consumption, less CPU utilization, less idle fleet capacity, and fewer SSL/TLS
handshakes.
For more information on serverless, see Well-Architected Serverless Application lens whitepaper.
Implementation steps
• Select each service to optimize cost: Using your prioritized list and analysis, select each option
that provides the best match with your organizational priorities. Instead of increasing the
capacity to meet the demand, consider other options which may give you better performance
with lower cost. For example, if your databases on AWS need to serve higher expected traffic,
consider either increasing the instance size or using Amazon ElastiCache (Redis or Memcached) to
provide a caching layer for your databases.
A common trigger for review is a change in usage patterns. Significant changes in usage can indicate that alternate services
would be more optimal.
If you need to move data into the AWS Cloud, you can select from a wide variety of services AWS offers
and partner tools to help you migrate your data sets, whether they are files, databases, machine
images, block volumes, or even tape backups. For example, to move a large amount of data to
and from AWS or process data at the edge, you can use one of the AWS purpose-built devices
to cost effectively move petabytes of data offline. Another example: for higher data transfer
rates, AWS Direct Connect may be cheaper than a VPN while providing the required consistent
connectivity for your business.
Based on the cost analysis for different usage over time, review your scaling activity. Analyze
the result to see if the scaling policy can be tuned to add instances with multiple instance types
and purchase options. Review your settings to see if the minimum can be reduced to serve user
requests but with a smaller fleet size, and add more resources to meet the expected high demand.
Perform cost analysis for different usage over time by discussing with stakeholders in your
organization and use AWS Cost Explorer’s forecast feature to predict the potential impact of
service changes. Monitor usage levels using AWS Budgets, CloudWatch billing alarms, and
AWS Cost Anomaly Detection to identify and implement the most cost-effective services sooner.
Implementation steps
• Define predicted usage patterns: Working with your organization, such as marketing and
product owners, document what the expected and predicted usage patterns will be for the
workload. Discuss with business stakeholders about both historical and forecasted cost and
usage increases and make sure increases align with business requirements. Identify calendar
days, weeks, or months where you expect more users to use your AWS resources, which indicate
that you should increase the capacity of the existing resources or adopt additional services to
reduce the cost and increase performance.
• Perform cost analysis at predicted usage: Using the usage patterns defined, perform analysis
at each of these points. The analysis effort should reflect the potential outcome. For example,
if the change in usage is large, a thorough analysis should be performed to verify any costs and
changes. In other words, when cost increases, usage should increase for business as well.
Resources
Related documents:
Implementation guidance
Perform cost modelling for your workload and each of its components to understand the balance
between resources, and find the correct size for each resource in the workload, given a specific level
of performance. Understanding cost considerations can inform your organizational business case
and decision-making process when evaluating the value realization outcomes for planned workload
deployment.
Perform benchmark activities for the workload under different predicted loads and compare the
costs. The modelling effort should reflect potential benefit; for example, time spent is proportional
to component cost or predicted saving. For best practices, refer to the Review section of the
Performance Efficiency Pillar of the AWS Well-Architected Framework.
As an example, to create cost modeling for a workload consisting of compute resources, AWS
Compute Optimizer can assist with cost modelling for running workloads. It provides right-
sizing recommendations for compute resources based on historical usage. Make sure CloudWatch
Agents are deployed to the Amazon EC2 instances to collect memory metrics which help you with
more accurate recommendations within AWS Compute Optimizer. This is the ideal data source
for compute resources because it is a free service that uses machine learning to make multiple
recommendations depending on levels of risk.
There are multiple services you can use with custom logs as data sources for rightsizing operations
for other services and workload components, such as AWS Trusted Advisor, Amazon CloudWatch
and Amazon CloudWatch Logs. AWS Trusted Advisor checks resources and flags resources with low
utilization which can help you right size your resources and create cost modelling.
The following are recommendations for cost modelling data and metrics:
• The monitoring must accurately reflect the user experience. Select the correct granularity for the
time period and thoughtfully choose the maximum or 99th percentile instead of the average.
• Select the correct granularity for the time period of analysis that is required to cover any
workload cycles. For example, if a two-week analysis is performed, you might be overlooking a
monthly cycle of high utilization, which could lead to under-provisioning.
• Choose the right AWS services for your planned workload by considering your existing
commitments, selected pricing models for other workloads, and ability to innovate faster and
focus on your core business value.
Implementation steps
Implementation guidance
Amazon EC2 provides a wide selection of instance types with different levels of CPU, memory,
storage, and networking capacity to fit different use cases. These instance types feature different
blends of CPU, memory, storage, and networking capabilities, giving you versatility when selecting
the right resource combination for your projects. Every instance type comes in multiple sizes,
so that you can adjust your resources based on your workload’s demands. To determine which
instance type you need, gather details about the system requirements of the application or
software that you plan to run on your instance. These details should include the following:
• Operating system
• Number of CPU cores
• GPU cores
• Amount of system memory (RAM)
• Storage type and space
• Network bandwidth requirement
Identify the purpose of compute requirements and which instance is needed, and then explore the
various Amazon EC2 instance families. Amazon offers the following instance type families:
• General Purpose
• Compute Optimized
• Memory Optimized
• Storage Optimized
• Accelerated Computing
• HPC Optimized
For a deeper understanding of the specific purposes and use cases that a particular Amazon EC2
instance family can fulfill, see AWS Instance types.
Gathering system requirements is critical for selecting the specific instance family and instance
type that best serves your needs. Instance type names combine the family name and the instance
size. For example, the t2.micro instance is from the T2 family and is micro-sized.
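To gather these details programmatically, you can query instance type specifications with the AWS SDK. The following is a minimal sketch using the AWS SDK for Python (Boto3); the instance types listed are illustrative choices, not a recommendation.

```python
import boto3

ec2 = boto3.client("ec2")

# Compare the vCPU, memory, and network profile of a few candidate instance types.
response = ec2.describe_instance_types(InstanceTypes=["t3.micro", "m5.large", "r5.large"])

for it in response["InstanceTypes"]:
    print(
        it["InstanceType"],
        it["VCpuInfo"]["DefaultVCpus"], "vCPUs,",
        it["MemoryInfo"]["SizeInMiB"], "MiB memory,",
        it["NetworkInfo"]["NetworkPerformance"],
    )
```

Comparing these attributes against the gathered system requirements narrows the search to one or two instance families before any cost modeling begins.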
Select resource size or type based on workload and resource characteristics (for example, compute,
memory, throughput, or write intensive). This selection is typically made using cost modeling and
metrics from the running workload across compute, storage, data, and networking services. It can
be implemented with a feedback loop such as automatic scaling or by custom code in the workload.
Implementation guidance
Create a feedback loop within the workload that uses active metrics from the running workload to
make changes to that workload. You can use a managed service, such as AWS Auto Scaling, which
you configure to perform the right sizing operations for you. AWS also provides APIs, SDKs, and
features that allow resources to be modified with minimal effort. You can program a workload to
stop-and-start an Amazon EC2 instance to allow a change of instance size or instance type. This
provides the benefits of right-sizing while removing almost all the operational cost required to
make the change.
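As a minimal sketch of this stop-and-start pattern using the AWS SDK for Python (Boto3), assuming a placeholder instance ID and that the new instance type has already been selected through cost modeling:

```python
import boto3

ec2 = boto3.client("ec2")
INSTANCE_ID = "i-0123456789abcdef0"  # placeholder instance ID

# Right size an existing instance: stop it, change the instance type, then start it again.
ec2.stop_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])

ec2.modify_instance_attribute(
    InstanceId=INSTANCE_ID,
    InstanceType={"Value": "m5.large"},  # the new, right-sized instance type
)

ec2.start_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])
```

Run this kind of change during a maintenance window, because the instance is unavailable while it is stopped.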
Some AWS services have built-in automatic type or size selection, such as Amazon Simple Storage
Service Intelligent-Tiering. Amazon S3 Intelligent-Tiering automatically moves your data between
two access tiers, frequent access and infrequent access, based on your usage patterns.
Implementation steps
• Increase your observability by configuring workload metrics: Capture key metrics for the
workload. These metrics provide an indication of the customer experience, such as workload
output, and align to the differences between resource types and sizes, such as CPU and memory
usage. For compute resources, analyze performance data to right size your Amazon EC2 instances.
Identify idle instances and ones that are underutilized. Key metrics to look for are CPU usage
and memory utilization (for example, 40% CPU utilization 90% of the time, as explained in
Rightsizing with AWS Compute Optimizer and Memory Utilization Enabled). Identify instances
with a maximum CPU usage and memory utilization of less than 40% over a four-week period.
These are the instances to right size to reduce costs. For storage resources such as Amazon
S3, you can use Amazon S3 Storage Lens, which allows you to see 28 metrics across various
categories at the bucket level, and 14 days of historical data in the dashboard by default. You can
filter your Amazon S3 Storage Lens dashboard by summary and cost optimization or events to
analyze specific metrics.
• View rightsizing recommendations: Use the rightsizing recommendations in AWS Compute
Optimizer and the Amazon EC2 rightsizing tool in the Cost Management console, or review
AWS Trusted Advisor right-sizing checks, to make adjustments to your workload. It is important
to use the right tools when right-sizing different resources and to follow right-sizing guidelines,
whether the resource is an Amazon EC2 instance, an AWS storage class, or an Amazon RDS instance.
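The recommendations can also be retrieved programmatically. The following is a minimal, unofficial sketch using the AWS SDK for Python (Boto3); Compute Optimizer must already be opted in, and the response fields shown may need adjustment for your account.

```python
import boto3

compute_optimizer = boto3.client("compute-optimizer")

next_token = None
while True:
    kwargs = {"maxResults": 100}
    if next_token:
        kwargs["nextToken"] = next_token
    response = compute_optimizer.get_ec2_instance_recommendations(**kwargs)
    for rec in response.get("instanceRecommendations", []):
        best_option = rec["recommendationOptions"][0]  # options are ranked, best first
        print(rec["currentInstanceType"], rec["finding"], "->", best_option["instanceType"])
    next_token = response.get("nextToken")
    if not next_token:
        break
```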
Related videos:
Related examples:
• Attribute based Instance Type Selection for Auto Scaling for Amazon EC2 Fleet
• Optimizing Amazon Elastic Container Service for cost using scheduled scaling
• Predictive scaling with Amazon EC2 Auto Scaling
• Optimize Costs and Gain Visibility into Usage with Amazon S3 Storage Lens
• Well-Architected Labs: Rightsizing Recommendations (Level 100)
For already-deployed services at the organization level for multiple business units, consider using
shared resources to increase utilization and reduce total cost of ownership (TCO). Using shared
resources can be a cost-effective option to centralize the management and costs by using existing
solutions, sharing components, or both. Manage common functions like monitoring, backups, and
connectivity either within an account boundary or in a dedicated account. You can also reduce cost
by implementing standardization, reducing duplication, and reducing complexity.
Implementation guidance
Where multiple workloads perform the same function, use existing solutions and shared components
to improve management and optimize costs. Consider using existing resources (especially shared
ones), such as non-production database servers or directory services, to reduce cloud costs while
following security best practices and organizational regulations. For optimal value realization and
efficiency, it is crucial to allocate costs back (using showback and chargeback) to the pertinent
areas of the business driving consumption.
You should know where you have incurred costs at the resource, workload, team, or organization
level, as this knowledge enhances your understanding of the value delivered at the applicable level
when compared to the business outcomes achieved. Ultimately, organizations benefit from cost
savings as a result of sharing cloud infrastructure. Encourage cost allocation on shared cloud
resources to optimize cloud spend.
Implementation steps
• Evaluate existing resources: Review existing workloads that use services similar to those your
workload needs. Depending on the workload's components, consider existing platforms if business
logic or technical requirements allow.
• Use resource sharing in AWS RAM and restrict accordingly: Use AWS RAM to share resources
with other AWS accounts within your organization. When you share resources, you don’t need
to duplicate resources in multiple accounts, which minimizes the operational burden of resource
maintenance. This process also helps you securely share the resources that you have created with
roles and users in your account, as well as with other AWS accounts (a combined sketch follows
these steps).
• Tag resources: Tag resources that are candidates for cost reporting and categorize them within
cost categories. Activate these cost related resource tags for cost allocation to provide visibility
of AWS resources usage. Focus on creating an appropriate level of granularity with respect to
cost and usage visibility, and influence cloud consumption behaviors through cost allocation
reporting and KPI tracking.
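As a combined sketch of the sharing and tagging steps above, using the AWS SDK for Python (Boto3); the resource ARN, account ID, and tag key are placeholders for illustration.

```python
import boto3

ram = boto3.client("ram")
ce = boto3.client("ce")  # Cost Explorer

# Share a resource (here a placeholder subnet ARN) with another account in your organization.
ram.create_resource_share(
    name="shared-network",
    resourceArns=["arn:aws:ec2:us-east-1:123456789012:subnet/subnet-0abc1234"],
    principals=["111122223333"],    # an account ID, or an organization or OU ARN
    allowExternalPrincipals=False,  # keep sharing inside the organization
)

# Activate a user-defined tag key as a cost allocation tag so shared usage can be charged back.
ce.update_cost_allocation_tags_status(
    CostAllocationTagsStatus=[{"TagKey": "CostCenter", "Status": "Active"}]
)
```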
Resources
Related documents:
Related videos:
determine the most appropriate pricing model. Often your pricing model consists of a combination
of multiple options, as determined by your availability requirements and usage patterns.
On-Demand Instances allow you to pay for compute or database capacity by the hour or by
the second (60 seconds minimum), depending on which instances you run, without long-term
commitments or upfront payments.
Savings Plans are a flexible pricing model that offers low prices on Amazon EC2, AWS Lambda, and
AWS Fargate (Fargate) usage, in exchange for a commitment to a consistent amount of usage
(measured in dollars per hour) over a one-year or three-year term.
Spot Instances are an Amazon EC2 pricing mechanism that allows you to request spare compute
capacity at a discounted hourly rate (up to 90% off the On-Demand price) without an upfront
commitment.
Reserved Instances provide a discount of up to 75 percent in exchange for prepaying for capacity.
For more details, see Optimizing costs with reservations.
You might choose to include a Savings Plan for the resources associated with the production,
quality, and development environments. Alternatively, because sandbox resources are only
powered on when needed, you might choose an On-Demand model for the resources in that
environment. Use Amazon EC2 Spot Instances to reduce Amazon EC2 costs, or use Compute Savings
Plans to reduce Amazon EC2, Fargate, and Lambda costs. The AWS Cost Explorer recommendations
tool identifies opportunities for commitment discounts with Savings Plans.
If you have been purchasing Reserved Instances for Amazon EC2 in the past or have established
cost allocation practices inside your organization, you can continue using Amazon EC2 Reserved
Instances for the time being. However, we recommend working on a strategy to use Savings
Plans in the future as a more flexible cost savings mechanism. You can refresh Savings Plans (SP)
Recommendations in AWS Cost Management to generate new Savings Plans Recommendations
at any time. Use Reserved Instances (RI) to reduce Amazon RDS, Amazon Redshift, Amazon
ElastiCache, and Amazon OpenSearch Service costs. Savings Plans and Reserved Instances
are available in three payment options: all upfront, partial upfront, and no upfront. Use the
recommendations provided in AWS Cost Explorer RI and SP purchase recommendations.
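These recommendations are also available through the Cost Explorer API. The following is a minimal sketch using the AWS SDK for Python (Boto3); the summary fields shown are examples and should be checked against the response your account returns.

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

response = ce.get_savings_plans_purchase_recommendation(
    SavingsPlansType="COMPUTE_SP",      # Compute Savings Plans cover EC2, Fargate, and Lambda
    TermInYears="ONE_YEAR",
    PaymentOption="NO_UPFRONT",
    LookbackPeriodInDays="THIRTY_DAYS",
)

summary = response["SavingsPlansPurchaseRecommendation"].get(
    "SavingsPlansPurchaseRecommendationSummary", {}
)
print("Recommended hourly commitment:", summary.get("HourlyCommitmentToPurchase"))
print("Estimated monthly savings:", summary.get("EstimatedMonthlySavingsAmount"))
```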
To find opportunities for Spot workloads, use an hourly view of your overall usage, and look
for regular periods of changing usage or elasticity. You can use Spot Instances for various fault-
tolerant and flexible applications. Examples include stateless web servers, API endpoints, big data
and analytics applications, containerized workloads, CI/CD, and other flexible workloads.
Related videos:
Related examples:
Resource pricing may be different in each Region. Identify Regional cost differences, and deploy in
higher-cost Regions only when needed to meet latency, data residency, and data sovereignty
requirements. Factoring in Region cost helps you pay the lowest overall price for this workload.
Implementation guidance
The AWS Cloud infrastructure is global, hosted in multiple locations worldwide, and built around
AWS Regions, Availability Zones, Local Zones, AWS Outposts, and Wavelength Zones. A Region
is a physical location in the world, and each Region is a separate geographic area where AWS has
multiple Availability Zones. Availability Zones, which are multiple isolated locations within each
Region, consist of one or more discrete data centers, each with redundant power, networking, and
connectivity.
Each AWS Region operates within local market conditions, and resource pricing differs between
Regions due to differences in the cost of land, fiber, electricity, and taxes, for example. Choose
a specific Region to operate a component of your solution, or your entire solution, so that you can
run it at the lowest possible price globally. Use the AWS Pricing Calculator to estimate the costs of
your workload in various Regions by searching for services by location type (Region, Wavelength
Zone, and Local Zone) and Region.
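You can also compare Regional prices programmatically with the AWS Price List API. The following is a minimal sketch using the AWS SDK for Python (Boto3); the filters and JSON navigation reflect the typical price list structure and may need adjustment for other services or attributes.

```python
import json
import boto3

# The Price List API endpoint is only available in a few Regions; us-east-1 is used here.
pricing = boto3.client("pricing", region_name="us-east-1")

def on_demand_hourly_price(instance_type: str, location: str) -> float:
    """Return the On-Demand hourly price for a Linux, shared-tenancy instance in a Region."""
    response = pricing.get_products(
        ServiceCode="AmazonEC2",
        Filters=[
            {"Type": "TERM_MATCH", "Field": "instanceType", "Value": instance_type},
            {"Type": "TERM_MATCH", "Field": "location", "Value": location},
            {"Type": "TERM_MATCH", "Field": "operatingSystem", "Value": "Linux"},
            {"Type": "TERM_MATCH", "Field": "tenancy", "Value": "Shared"},
            {"Type": "TERM_MATCH", "Field": "preInstalledSw", "Value": "NA"},
            {"Type": "TERM_MATCH", "Field": "capacitystatus", "Value": "Used"},
        ],
        MaxResults=1,
    )
    product = json.loads(response["PriceList"][0])            # each entry is a JSON string
    term = next(iter(product["terms"]["OnDemand"].values()))
    dimension = next(iter(term["priceDimensions"].values()))
    return float(dimension["pricePerUnit"]["USD"])

for region in ["US East (N. Virginia)", "EU (Ireland)", "South America (Sao Paulo)"]:
    print(region, on_demand_hourly_price("m5.large", region), "USD per hour")
```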
to not use multiple Regions. If there are no obligations restricting you to a single Region, then
consider using multiple Regions.
• Analyze required data transfer: Consider data transfer costs when selecting Regions. Keep your
data close to your customers and close to the resources that consume it. Select less costly AWS
Regions where data flows and where there is minimal data transfer. Depending on your business requirements
for data transfer, you can use Amazon CloudFront, AWS PrivateLink, AWS Direct Connect, and
AWS Virtual Private Network to reduce your networking costs, improve performance, and
enhance security.
Resources
Related documents:
Related videos:
Related examples:
Cost-efficient agreements and terms ensure that the cost of these services scales with the benefits
they provide. Select agreements and pricing that scale when they provide additional benefits to
your organization.
Related videos:
Permanently running resources should use reserved capacity such as Savings Plans or Reserved
Instances. Configure short-term capacity to use Spot Instances or Spot Fleet. Use On-Demand
Instances only for short-term workloads that cannot be interrupted and do not run long enough
to justify reserved capacity (between 25% and 75% of the period, depending on the resource type).
Implementation guidance
To improve cost efficiency, AWS provides multiple commitment recommendations based on your
past usage. You can use these recommendations to understand what you can save, and how the
commitment will be used. You can use these services On-Demand, use Spot Instances, or make a
commitment for a certain period of time and reduce your On-Demand costs with Reserved Instances
(RIs) and Savings Plans (SPs). To optimize your workload, you need to understand not only each
workload component and the multiple AWS services involved, but also the commitment discounts,
purchase options, and Spot Instances available for those services.
Consider the requirements of your workload’s components, and understand the different pricing
models for these services. Define the availability requirement of these components. Determine
if there are multiple independent resources that perform the function in the workload, and what
the workload requirements are over time. Compare the cost of the resources using the default On-
Demand pricing model and other applicable models. Factor in any potential changes in resources or
workload components.
For example, consider this Web Application Architecture on AWS. This sample workload consists
of multiple AWS services, such as Amazon Route 53, AWS WAF, Amazon CloudFront, Amazon EC2
instances, Amazon RDS instances, load balancers, Amazon S3 storage, and Amazon Elastic File
System (Amazon EFS). Review each of these services and identify potential cost saving
opportunities with different pricing models: some of them may be eligible for RIs or SPs, while
others may only be available On-Demand.
Related videos:
Related examples:
Regularly check billing and cost management tools, and review the recommended commitment and
reservation discounts, performing the analysis at the management account level.
Implementation guidance
Performing regular cost modeling helps you implement opportunities to optimize across multiple
workloads. For example, if multiple workloads use On-Demand Instances, then at an aggregate
level the risk of change is lower, and implementing a commitment-based discount can achieve a
lower overall cost. It is recommended to perform this analysis in regular cycles of two weeks to one
month.
This allows you to make small adjustment purchases, so the coverage of your pricing models
continues to evolve with your changing workloads and their components.
Use the AWS Cost Explorer recommendations tool to find opportunities for commitment discounts
in your management account. Recommendations at the management account level are calculated
considering usage across all of the accounts in your AWS organization that have Reserved Instances
(RIs) or Savings Plans (SPs). They're also calculated when discount sharing is activated to recommend
a commitment that maximizes savings across accounts.
While purchasing at the management account level optimizes for maximum savings in many cases,
there may be situations where you might consider purchasing SPs at the linked account level, such
as when you want the discounts to apply first to usage in that particular linked account. Member
account recommendations are calculated at the individual account level to maximize savings for
each isolated account. If your account owns both RI and SP commitments, Reserved Instances are
applied first, followed by Savings Plans.
You can find the correct recommendations, with the required discounts and risk, by following the
Well-Architected labs.
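As a minimal sketch of retrieving reservation recommendations at the management account scope with the AWS SDK for Python (Boto3); the service name, term, and payment option are illustrative choices.

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

# AccountScope="PAYER" requests recommendations calculated across the whole organization;
# use "LINKED" instead to scope the analysis to an individual member account.
response = ce.get_reservation_purchase_recommendation(
    Service="Amazon Relational Database Service",
    AccountScope="PAYER",
    LookbackPeriodInDays="SIXTY_DAYS",
    TermInYears="ONE_YEAR",
    PaymentOption="PARTIAL_UPFRONT",
)

for recommendation in response.get("Recommendations", []):
    summary = recommendation.get("RecommendationSummary", {})
    print("Estimated monthly savings:", summary.get("TotalEstimatedMonthlySavingsAmount"))
```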
Resources
Related documents:
Related videos:
Related examples:
Verify that you plan and monitor data transfer charges so that you can make architectural decisions
to minimize costs. A small yet effective architectural change can drastically reduce your operational
costs over time.
Best practices
Implementation steps
• Identify requirements: What is the primary goal, and what are the business requirements, for the
planned data transfer between source and destination? What is the expected business outcome?
Gather the business requirements and define the expected outcome.
• Identify source and destination: What is the data source and destination for the data transfer,
such as within AWS Regions, to AWS services, or out to the internet?
• Identify data classifications: What is the data classification for this data transfer? What kind of
data is it? How big is the data? How frequently must data be transferred? Is data sensitive?
• Identify AWS services or tools to use: Which AWS services are used for this data transfer? Is it
possible to use an already-provisioned service for another workload?
• Calculate data transfer costs: Use AWS Pricing and the data transfer modeling you created
previously to calculate the data transfer costs for the workload. Calculate the data transfer costs
at different usage levels, for both increases and reductions in workload usage. Where there are
multiple options for the workload architecture, calculate the cost for each option for comparison.
• Link costs to outcomes: For each data transfer cost incurred, specify the outcome that it
achieves for the workload. If it is a transfer between components, it may be for decoupling; if it is
between Availability Zones, it may be for redundancy.
• Create data transfer modeling: After gathering all of this information, create a conceptual
baseline data transfer model for multiple use cases and different workloads.
Resources
Related documents:
• AWS Pricing
are not, create new NAT gateways in the same Availability Zone as the resource to reduce cross-
AZ data transfer charges.
• Use AWS Direct Connect: AWS Direct Connect bypasses the public internet and establishes a
direct, private connection between your on-premises network and AWS. This can be more cost-
effective and consistent than transferring large volumes of data over the internet.
• Avoid transferring data across Regional boundaries: Data transfers between AWS Regions
(from one Region to another) typically incur charges. It should be a very thoughtful decision to
pursue a multi-Region path. For more detail, see Multi-Region scenarios.
• Monitor data transfer: Use Amazon CloudWatch and VPC Flow Logs to capture details about your
data transfer and network usage. Analyze the captured network traffic information in your VPCs,
such as the IP addresses or ranges going to and from network interfaces.
• Analyze your network usage: Use metering and reporting tools such as AWS Cost Explorer,
CUDOS Dashboards, or CloudWatch to understand the data transfer costs of your workload.
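A simple way to see where data transfer costs accrue is to group Cost Explorer results by usage type and filter for data transfer entries. The following is a minimal sketch using the AWS SDK for Python (Boto3); the time period is a placeholder, and the string match is a convenience rather than an official classification.

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-05-01", "End": "2024-06-01"},  # placeholder period
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
)

# Usage types that bill for data transfer typically contain "DataTransfer" in their name,
# so filtering client-side avoids having to know exact dimension values up front.
for result in response["ResultsByTime"]:
    for group in result["Groups"]:
        usage_type = group["Keys"][0]
        if "DataTransfer" in usage_type:
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            print(f"{usage_type}: {amount:.2f} USD")
```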
Implementation steps
• Select components for data transfer: Using the data transfer modeling explained in COST08-
BP01 Perform data transfer modeling, focus on where the largest data transfer costs are or
where they would be if the workload usage changes. Look for alternative architectures or
additional components that remove or reduce the need for data transfer (or lower its cost).
Resources
Related documents:
Related examples:
• NAT gateways provide built-in scaling and management, reducing costs compared to a
standalone NAT instance. Place NAT gateways in the same Availability Zones as high-traffic
instances, and consider using VPC endpoints for instances that need to access Amazon
DynamoDB or Amazon S3 to reduce data transfer and processing costs.
• Use AWS Snow Family devices, which have computing resources to collect and process data at
the edge. AWS Snow Family devices (Snowcone, Snowball, and Snowmobile) allow you to move
petabytes of data to the AWS Cloud cost-effectively and offline.
Implementation steps
• Implement services: Select applicable AWS networking services based on your workload type,
using the data transfer modeling and a review of VPC Flow Logs. Look at where the largest
costs and highest-volume flows are. Review the AWS services and assess whether there is a
service that reduces or removes the transfer, particularly for networking and content delivery. Also
look for caching services where there is repeated access to data or large amounts of data.
Resources
Related documents:
• Amazon CloudFront
• AWS Snow Family
Related videos:
Related examples:
Implementation guidance
Analyzing workload demand for cloud computing involves understanding the patterns and
characteristics of computing tasks that are initiated in the cloud environment. This analysis helps
users optimize resource allocation, manage costs, and verify that performance meets required
levels.
Know the requirements of the workload. Your organization's requirements should indicate the
workload response times for requests. The response time can be used to determine if the demand
is managed, or if the supply of resources should change to meet the demand.
The analysis should include the predictability and repeatability of the demand, the rate of change
in demand, and the amount of change in demand. Perform the analysis over a long enough period
to incorporate any seasonal variance, such as end-of-month processing or holiday peaks.
Analysis effort should reflect the potential benefits of implementing scaling. Look at the expected
total cost of the component and any increases or decreases in usage and cost over the workload's
lifetime.
The following are some key aspects to consider when performing workload demand analysis for
cloud computing:
1. Resource utilization and performance metrics: Analyze how AWS resources are being used over
time. Determine peak and off-peak usage patterns to optimize resource allocation and scaling
strategies. Monitor performance metrics such as response times, latency, throughput, and error
rates. These metrics help assess the overall health and efficiency of the cloud infrastructure (a
monitoring sketch follows this list).
2. User and application scaling behavior: Understand user behavior and how it affects workload
demand. Examining the patterns of user traffic assists in enhancing the delivery of content
and the responsiveness of applications. Analyze how workloads scale with increasing demand.
Determine whether auto-scaling parameters are configured correctly and effectively for
handling load fluctuations.
3. Workload types: Identify the different types of workloads running in the cloud, such as batch
processing, real-time data processing, web applications, databases, or machine learning. Each
type of workload may have different resource requirements and performance profiles.
4. Service-level agreements (SLAs): Compare actual performance with SLAs to ensure compliance
and identify areas that need improvement.
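As a minimal monitoring sketch for the first aspect, using the AWS SDK for Python (Boto3) and Amazon CloudWatch; the Auto Scaling group name is a placeholder, and the p99 statistic is chosen to reflect peaks rather than averages.

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(days=14)  # cover at least one full workload cycle

response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            "Id": "cpu_p99",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": "CPUUtilization",
                    "Dimensions": [{"Name": "AutoScalingGroupName", "Value": "web-asg"}],
                },
                "Period": 3600,
                "Stat": "p99",  # prefer p99 or Maximum over Average for sizing decisions
            },
        }
    ],
    StartTime=start,
    EndTime=end,
)

values = response["MetricDataResults"][0]["Values"]
print("Peak hourly p99 CPU utilization:", max(values) if values else "no data")
```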
• AWS X-Ray
• AWS Auto Scaling
• Amazon QuickSight
Related videos:
Related examples:
Buffering and throttling modify the demand on your workload, smoothing out any peaks.
Implement throttling when your clients perform retries. Implement buffering to store the request
and defer processing until a later time. Verify that your throttles and buffers are designed so clients
receive a response in the required time.
Implementation guidance
Implementing a buffer or throttle is crucial in cloud computing in order to manage demand and
reduce the provisioned capacity required for your workload. For optimal performance, it's essential
to gauge the total demand, including peaks, the pace of change in requests, and the necessary
response time. When clients have the ability to resend their requests, it becomes practical to apply
throttling. Conversely, for clients lacking retry functionalities, the ideal approach is implementing
a buffer solution. Such buffers streamline the influx of requests and optimize the interaction of
applications with varied operational speeds.
Buffering and throttling can smooth out peaks by modifying the demand on your workload.
Use throttling when clients retry actions, and use buffering to hold requests and process them later.
When working with a buffer-based approach, architect your workload to service the request in the
required time, and verify that you are able to handle duplicate requests for work. Analyze the overall
demand, rate of change, and required response time to right size the throttle or buffer required.
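As a minimal buffering sketch using Amazon SQS with the AWS SDK for Python (Boto3); the queue URL is a placeholder, and process() stands in for your workload's actual processing logic, which must be idempotent because duplicate deliveries are possible.

```python
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/request-buffer"  # placeholder

def process(request: dict) -> None:
    # Placeholder for the real work; must tolerate being called twice for the same request.
    print("processing", request)

def enqueue(request: dict) -> None:
    """Producer: buffer the request instead of processing it synchronously."""
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(request))

def worker() -> None:
    """Consumer: drain the buffer at a steady rate sized for average, not peak, demand."""
    while True:
        response = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for message in response.get("Messages", []):
            process(json.loads(message["Body"]))
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```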
Implementation steps
• Analyze the client requirements: Analyze the client requests to determine if they are capable
of performing retries. For clients that cannot perform retries, buffers need to be implemented.
Analyze the overall demand, rate of change, and required response time to determine the size of
throttle or buffer required.
Resources
Related documents:
• Amazon Kinesis
Related videos:
You can also easily configure schedules for your Amazon EC2 instances across your accounts
and Regions with a simple user interface (UI) using AWS Systems Manager Quick Setup. You can
schedule Amazon EC2 or Amazon RDS instances with AWS Instance Scheduler, and you can stop
and start existing instances. However, you cannot stop and start instances that are part of an
Auto Scaling group (ASG) or that are managed by services such as Amazon Redshift or Amazon
OpenSearch Service. Auto Scaling groups have their own scheduling for the instances in the group,
and those instances are created and removed as part of the group's scaling actions.
AWS Auto Scaling helps you adjust your capacity to maintain steady, predictable performance
at the lowest possible cost to meet changing demand. It is a fully managed and free service to
scale the capacity of your application that integrates with Amazon EC2 instances and Spot Fleets,
Amazon ECS, Amazon DynamoDB, and Amazon Aurora. Auto Scaling provides automatic resource
discovery to help find resources in your workload that can be configured, it has built-in scaling
strategies to optimize performance, costs, or a balance between the two, and provides predictive
scaling to assist with regularly occurring spikes.
There are multiple scaling options available to scale your Auto Scaling group:
• Dynamic scaling (such as target tracking): Automatically increases capacity during demand spikes
to maintain performance and decreases capacity when demand subsides to reduce costs.
• Simple/step scaling: Monitors metrics and adds or removes instances according to steps that you
define.
When architecting with a demand-based approach, keep in mind two key considerations. First,
understand how quickly you must provision new resources. Second, understand that the size of
margin between supply and demand will shift. You must be ready to cope with the rate of change
in demand and also be ready for resource failures.
Time-based supply: A time-based approach aligns resource capacity to demand that is predictable
or well-defined by time. This approach is typically not dependent upon utilization levels of the
resources. A time-based approach ensures that resources are available at the specific time they
are required and can be provided without any delays due to start-up procedures and system or
consistency checks. Using a time-based approach, you can provide additional resources or increase
capacity during busy periods.
When architecting with a time-based approach, keep in mind two key considerations. First,
how consistent is the usage pattern? Second, what is the impact if the pattern changes? You
can increase the accuracy of predictions by monitoring your workloads and by using business
intelligence. If you see significant changes in the usage pattern, you can adjust the times to ensure
that coverage is provided.
Implementation steps
• Configure scheduled scaling: For predictable changes in demand, time-based scaling can
provide the correct number of resources in a timely manner. It is also useful if resource
creation and configuration is not fast enough to respond to changes in demand. Using your
workload analysis, configure scheduled scaling using AWS Auto Scaling. To configure time-
based scheduling, you can use predictive scaling or scheduled scaling to increase the number
of Amazon EC2 instances in your Auto Scaling groups in advance of expected or predictable
load changes (a scheduled scaling sketch follows these steps).
• Configure predictive scaling: Predictive scaling allows you to increase the number of Amazon
EC2 instances in your Auto Scaling group in advance of daily and weekly patterns in traffic flows.
If you have regular traffic spikes and applications that take a long time to start, you should
consider using predictive scaling. Predictive scaling can help you scale faster by initializing
capacity before projected load compared to dynamic scaling alone, which is reactive in nature.
For example, if users start using your workload at the start of business hours and don't use it
after hours, then predictive scaling can add capacity before business hours, which eliminates the
delay of dynamic scaling reacting to changing traffic.
• Configure dynamic automatic scaling: To configure scaling based on active workload metrics,
use Auto Scaling. Use the analysis and configure Auto Scaling to launch on the correct resource
levels, and verify that the workload scales in the required time. You can launch and automatically
scale a fleet of On-Demand Instances and Spot Instances within a single Auto Scaling group.
In addition to receiving discounts for using Spot Instances, you can use Reserved Instances or a
Savings Plan to receive discounted rates of the regular On-Demand Instance pricing. All of these
factors combined help you to optimize your cost savings for Amazon EC2 instances and help you
get the desired scale and performance for your application.
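As a minimal scheduled scaling sketch using the AWS SDK for Python (Boto3); the Auto Scaling group name, times, and capacities are placeholders to adapt to your own workload analysis.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Scale out before business hours (cron syntax, Monday to Friday, given time zone).
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="web-asg",
    ScheduledActionName="business-hours-scale-out",
    Recurrence="0 8 * * 1-5",
    TimeZone="Europe/London",
    MinSize=4,
    DesiredCapacity=6,
    MaxSize=12,
)

# Scale back in after hours to reduce cost.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="web-asg",
    ScheduledActionName="after-hours-scale-in",
    Recurrence="0 20 * * 1-5",
    TimeZone="Europe/London",
    MinSize=1,
    DesiredCapacity=1,
    MaxSize=12,
)
```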
Resources
Related documents:
Develop a process that defines the criteria and process for workload review. The review effort
should reflect potential benefit. For example, core workloads or workloads with a value of over ten
percent of the bill are reviewed quarterly or every six months, while workloads below ten percent
are reviewed annually.
Implementation guidance
To have the most cost-efficient workload, you must regularly review the workload to know if there
are opportunities to implement new services, features, and components. To achieve overall lower
costs, the process must be proportional to the potential amount of savings. For example, workloads
that are 50% of your overall spend should be reviewed more regularly, and more thoroughly, than
workloads that are five percent of your overall spend. Factor in any external factors or volatility.
If the workload services a specific geography or market segment, and change in that area is
predicted, more frequent reviews could lead to cost savings. Another factor in review is the effort
to implement changes. If there are significant costs in testing and validating changes, reviews
should be less frequent.
Factor in the long-term cost of maintaining outdated and legacy components and resources, and
the inability to implement new features into them. The current cost of testing and validation may
exceed the proposed benefit. However, over time, the cost of making the change may significantly
increase as the gap between the workload and the current technologies increases, resulting in even
larger costs. For example, the cost of moving to a new programming language may not currently
be cost effective. However, in five years' time, the cost of people skilled in that language may
increase, and due to workload growth, you would be moving an even larger system to the new
language, requiring even more effort than previously.
Break down your workload into components, assign the cost of the component (an estimate
is sufficient), and then list the factors (for example, effort and external markets) next to each
component. Use these indicators to determine a review frequency for each workload. For example,
you may have web servers as a high cost, low change effort, and high external factors, resulting
in high frequency of review. A central database may be medium cost, high change effort, and low
external factors, resulting in a medium frequency of review.
Define a process to evaluate new services, design patterns, resource types, and configurations to
optimize your workload cost as they become available. Similar to performance pillar review and
Existing workloads are regularly reviewed based on each defined process to find out if new services
can be adopted, existing services can be replaced, or workloads can be re-architected.
Implementation guidance
AWS is constantly adding new features so you can experiment and innovate faster with the latest
technology. AWS What's New details how AWS is doing this and provides a quick overview of AWS
services, features, and Regional expansion announcements as they are released. You can dive
deeper into the launches that have been announced and use them for your review and analysis
of your existing workloads. To realize the benefits of new AWS services and features, review
your workloads and implement new services and features as required. This means you may
need to replace existing services you use for your workload, or modernize your workload to adopt
these new AWS services. For example, you might review your workloads and replace the messaging
component with Amazon Simple Email Service. This removes the cost of operating and maintaining
a fleet of instances, while providing all the functionality at a reduced cost.
To analyze your workload and highlight potential opportunities, you should consider not only
new services but also new ways of building solutions. Review the This is My Architecture videos
on AWS to learn about other customers’ architecture designs, their challenges and their solutions.
Check the All-In series to see real-world applications of AWS services and customer stories.
You can also watch the Back to Basics video series that explains, examines, and breaks down basic
cloud architecture pattern best practices. Another source is How to Build This videos, which are
designed to assist people with big ideas on how to bring their minimum viable product (MVP) to
life using AWS services. It is a way for builders from all over the world who have a strong idea to
gain architectural guidance from experienced AWS Solutions Architects. Finally, you can review the
Getting Started resource materials, which include step-by-step tutorials.
Before starting your review process, gather your business requirements for the workload, the
security and data privacy requirements that determine which services or Regions you can use, and
your performance requirements, and then follow your agreed review process.
Implementation steps
operations through automation. Assess the time and associated costs required for operational
efforts and implement automation for administrative tasks to minimize manual effort wherever
feasible.
Implementation guidance
Automating operations reduces the frequency of manual tasks, improves efficiency, and benefits
customers by delivering a consistent and reliable experience when deploying, administering, or
operating workloads. You can free up infrastructure resources from manual operational tasks
and use them for higher value tasks and innovations, which improves business value. Enterprises
require a proven, tested way to manage their workloads in the cloud. That solution must be secure,
fast, and cost effective, with minimum risk and maximum reliability.
Start by prioritizing your operational activities based on required effort by looking at overall
operations cost. For example, how long does it take to deploy new resources in the cloud, make
optimization changes to existing ones, or implement necessary configurations? Look at the total
cost of human actions by factoring in cost of operations and management. Prioritize automations
for admin tasks to reduce the human effort.
Review effort should reflect the potential benefit. For example, examine the time spent performing
tasks manually as opposed to automatically. Prioritize automating repetitive, high-value, time-
consuming, and complex activities. Activities that have high value or a high risk of human error
are typically the best place to start automating, as that risk often creates unwanted additional
operational cost (like the operations team working extra hours).
Use automation tools like AWS Systems Manager or AWS Config to streamline operations,
compliance, monitoring, lifecycle, and termination processes. With AWS services, tools, and
third-party products, you can customize the automations you implement to meet your specific
requirements. The following list shows some of the core operation functions and capabilities you can
automate with AWS services for administration and operations:
• AWS Audit Manager: Continually audit your AWS usage to simplify risk and compliance
assessment
• AWS Backup: Centrally manage and automate data protection.
• AWS Config: Assess, audit, and evaluate the configurations and resource inventory of your AWS
resources.
• AWS CloudFormation: Launch highly available resources with Infrastructure as Code.
with the capabilities of AWS Config and AWS CloudFormation, you can efficiently manage and
automate configuration compliance at scale for hundreds of member accounts. You can review
changes in configurations and relationships between AWS resources and dive into the history of
a resource configuration.
• Automate monitoring tasks: AWS provides various tools that you can use to monitor services.
You can configure these tools to automate monitoring tasks. Create and implement a monitoring
plan that collects monitoring data from all the parts of your workload so that you can more
easily debug a multi-point failure if one occurs. For example, you can use automated
monitoring tools to observe Amazon EC2 and report back to you when something is wrong
through system status checks, instance status checks, and Amazon CloudWatch alarms.
• Create a continual lifecycle with automations: It is important to establish and preserve
mature lifecycle policies, not only for regulations or redundancy but also for cost optimization.
You can use AWS Backup to centrally manage and automate data protection of data stores, such
as your buckets, volumes, databases, and file systems. You can also use Amazon Data Lifecycle
Manager to automate the creation, retention, and deletion of EBS snapshots and EBS-backed
AMIs (a minimal sketch follows this list).
• Delete unnecessary resources: It's quite common to accumulate unused resources in sandbox
or development AWS accounts. Developers create and experiment with various services and
resources as part of the normal development cycle, and then they don't delete those resources
when they're no longer needed. Unused resources can incur unnecessary and sometimes high
costs for the organization. Deleting these resources can reduce the costs of operating these
environments. If you are not sure, confirm that the data is no longer needed or has been backed up before deleting resources. You can use
AWS CloudFormation to clean up deployed stacks, which automatically deletes most resources
defined in the template. Alternatively, you can create an automation for the deletion of AWS
resources using tools like aws-nuke.
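As a minimal lifecycle automation sketch with Amazon Data Lifecycle Manager using the AWS SDK for Python (Boto3); the role ARN, tag, and retention values are placeholders, and the role must already have the Data Lifecycle Manager permissions.

```python
import boto3

dlm = boto3.client("dlm")

dlm.create_lifecycle_policy(
    ExecutionRoleArn="arn:aws:iam::123456789012:role/AWSDataLifecycleManagerDefaultRole",
    Description="Daily EBS snapshots of tagged volumes, retained for 7 days",
    State="ENABLED",
    PolicyDetails={
        "PolicyType": "EBS_SNAPSHOT_MANAGEMENT",
        "ResourceTypes": ["VOLUME"],
        "TargetTags": [{"Key": "Backup", "Value": "true"}],
        "Schedules": [
            {
                "Name": "DailySnapshots",
                "CreateRule": {"Interval": 24, "IntervalUnit": "HOURS", "Times": ["03:00"]},
                "RetainRule": {"Count": 7},  # older snapshots are deleted automatically
                "CopyTags": True,
            }
        ],
    },
)
```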
Region selection
Question
• SUS 1 How do you select Regions for your workload?
The choice of Region for your workload significantly affects its KPIs, including performance, cost,
and carbon footprint. To effectively improve these KPIs, you should choose Regions for your
workloads based on both business requirements and sustainability goals.
Best practices
• SUS01-BP01 Choose Region based on both business requirements and sustainability goals
SUS01-BP01 Choose Region based on both business requirements and sustainability goals
Choose a Region for your workload based on both your business requirements and sustainability
goals to optimize its KPIs, including performance, cost, and carbon footprint.
Common anti-patterns:
Benefits of establishing this best practice: Placing a workload close to Amazon renewable energy
projects or Regions with low published carbon intensity can help to lower the carbon footprint of a
cloud workload.
Related videos:
Alignment to demand
Question
• SUS 2 How do you align cloud resources to your demand?
The way users and applications consume your workloads and other resources can help you identify
improvements to meet sustainability goals. Scale infrastructure to continually match demand and
verify that you use only the minimum resources required to support your users. Align service levels
to customer needs. Position resources to limit the network required for users and applications to
consume them. Remove unused assets. Provide your team members with devices that support their
needs and minimize their sustainability impact.
Best practices
• SUS02-BP01 Scale workload infrastructure dynamically
• SUS02-BP02 Align SLAs with sustainability goals
• SUS02-BP03 Stop the creation and maintenance of unused assets
• SUS02-BP04 Optimize geographic placement of workloads based on their networking
requirements
• SUS02-BP05 Optimize team member resources for activities performed
• SUS02-BP06 Implement buffering or throttling to flatten the demand curve
Use the elasticity of the cloud and scale your infrastructure dynamically to match the supply of
cloud resources to demand and avoid overprovisioned capacity in your workload.
Implementation steps
• Elasticity matches the supply of resources you have against the demand for those resources.
Instances, containers, and functions provide mechanisms for elasticity, either in combination
with automatic scaling or as a feature of the service. AWS provides a range of auto scaling
mechanisms to ensure that workloads can scale down quickly and easily during periods of low
user load. Here are some examples of auto scaling mechanisms:
• Amazon EC2 Auto Scaling: Use to verify that you have the correct number of Amazon EC2
instances available to handle the user load for your application.
• Scaling is often discussed related to compute services like Amazon EC2 instances or AWS Lambda
functions. Consider the configuration of non-compute services like Amazon DynamoDB read and
write capacity units or Amazon Kinesis Data Streams shards to match the demand.
• Verify that the metrics for scaling up or down are validated against the type of workload being
deployed. If you are deploying a video transcoding application, 100% CPU utilization is expected
and should not be your primary metric. You can use a customized metric (such as memory
utilization) for your scaling policy if required. To choose the right metrics, consider the following
guidance for Amazon EC2:
• The metric should be a valid utilization metric and describe how busy an instance is.
• The metric value must increase or decrease proportionally to the number of instances in the
Auto Scaling group.
• Use dynamic scaling instead of manual scaling for your Auto Scaling group. We also recommend
that you use target tracking scaling policies in your dynamic scaling (a minimal sketch follows this
list).
• Verify that workload deployments can handle both scale-out and scale-in events. Create test
scenarios for scale-in events to verify that the workload behaves as expected and does not affect
the user experience.
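As a minimal target tracking sketch using the AWS SDK for Python (Boto3); the group name and target value are placeholders and should come from validated workload metrics.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Keep the group's average CPU utilization near the target; the service scales out above it
# and scales in below it, with built-in cooldown behavior.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
        "TargetValue": 50.0,
    },
)
```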
Review and optimize workload service-level agreements (SLA) based on your sustainability goals
to minimize the resources required to support your workload while continuing to meet business
needs.
Common anti-patterns:
Benefits of establishing this best practice: Aligning SLAs with sustainability goals leads to optimal
resource usage while meeting business needs.
Implementation guidance
SLAs define the level of service expected from a cloud workload, such as response time, availability,
and data retention. They influence the architecture, resource usage, and environmental impact of
a cloud workload. At a regular cadence, review SLAs and make trade-offs that significantly reduce
resource usage in exchange for acceptable decreases in service levels.
Implementation steps
Common anti-patterns:
• You do not analyze your application for assets that are redundant or no longer required.
Benefits of establishing this best practice: Removing unused assets frees resources and improves
the overall efficiency of the workload.
Implementation guidance
Unused assets consume cloud resources like storage space and compute power. By identifying
and eliminating these assets, you can free up these resources, resulting in a more efficient cloud
architecture. Perform regular analysis on application assets such as pre-compiled reports, datasets,
static images, and asset access patterns to identify redundancy, underutilization, and potential
decommission targets. Remove those redundant assets to reduce the resource waste in your
workload.
Implementation steps
• Conduct an inventory: Conduct a comprehensive inventory to identify all assets within your
workload.
• Analyze usage: Use continuous monitoring to identify static assets that are no longer required.
• Remove unused assets: Develop a plan to remove assets that are no longer required.
• Before removing any asset, evaluate the impact of removing it on the architecture.
• Update your applications to no longer produce and store assets that are not required.
• Communicate with third parties: Instruct third parties to stop producing and storing assets
managed on your behalf that are no longer required. Ask to consolidate redundant assets.
• Use lifecycle policies: Use lifecycle policies to automatically delete unused assets.
• You can use Amazon S3 Lifecycle to manage your objects throughout their lifecycle (a minimal sketch follows these steps).
• You can use Amazon Data Lifecycle Manager to automate the creation, retention, and deletion
of Amazon EBS snapshots and Amazon EBS-backed AMIs.
• Review and optimize: Regularly review your workload to identify and remove any unused assets.
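As a minimal Amazon S3 Lifecycle sketch using the AWS SDK for Python (Boto3); the bucket name, prefix, and retention periods are placeholders for illustration.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-reports-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "retire-stale-reports",
                "Filter": {"Prefix": "reports/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
                "Expiration": {"Days": 365},  # delete assets that are no longer required
            }
        ]
    },
)
```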
Analyze the network access patterns in your workload to identify how to use these cloud location
options and reduce the distance network traffic must travel.
Implementation steps
• Analyze network access patterns in your workload to identify how users use your application.
• Use monitoring tools, such as Amazon CloudWatch and AWS CloudTrail, to gather data on
network activities.
• Select the Regions for your workload deployment based on the following key elements:
• Where your data is located: For data-heavy applications (such as big data and machine
learning), application code should run as close to the data as possible.
• Where your users are located: For user-facing applications, choose a Region (or Regions) close
to your workload’s users.
• Other constraints: Consider constraints such as cost and compliance as explained in What to
Consider when Selecting a Region for your Workloads.
• Use local caching or AWS Caching Solutions for frequently used assets to improve performance,
reduce data movement, and lower environmental impact.
• Use services that can help you run code closer to users of your workload:
• AWS re:Invent 2023 - A migration strategy for edge and on-premises workloads
• AWS re:Invent 2021 - AWS Outposts: Bringing the AWS experience on premises
• AWS re:Invent 2020 - AWS Wavelength: Run apps with ultra-low latency at 5G edge
• AWS re:Invent 2022 - AWS Local Zones: Building applications for a distributed edge
• AWS re:Invent 2022 - Improve performance and availability with AWS Global Accelerator
• AWS re:Invent 2022 - Build your global wide area network using AWS
Related examples:
Common anti-patterns:
• You ignore the impact of devices used by your team members on the overall efficiency of your
cloud application.
Benefits of establishing this best practice: Optimizing team member resources improves the
overall efficiency of cloud-enabled applications.
Related videos:
Buffering and throttling flatten the demand curve and reduce the provisioned capacity required for
your workload.
Common anti-patterns:
Benefits of establishing this best practice: Flattening the demand curve reduces the required
provisioned capacity for the workload. Reducing the provisioned capacity means less energy
consumption and less environmental impact.
Implementation guidance
Flattening the workload demand curve can help you reduce the provisioned capacity for a
workload and reduce its environmental impact. Consider a workload with the demand curve shown
in the following figure. This workload has two peaks, and to handle those peaks, resource capacity
is provisioned above both of them. The resources and energy used for this workload are indicated
not by the area under the demand curve, but by the area under the provisioned capacity line,
because provisioned capacity is needed to handle those two peaks.
Demand curve with two distinct peaks that require high provisioned capacity.
Resources
Related documents:
Related videos:
Implement patterns for performing load smoothing and maintaining consistent high utilization
of deployed resources to minimize the resources consumed. Components might become idle from
lack of use because of changes in user behavior over time. Revise patterns and architecture to
consolidate under-utilized components to increase overall utilization. Retire components that are
no longer required. Understand the performance of your workload components, and optimize the
components that consume the most resources. Be aware of the devices that your customers use to
access your services, and implement patterns to minimize the need for device upgrades.
Best practices
• SUS03-BP01 Optimize software and architecture for asynchronous and scheduled jobs
Implementation steps
• Analyze the demand for your workload to determine how to respond to it.
• For requests or jobs that don’t require synchronous responses, use queue-driven architectures
and auto scaling workers to maximize utilization. Here are some examples of when you might
consider queue-driven architecture:
• AWS Batch job queues: AWS Batch jobs are submitted to a job queue, where they reside until
they can be scheduled to run in a compute environment.
• Amazon Simple Queue Service and Amazon EC2 Spot Instances: Pair Amazon SQS and Spot
Instances to build fault-tolerant and efficient architectures.
• For requests or jobs that can be processed anytime, use scheduling mechanisms to process jobs
in batches for more efficiency (a scheduling sketch follows these steps). Here are some examples of
scheduling mechanisms on AWS:
• Amazon Elastic Container Service (Amazon ECS) scheduled tasks: Amazon ECS supports creating
scheduled tasks. Scheduled tasks use Amazon EventBridge rules to run tasks either on a schedule
or in response to an EventBridge event.
• If you use polling and webhook mechanisms in your architecture, replace them with events. Use
event-driven architectures to build highly efficient workloads.
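As a minimal scheduling sketch using an Amazon EventBridge rule with the AWS SDK for Python (Boto3); the Lambda function ARN is a placeholder, and the function must separately grant EventBridge permission to invoke it.

```python
import boto3

events = boto3.client("events")

# Run a batch job in a typical off-peak window (02:00 UTC daily).
events.put_rule(
    Name="nightly-batch-window",
    ScheduleExpression="cron(0 2 * * ? *)",
    State="ENABLED",
)

events.put_targets(
    Rule="nightly-batch-window",
    Targets=[
        {
            "Id": "batch-processor",
            "Arn": "arn:aws:lambda:us-east-1:123456789012:function:process-batch",
        }
    ],
)
```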
Remove components that are unused and no longer required, and refactor components with little
utilization to minimize waste in your workload.
Common anti-patterns:
• You do not regularly check the utilization level of individual components of your workload.
• You do not check and analyze recommendations from AWS rightsizing tools such as AWS
Compute Optimizer.
Benefits of establishing this best practice: Removing unused components minimizes waste and
improves the overall efficiency of your cloud workload.
Implementation guidance
Review your workload to identify idle or unused components. This is an iterative improvement
process which can be initiated by changes in demand or the release of a new cloud service. For
example, a significant drop in AWS Lambda function run time can be an indicator of a need to
lower the memory size. Also, as AWS releases new services and features, the optimal services and
architecture for your workload may change.
Continually monitor workload activity and look for opportunities to improve the utilization level
of individual components. By removing idle components and performing rightsizing activities, you
meet your business requirements with the fewest cloud resources.
Implementation steps
• Have an inventory of your AWS resources. In AWS, you can turn on AWS Resource Explorer to
explore and organize your AWS resources. For more details, see AWS re:Invent 2022 - How to
manage resources and applications at scale on AWS.
• Monitor and capture the utilization metrics for critical components of your workload (like CPU
utilization, memory utilization, or network throughput in Amazon CloudWatch metrics).
• Identify unused or under-utilized components in your architecture.
• For stable workloads, check AWS rightsizing tools such as AWS Compute Optimizer at regular
intervals to identify idle, unused, or underutilized components.
Benefits of establishing this best practice: Using efficient code minimizes resource usage and
improves performance.
Implementation guidance
It is crucial to examine every functional area, including the code for a cloud architected application,
to optimize its resource usage and performance. Continually monitor your workload’s performance
in build environments and production and identify opportunities to improve code snippets that
have particularly high resource usage. Adopt a regular review process to identify bugs or anti-
patterns within your code that use resources inefficiently. Leverage simple and efficient algorithms
that produce the same results for your use case.
Implementation steps
• Use efficient programming language: Use an efficient operating system and programming
language for the workload. For details on energy efficient programming languages (including
Rust), see Sustainability with Rust.
• Use an AI coding companion: Consider using an AI coding companion such as Amazon
CodeWhisperer to efficiently write code.
• Automate code reviews: While developing your workloads, adopt an automated code review
process to improve quality and identify bugs and anti-patterns.
• Automate code reviews with Amazon CodeGuru Reviewer
• Detecting concurrency bugs with Amazon CodeGuru
• Raising code quality for Python applications using Amazon CodeGuru
• Use a code profiler: Use a code profiler to identify the areas of code that use the most time or
resources as targets for optimization (a generic profiling sketch follows this list).
• Reducing your organization's carbon footprint with Amazon CodeGuru Profiler
• Understanding memory usage in your Java application with Amazon CodeGuru Profiler
• Improving customer experience and reducing cost with Amazon CodeGuru Profiler
• Monitor and optimize: Use continuous monitoring resources to identify components with high
resource requirements or suboptimal configuration.
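As a generic profiling sketch using Python's built-in cProfile module (not Amazon CodeGuru Profiler); hot_path() is a placeholder for the code you want to examine.

```python
import cProfile
import pstats

def hot_path():
    # Placeholder for the workload code under investigation.
    return sum(i * i for i in range(1_000_000))

profiler = cProfile.Profile()
profiler.enable()
hot_path()
profiler.disable()

# Print the ten functions with the highest cumulative time as optimization targets.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```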
Benefits of establishing this best practice: Implementing software patterns and features that are
optimized for customer devices can reduce the overall environmental impact of your cloud workload.
Implementation guidance
Implementing software patterns and features that are optimized for customer devices can reduce
the environmental impact in several ways:
• Implementing new features that are backward compatible can reduce the number of hardware
replacements.
• Optimizing an application to run efficiently on devices can help to reduce their energy
consumption and extend their battery life (if they are powered by battery).
• Optimizing an application for devices can also reduce the data transfer over the network.
Understand the devices and equipment used in your architecture, their expected lifecycle, and the
impact of replacing those components. Implement software patterns and features that can help
to minimize device energy consumption and reduce the need for customers to replace devices or
upgrade them manually.
Implementation steps
• Conduct an inventory: Inventory the devices used in your architecture. Devices can be mobile
phones, tablets, IoT devices, smart lights, or even smart devices in a factory.
• Use energy-efficient devices: Consider using energy-efficient devices in your architecture. Use
power management configurations on devices to enter low power mode when not in use.
• Run efficient applications: Optimize the application running on the devices:
• Use strategies such as running tasks in the background to reduce their energy consumption.
• Account for network bandwidth and latency when building payloads, and implement
capabilities that help your applications work well on low bandwidth, high latency links.
• Convert payloads and files into optimized formats required by devices. For example, you
can use Amazon Elastic Transcoder or AWS Elemental MediaConvert to convert large, high
quality digital media files into formats that users can play back on mobile devices, tablets, web
browsers, and connected televisions.
SUS03-BP05 Use software patterns and architectures that best support data access and storage
patterns
Understand how data is used within your workload, consumed by your users, transferred, and
stored. Use software patterns and architectures that best support data access and storage to
minimize the compute, networking, and storage resources required to support the workload.
Common anti-patterns:
• You assume that all workloads have similar data storage and access patterns.
• You only use one tier of storage, assuming all workloads fit within that tier.
• You assume that data access patterns will stay consistent over time.
• Your architecture supports a potential high data access burst, which results in the resources
remaining idle most of the time.
Benefits of establishing this best practice: Selecting and optimizing your architecture based on
data access and storage patterns will help decrease development complexity and increase overall
utilization. Understanding when to use global tables, data partitioning, and caching will help you
decrease operational overhead and scale based on your workload needs.
Implementation guidance
Use software and architecture patterns that align best with your data characteristics and access
patterns. For example, use a modern data architecture on AWS, which allows you to use purpose-
built services optimized for your unique analytics use cases. These architecture patterns allow for
efficient data processing and reduce resource usage.
Implementation steps
• Analyze your data characteristics and access patterns to identify the correct configuration for
your cloud resources. Key characteristics to consider include:
• Data type: structured, semi-structured, unstructured
• Data growth: bounded, unbounded
• Data durability: persistent, ephemeral, transient
• Access patterns: reads or writes, update frequency, spiky or consistent
• Use architecture patterns that best support data access and storage patterns.
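As one hedged example of matching storage behavior to an unpredictable access pattern, the sketch below enables S3 Intelligent-Tiering archive tiers on a bucket so objects that stop being read move to colder tiers automatically. The bucket name, configuration ID, and day thresholds are placeholder assumptions.

import boto3

s3 = boto3.client("s3")

# For datasets with unknown or changing access patterns, Intelligent-Tiering
# moves objects between access tiers so rarely read data is not kept on the
# most resource-intensive tier. The bucket name is a placeholder.
s3.put_bucket_intelligent_tiering_configuration(
    Bucket="example-analytics-bucket",
    Id="archive-cold-data",
    IntelligentTieringConfiguration={
        "Id": "archive-cold-data",
        "Status": "Enabled",
        "Tierings": [
            {"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},
            {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"},
        ],
    },
)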
Resources
Related videos:
• AWS re:Invent 2023 - Optimizing storage price and performance with Amazon S3
• AWS re:Invent 2023 - Building and optimizing a data lake on Amazon S3
• AWS re:Invent 2023 - Advanced event-driven patterns with Amazon EventBridge
Related examples:
Data
Question
• SUS 4 How do you take advantage of data management policies and patterns to support your
sustainability goals?
SUS 4 How do you take advantage of data management policies and patterns to
support your sustainability goals?
Implement data management practices to reduce the provisioned storage required to support your
workload, and the resources required to use it. Understand your data, and use storage technologies
and configurations that more effectively support the business value of the data and how it’s used.
Lifecycle data to more efficient, less performant storage when requirements decrease, and delete
data that’s no longer required.
Best practices
• SUS04-BP01 Implement a data classification policy
• SUS04-BP02 Use technologies that support data access and storage patterns
• SUS04-BP03 Use policies to manage the lifecycle of your datasets
• SUS04-BP04 Use elasticity and automation to expand block storage or file system
• SUS04-BP05 Remove unneeded or redundant data
• SUS04-BP06 Use shared file systems or storage to access common data
• SUS04-BP07 Minimize data movement across networks
• Periodically review: Periodically review and audit your environment for untagged and
unclassified data. Use automation to identify this data, and classify and tag the data
appropriately. As an example, see Data Catalog and crawlers in AWS Glue; a minimal sketch follows this list.
• Establish a data catalog: Establish a data catalog that provides audit and governance
capabilities.
• Documentation: Document data classification policies and handling procedures for each data
class.
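The following is a minimal sketch of the AWS Glue crawler approach referenced above: a scheduled crawler discovers new datasets under an S3 prefix and registers them in the Data Catalog so they can be classified and tagged. The crawler name, IAM role, database, bucket path, and schedule are placeholder assumptions.

import boto3

glue = boto3.client("glue")

# Crawl an S3 prefix on a schedule so new or untagged datasets are discovered
# and registered in the Data Catalog automatically. All names are placeholders.
glue.create_crawler(
    Name="sustainability-data-classification",
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",
    DatabaseName="data_classification_catalog",
    Targets={"S3Targets": [{"Path": "s3://example-data-lake/raw/"}]},
    Schedule="cron(0 2 * * ? *)",   # daily at 02:00 UTC
)
glue.start_crawler(Name="sustainability-data-classification")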
Resources
Related documents:
Related videos:
SUS04-BP02 Use technologies that support data access and storage patterns
Use storage technologies that best support how your data is accessed and stored to minimize the
resources provisioned while supporting your workload.
Common anti-patterns:
• You assume that all workloads have similar data storage and access patterns.
• You only use one tier of storage, assuming all workloads fit within that tier.
• You assume that data access patterns will stay consistent over time.
Benefits of establishing this best practice: Selecting and optimizing your storage technologies
based on data access and storage patterns will help you reduce the cloud resources required to
meet your business needs and improve the overall efficiency of your cloud workload.
Resources
Related documents:
Related videos:
• AWS re:Invent 2023 - Improve Amazon EBS efficiency and be more cost-efficient
• AWS re:Invent 2023 - Optimizing storage price and performance with Amazon S3
• AWS re:Invent 2023 - Building and optimizing a data lake on Amazon S3
• AWS re:Invent 2022 - Building modern data architectures on AWS
• AWS re:Invent 2022 - Modernize apps with purpose-built databases
• AWS re:Invent 2022 - Building data mesh architectures on AWS
• AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations
• AWS re:Invent 2023 - Advanced data modeling with Amazon DynamoDB
Related examples:
• Amazon S3 Examples
• AWS Purpose Built Databases Workshop
• Databases for Developers
• AWS Modern Data Architecture Immersion Day
• Build a Data Mesh on AWS
SUS04-BP03 Use policies to manage the lifecycle of your datasets
Manage the lifecycle of all of your data and automatically enforce deletion to minimize the total
storage required for your workload.
The following services provide built-in lifecycle capabilities:
• Amazon Elastic Block Store: You can use Amazon Data Lifecycle Manager to automate the
creation, retention, and deletion of Amazon EBS snapshots and Amazon EBS-backed AMIs.
• Amazon Elastic Container Registry: Amazon ECR lifecycle policies automate the cleanup of your
container images by expiring images based on age or count.
• AWS Elemental MediaStore: You can use an object lifecycle policy that governs how long objects
should be stored in the MediaStore container.
• Delete unused volumes, snapshots, and data that is out of its retention period. Leverage native
service features like Amazon DynamoDB Time To Live or Amazon CloudWatch log retention for
deletion.
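A minimal sketch of the native deletion features named above, assuming a hypothetical table, TTL attribute, and log group; once configured, the services expire data on their own instead of relying on cleanup jobs.

import boto3

# Expire DynamoDB items automatically once their TTL attribute has passed.
# The table and attribute names are placeholders.
dynamodb = boto3.client("dynamodb")
dynamodb.update_time_to_live(
    TableName="session-state",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)

# Cap CloudWatch Logs retention so log data is deleted after 30 days.
# The log group name is a placeholder.
logs = boto3.client("logs")
logs.put_retention_policy(logGroupName="/app/example-service", retentionInDays=30)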
Resources
Related documents:
• Optimize your Amazon S3 Lifecycle rules with Amazon S3 Storage Class Analysis
Related videos:
Resources
Related documents:
Related videos:
SUS04-BP05 Remove unneeded or redundant data
Remove unneeded or redundant data to minimize the storage resources required to store your
datasets.
Common anti-patterns:
• Use data virtualization capabilities on AWS to maintain data at its source and avoid data
duplication.
• Cloud Native Data Virtualization on AWS
• Optimize Data Pattern Using Amazon Redshift Data Sharing
• Use backup technology that can make incremental backups.
• Leverage the durability of Amazon S3 and replication of Amazon EBS to meet your durability
goals instead of self-managed technologies (such as a redundant array of independent disks
(RAID)).
• Centralize log and trace data, deduplicate identical log entries, and establish mechanisms to tune
verbosity when needed.
• Pre-populate caches only where justified.
• Establish cache monitoring and automation to resize the cache accordingly.
• Remove out-of-date deployments and assets from object stores and edge caches when pushing
new versions of your workload.
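Building on the last step, the following is a minimal sketch that removes a superseded release from an origin bucket and invalidates the matching paths in an Amazon CloudFront distribution so stale copies are not retained at the edge. The bucket, prefix, and distribution ID are placeholder assumptions.

import boto3
import time

s3 = boto3.client("s3")
cloudfront = boto3.client("cloudfront")

BUCKET = "example-static-assets"        # placeholder bucket
OLD_PREFIX = "assets/v1/"               # superseded release prefix
DISTRIBUTION_ID = "EDFDVBD6EXAMPLE"     # placeholder distribution ID

# Delete the superseded release from the origin bucket.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=OLD_PREFIX):
    objects = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
    if objects:
        s3.delete_objects(Bucket=BUCKET, Delete={"Objects": objects})

# Evict the same paths from edge caches so they are not served or re-fetched.
cloudfront.create_invalidation(
    DistributionId=DISTRIBUTION_ID,
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/assets/v1/*"]},
        "CallerReference": str(time.time()),
    },
)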
Resources
Related documents:
Related videos:
Related examples:
• Copy data to or fetch data from shared file systems only as needed. As an example, you can
create an Amazon FSx for Lustre file system backed by Amazon S3 and only load the subset of
data required for processing jobs to Amazon FSx (a minimal sketch follows this list).
• Delete data as appropriate for your usage patterns as outlined in SUS04-BP03 Use policies to
manage the lifecycle of your datasets.
• Detach volumes from clients that are not actively using them.
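The sketch referenced above creates a scratch FSx for Lustre file system linked to an S3 prefix so processing jobs lazy-load only the objects they actually read, instead of copying the whole dataset. The subnet, security group, bucket path, and capacity values are placeholder assumptions.

import boto3

fsx = boto3.client("fsx")

# Create a scratch Lustre file system linked to an S3 prefix; data is loaded
# on first access rather than copied in full. All IDs and paths are placeholders.
fsx.create_file_system(
    FileSystemType="LUSTRE",
    StorageCapacity=1200,                      # GiB, smallest SCRATCH_2 size
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    LustreConfiguration={
        "DeploymentType": "SCRATCH_2",
        "ImportPath": "s3://example-dataset-bucket/training-subset/",
    },
)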
Resources
Related documents:
Related videos:
• Use services that can help you run code closer to users of your workload.
Resources
Related documents:
• Amazon CloudFront Key Features including the CloudFront Global Edge Network
Related videos:
Related examples:
Resources
Related best practices:
• REL09-BP01 Identify and back up all data that needs to be backed up, or reproduce the data
from sources
• REL09-BP03 Perform data backup automatically
• REL13-BP02 Use defined recovery strategies to meet the recovery objectives
Related documents:
• Using AWS Backup to back up and restore Amazon EFS file systems
• Amazon EBS snapshots
• Working with backups on Amazon Relational Database Service
• APN Partner: partners that can help with backup
• AWS Marketplace: products that can be used for backup
• Backing Up Amazon EFS
• Backing Up Amazon FSx for Windows File Server
• Backup and Restore for Amazon ElastiCache (Redis OSS)
Related videos:
• AWS re:Invent 2023 - Backup and disaster recovery strategies for increased resilience
• AWS re:Invent 2023 - What's new with AWS Backup
• AWS re:Invent 2021 - Backup, disaster recovery, and ransomware protection with AWS
Related examples:
use rightsizing guidelines from AWS tools to efficiently operate your cloud resources and meet your
business needs.
Implementation steps
• Choose the instance type: Choose the right instance type to best fit your needs. To learn about
how to choose Amazon Elastic Compute Cloud instances and use mechanisms such as attribute-
based instance selection, see the following:
• How do I choose the appropriate Amazon EC2 instance type for my workload?
• Attribute-based instance type selection for Amazon EC2 Fleet.
• Create an Auto Scaling group using attribute-based instance type selection.
• Scale: Use small increments to scale variable workloads.
• Use multiple compute purchase options: Balance instance flexibility, scalability, and cost
savings with multiple compute purchase options.
• Amazon EC2 On-Demand Instances are best suited for new, stateful, and spiky workloads that
cannot be flexible in instance type, location, or time.
• Amazon EC2 Spot Instances are a great way to supplement the other options for applications
that are fault tolerant and flexible.
• Leverage Compute Savings Plans for steady-state workloads; Savings Plans give you flexibility if
your needs (like AZ, Region, instance families, or instance types) change.
• Use instance and Availability Zone diversity: Maximize application availability and take
advantage of excess capacity by diversifying your instances and Availability Zones.
• Rightsize instances: Use the rightsizing recommendations from AWS tools to make
adjustments on your workload. For more information, see Optimizing your cost with Rightsizing
Recommendations and Right Sizing: Provisioning Instances to Match Workloads.
• Use rightsizing recommendations in AWS Cost Explorer or AWS Compute Optimizer to identify
rightsizing opportunities (see the sketch after this list).
• Negotiate service-level agreements (SLAs): Negotiate SLAs that permit temporarily reducing
capacity while automation deploys replacement resources.
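The sketch referenced above is one hedged way to pull rightsizing opportunities programmatically from AWS Compute Optimizer. It assumes the account has already opted in to Compute Optimizer, and it prints only a subset of the response fields.

import boto3

compute_optimizer = boto3.client("compute-optimizer")

# List each instance with a recommendation and the top suggested instance type.
# Compute Optimizer must already be enabled for the account.
response = compute_optimizer.get_ec2_instance_recommendations()
for rec in response["instanceRecommendations"]:
    options = rec.get("recommendationOptions", [])
    if options:
        print(f"{rec['instanceArn']}: {rec['currentInstanceType']} "
              f"({rec['finding']}) -> suggested {options[0]['instanceType']}")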
Resources
Related documents:
Implementation guidance
Using efficient instances in your cloud workload is crucial for lowering resource usage and
improving cost-effectiveness. Continually monitor the release of new instance types and take
advantage of energy efficiency improvements, including those instance types designed to support
specific workloads such as machine learning training and inference, and video transcoding.
Implementation steps
• Learn and explore instance types: Find instance types that can lower your workload's
environmental impact.
• Subscribe to What's New with AWS to stay up-to-date with the latest AWS technologies and
instances.
• Learn about AWS Graviton-based instances, which offer the best performance per watt of
energy use in Amazon EC2, by watching re:Invent 2020 - Deep dive on AWS Graviton2
processor-powered Amazon EC2 instances and Deep dive into AWS Graviton3 and Amazon EC2
C7g instances.
• Use instance types with the least impact: Plan and transition your workload to instance types
with the least impact.
• Define a process to evaluate new features or instances for your workload. Take advantage
of agility in the cloud to quickly test how new instance types can improve your workload
environmental sustainability. Use proxy metrics to measure how many resources it takes you to
complete a unit of work.
• If possible, modify your workload to work with different numbers of vCPUs and different
amounts of memory to maximize your choice of instance type.
• Consider selecting the AWS Graviton option in your usage of AWS managed services.
• Migrate your workload to Regions that offer instances with the least sustainability impact and
still meet your business requirements.
• For machine learning workloads, take advantage of purpose-built hardware that is specific to
your workload such as AWS Trainium, AWS Inferentia, and Amazon EC2 DL1. AWS Inferentia
Related videos:
• AWS re:Invent 2023 - New Amazon Elastic Compute Cloud generative AI capabilities in AWS
Management Console
• AWS re:Invent 2023 - What's new with Amazon Elastic Compute Cloud
• AWS re:Invent 2023 - Smart savings: Amazon Elastic Compute Cloud cost-optimization strategies
• AWS re:Invent 2021 - Deep dive into AWS Graviton3 and Amazon EC2 C7g instances
• AWS re:Invent 2022 - Build a cost-, energy-, and resource-efficient compute environment
Related examples:
• Solution: Guidance for Optimizing Deep Learning Workloads for Sustainability on AWS
Common anti-patterns:
• You use Amazon EC2 instances with low utilization to run your applications.
• Your in-house team only manages the workload, without time to focus on innovation or
simplifications.
• You deploy and maintain technologies for tasks that can run more efficiently on managed
services.
• Using managed services shifts the responsibility to AWS, which has insights across millions of
customers that can help drive new innovations and efficiencies.
• Managed services distribute the environmental impact of the service across many users because
of their multi-tenant control planes.
5. Replace self-hosted services: Use your migration plan to replace self-hosted services with
managed service.
6. Monitor and adjust: Continually monitor the service after the migration is complete to make
adjustments as required and optimize the service.
Resources
Related documents:
Related videos:
• AWS re:Invent 2021 - Cloud operations at scale with AWS Managed Services
• AWS re:Invent 2023 - Best practices for operating on AWS
Optimize your use of accelerated computing instances to reduce the physical infrastructure
demands of your workload.
Common anti-patterns:
Benefits of establishing this best practice: By optimizing the use of hardware-based accelerators,
you can reduce the physical-infrastructure demands of your workload.
Resources
Related documents:
• Choose the best AI accelerator and model compilation for computer vision inference with
Amazon SageMaker
Related videos:
• AWS re:Invent 2021 - How to select Amazon EC2 GPU instances for deep learning
• AWS re:Invent 2022 - [NEW LAUNCH!] Introducing AWS Inferentia2-based Amazon EC2 Inf2
instances
• AWS re:Invent 2022 - Accelerate deep learning and innovate faster with AWS Trainium
• AWS re:Invent 2022 - Deep learning on AWS with NVIDIA: From training to deployment
Look for opportunities to reduce your sustainability impact by making changes to your
development, test, and deployment practices.
Best practices
• Streamline the process: Continually improve and streamline your development processes. As an
example, automate your software delivery process using continuous integration and delivery (CI/
CD) pipelines to test and deploy potential improvements, reducing the level of effort and limiting
errors caused by manual processes.
• Training and awareness: Run training programs for your team members to educate them about
sustainability and how their activities impact your organizational sustainability goals.
• Assess and adjust: Continually assess the impact of improvements and make adjustments as
needed.
Resources
Related documents:
Related videos:
Related examples:
• Well-Architected Lab - Turning cost & usage reports into efficiency reports
Keep your workload up-to-date to adopt efficient features, remove issues, and improve the overall
efficiency of your workload.
Common anti-patterns:
• You assume your current architecture is static and will not be updated over time.
• Use automation: Automate updates to reduce the level of effort to deploy new features and
limit errors caused by manual processes.
• You can use CI/CD to automatically update AMIs, container images, and other artifacts related
to your cloud application.
• You can use tools such as AWS Systems Manager Patch Manager to automate the process of
system updates, and schedule the activity using AWS Systems Manager Maintenance Windows.
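A minimal sketch of the Maintenance Windows step above, assuming a hypothetical weekly schedule and a tag-based target group; the actual patch task (for example, AWS-RunPatchBaseline) would be registered against the window separately.

import boto3

ssm = boto3.client("ssm")

# Create a weekly window during which patching tasks are allowed to run.
# The schedule, duration, and tag values are placeholders.
window = ssm.create_maintenance_window(
    Name="weekly-patching",
    Schedule="cron(0 4 ? * SUN *)",   # Sundays at 04:00 UTC
    Duration=3,                        # window length in hours
    Cutoff=1,                          # stop starting new tasks 1 hour before close
    AllowUnassociatedTargets=False,
)

# Register the instances to patch, selected by tag.
ssm.register_target_with_maintenance_window(
    WindowId=window["WindowId"],
    ResourceType="INSTANCE",
    Targets=[{"Key": "tag:PatchGroup", "Values": ["web-fleet"]}],
)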
Resources
Related documents:
Related videos:
• AWS re:Invent 2022 - Optimize your AWS workloads with best-practice guidance
Related examples:
• Maximize utilization: Use strategies to maximize the utilization of development and test
environments.
• Use minimum viable representative environments to develop and test potential improvements.
• Use instance types with burst capacity, Spot Instances, and other technologies to align build
capacity with use (a minimal sketch follows this list).
• Adopt native cloud services for secure instance shell access rather than deploying fleets of
bastion hosts.
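The sketch referenced above launches a short-lived Spot Instance as a build agent only while work is queued, rather than keeping an always-on build fleet. The AMI, instance type, and subnet are placeholder assumptions, and the caller would terminate the instance when the build completes.

import boto3

ec2 = boto3.client("ec2")

# Launch a one-time Spot build agent; terminate it when the build finishes.
# The AMI, instance type, and subnet are placeholders.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="c7g.large",
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-0123456789abcdef0",
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            "SpotInstanceType": "one-time",
            "InstanceInterruptionBehavior": "terminate",
        },
    },
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Purpose", "Value": "ci-build-agent"}],
    }],
)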
Resources
Related documents:
Related videos:
Use managed device farms to efficiently test a new feature on a representative set of hardware.
Common anti-patterns:
• You manually test and deploy your application on individual physical devices.
• You do not use an app testing service to test and interact with your apps (for example, Android,
iOS, and web apps) on real, physical devices.
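In contrast with the anti-patterns above, the following is a minimal sketch that schedules a test run on AWS Device Farm against a curated device pool instead of a manually maintained set of physical devices. The project, app upload, and device pool ARNs are placeholders, and the built-in fuzz test is used only to keep the example self-contained.

import boto3

# Device Farm is available in us-west-2.
devicefarm = boto3.client("devicefarm", region_name="us-west-2")

# Run an already-uploaded app against a device pool. All ARNs are placeholders.
devicefarm.schedule_run(
    projectArn="arn:aws:devicefarm:us-west-2:111122223333:project:EXAMPLE",
    appArn="arn:aws:devicefarm:us-west-2:111122223333:upload:EXAMPLE-APP",
    devicePoolArn="arn:aws:devicefarm:us-west-2:111122223333:devicepool:EXAMPLE-POOL",
    name="release-candidate-smoke-test",
    test={"type": "BUILTIN_FUZZ"},
)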
Related videos:
• AWS re:Invent 2023 - Improve your mobile and web app quality using AWS Device Farm
• AWS re:Invent 2021 - Optimize applications through end user insights with Amazon CloudWatch
RUM
Related examples:
AWS Glossary
For the latest AWS terminology, see the AWS glossary in the AWS Glossary Reference.