AWS Well-Architected Framework
Introduction
The AWS Well-Architected Framework helps you understand the pros and cons of decisions you
make while building systems on AWS. Using the Framework helps you learn architectural best
practices for designing and operating secure, reliable, efficient, cost-effective, and sustainable
workloads in the AWS Cloud. It provides a way for you to consistently measure your architectures
against best practices and identify areas for improvement. The process for reviewing an
architecture is a constructive conversation about architectural decisions, and is not an audit
mechanism. We believe that having well-architected systems greatly increases the likelihood of
business success.
AWS Solutions Architects have years of experience architecting solutions across a wide variety
of business verticals and use cases. We have helped design and review thousands of customers’
architectures on AWS. From this experience, we have identified best practices and core strategies
for architecting systems in the cloud.
The AWS Well-Architected Framework documents a set of foundational questions that help you to
understand if a specific architecture aligns well with cloud best practices. The framework provides
a consistent approach to evaluating systems against the qualities you expect from modern cloud-
based systems, and the remediation that would be required to achieve those qualities. As AWS
continues to evolve, and we continue to learn more from working with our customers, we will
continue to refine the definition of well-architected.
This framework is intended for those in technology roles, such as chief technology officers
(CTOs), architects, developers, and operations team members. It describes AWS best practices and
strategies to use when designing and operating a cloud workload, and provides links to further
implementation details and architectural patterns. For more information, see the AWS Well-
Architected homepage.
Definitions
• A component is the code, configuration, and AWS Resources that together deliver against a
requirement. A component is often the unit of technical ownership, and is decoupled from other
components.
• The term workload is used to identify a set of components that together deliver business value.
A workload is usually the level of detail that business and technology leaders communicate
about.
example, verifying that teams are meeting internal standards. We mitigate these risks in two ways.
First, we have practices (ways of doing things, process, standards, and accepted norms) that focus
on allowing each team to have that capability, and we put in place experts who verify that teams
raise the bar on the standards they need to meet. Second, we implement mechanisms that carry
out automated checks to verify standards are being met.
“Good intentions never work, you need good mechanisms to make anything happen” —
Jeff Bezos.
This means replacing a human's best efforts with mechanisms (often automated) that check for
compliance with rules or process. This distributed approach is supported by the Amazon leadership
principles, and establishes a culture across all roles that works back from the customer. Working
backward is a fundamental part of our innovation process. We start with the customer and what
they want, and let that define and guide our efforts. Customer-obsessed teams build products in
response to a customer need.
For architecture, this means that we expect every team to have the capability to create
architectures and to follow best practices. To help new teams gain these capabilities or existing
teams to raise their bar, we provide access to a virtual community of principal engineers who
can review their designs and help them understand what AWS best practices are. The principal
engineering community works to make best practices visible and accessible. One way they do this,
for example, is through lunchtime talks that focus on applying best practices to real examples.
These talks are recorded and can be used as part of onboarding materials for new team members.
AWS best practices emerge from our experience running thousands of systems at internet scale.
We prefer to use data to define best practice, but we also use subject matter experts, like principal
engineers, to set them. As principal engineers see new best practices emerge, they work as a
community to verify that teams follow them. In time, these best practices are formalized into our
internal review processes, and also into mechanisms that enforce compliance. The Well-Architected
Framework is the customer-facing implementation of our internal review process, where we
have codified our principal engineering thinking across field roles, like Solutions Architecture and
internal engineering teams. The Well-Architected Framework is a scalable mechanism that lets you
take advantage of these learnings.
improvements can be made and can help develop organizational experience in dealing with
events.
Design principles
The following are design principles for operational excellence in the cloud:
• Organize teams around business outcomes: The ability of a team to achieve business outcomes
comes from leadership vision, effective operations, and a business-aligned operating model.
Leadership should be fully invested and committed to a CloudOps transformation with a suitable
cloud operating model that incentivizes teams to operate in the most efficient way and meet
business outcomes. The right operating model uses people, process, and technology capabilities
to scale, optimize for productivity, and differentiate through agility, responsiveness, and
adaptation. The organization's long-term vision is translated into goals that are communicated
across the enterprise to stakeholders and consumers of your cloud services. Goals and
operational KPIs are aligned at all levels. This practice sustains the long-term value derived from
implementing the following design principles.
• Implement observability for actionable insights: Gain a comprehensive understanding
of workload behavior, performance, reliability, cost, and health. Establish key performance
indicators (KPIs) and leverage observability telemetry to make informed decisions and take
prompt action when business outcomes are at risk. Proactively improve performance, reliability,
and cost based on actionable observability data.
• Safely automate where possible: In the cloud, you can apply the same engineering discipline
that you use for application code to your entire environment. You can define your entire
workload and its operations (applications, infrastructure, configuration, and procedures) as code,
and update it. You can then automate your workload’s operations by initiating them in response
to events. In the cloud, you can employ automation safety by configuring guardrails, including
rate control, error thresholds, and approvals. Through effective automation, you can achieve
consistent responses to events, limit human error, and reduce operator toil.
• Make frequent, small, reversible changes: Design workloads that are scalable and loosely
coupled to permit components to be updated regularly. Automated deployment techniques
together with smaller, incremental changes reduce the blast radius and allow for faster reversal
when failures occur. This increases confidence to deliver beneficial changes to your workload
while maintaining quality and adapting quickly to changes in market conditions.
• Refine operations procedures frequently: As you evolve your workloads, evolve your operations
appropriately. As you use operations procedures, look for opportunities to improve them. Hold
regular reviews and validate that all procedures are effective and that teams are familiar with
them. Where gaps are identified, update procedures accordingly. Communicate procedural
Best practices
Note
All operational excellence questions have the OPS prefix as a shorthand for the pillar.
Topics
• Organization
• Prepare
• Operate
• Evolve
Organization
Your teams must have a shared understanding of your entire workload, their role in it, and shared
business goals to set the priorities that will achieve business success. Well-defined priorities will
maximize the benefits of your efforts. Evaluate internal and external customer needs involving
key stakeholders, including business, development, and operations teams, to determine where to
focus efforts. Evaluating customer needs will verify that you have a thorough understanding of
the support that is required to achieve business outcomes. Verify that you are aware of guidelines
or obligations defined by your organizational governance and external factors, such as regulatory
compliance requirements and industry standards that may mandate or emphasize specific focus.
Validate that you have mechanisms to identify changes to internal governance and external
compliance requirements. If no requirements are identified, validate that you have applied due
diligence to this determination. Review your priorities regularly so that they can be updated as
needs change.
Evaluate threats to the business (for example, business risk and liabilities, and information security
threats) and maintain this information in a risk registry. Evaluate the impact of risks, and tradeoffs
between competing interests or alternative approaches. For example, accelerating speed to market
for new features may be emphasized over cost optimization, or you may choose a relational
database for non-relational data to simplify the effort to migrate a system without refactoring.
Manage benefits and risks to make informed decisions when determining where to focus efforts.
Some risks or choices may be acceptable for a time, it may be possible to mitigate associated risks,
assumptions, and reduce the risk of confirmation bias. Grow inclusion, diversity, and accessibility
within your teams to gain beneficial perspectives.
If there are external regulatory or compliance requirements that apply to your organization,
you should use the resources provided by AWS Cloud Compliance to help educate your teams
so that they can determine the impact on your priorities. The Well-Architected Framework
emphasizes learning, measuring, and improving. It provides a consistent approach for you to
evaluate architectures, and implement designs that will scale over time. AWS provides the
AWS Well-Architected Tool to help you review your approach before development, the state
of your workloads before production, and the state of your workloads in production. You can
compare workloads to the latest AWS architectural best practices, monitor their overall status,
and gain insight into potential risks. AWS Trusted Advisor is a tool that provides access to a core
set of checks that recommend optimizations that may help shape your priorities. Business and
Enterprise Support customers receive access to additional checks focusing on security, reliability,
performance, cost-optimization, and sustainability that can further help shape their priorities.
AWS can help you educate your teams about AWS and its services to increase their understanding
of how their choices can have an impact on your workload. Use the resources provided by AWS
Support (AWS Knowledge Center, AWS Discussion Forums, and AWS Support Center) and AWS
Documentation to educate your teams. Reach out to AWS Support through AWS Support Center
for help with your AWS questions. AWS also shares best practices and patterns that we have
learned through the operation of AWS in The Amazon Builders' Library. A wide variety of other
useful information is available through the AWS Blog and The Official AWS Podcast. AWS Training
and Certification provides some training through self-paced digital courses on AWS fundamentals.
You can also register for instructor-led training to further support the development of your teams’
AWS skills.
Use tools or services that permit you to centrally govern your environments across accounts,
such as AWS Organizations, to help manage your operating models. Services like AWS Control
Tower expand this management capability by allowing you to define blueprints (supporting your
operating models) for the setup of accounts, apply ongoing governance using AWS Organizations,
and automate provisioning of new accounts. Managed Services providers such as AWS Managed
Services, AWS Managed Services Partners, or Managed Services Providers in the AWS Partner
Network, provide expertise implementing cloud environments, and support your security and
compliance requirements and business goals. Adding Managed Services to your operating model
can save you time and resources, and lets you keep your internal teams lean and focused on
strategic outcomes that will differentiate your business, rather than developing new skills and
capabilities.
Prepare
To prepare for operational excellence, you have to understand your workloads and their expected
behaviors. You will then be able to design them to provide insight to their status and build the
procedures to support them.
Design your workload so that it provides the information necessary for you to understand its
internal state (for example, metrics, logs, events, and traces) across all components in support of
observability and investigating issues. Observability goes beyond simple monitoring, providing
a comprehensive understanding of a system's internal workings based on its external outputs.
Rooted in metrics, logs, and traces, observability offers profound insights into system behavior and
dynamics. With effective observability, teams can discern patterns, anomalies, and trends, allowing
them to proactively address potential issues and maintain optimal system health. Identifying key
performance indicators (KPIs) is pivotal to ensure alignment between monitoring activities and
business objectives. This alignment ensures that teams are making data-driven decisions using
metrics that genuinely matter, optimizing both system performance and business outcomes.
Furthermore, observability empowers businesses to be proactive rather than reactive. Teams can
understand the cause-and-effect relationships within their systems, predicting and preventing
issues rather than just reacting to them. As workloads evolve, it's essential to revisit and refine the
observability strategy, ensuring it remains relevant and effective.
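As one way to tie telemetry to KPIs, the sketch below publishes a hypothetical business KPI as a custom Amazon CloudWatch metric with boto3. The namespace, metric name, and dimension values are illustrative assumptions, not prescribed by the Framework.

import boto3

# Minimal sketch: publish a hypothetical business KPI as a custom CloudWatch metric.
# The namespace, metric name, and dimension are illustrative assumptions.
cloudwatch = boto3.client("cloudwatch")

def publish_orders_processed_kpi(order_count: int, environment: str = "production") -> None:
    # Emit the KPI so dashboards and alarms can track it alongside system metrics.
    cloudwatch.put_metric_data(
        Namespace="ExampleWorkload/Business",  # hypothetical namespace
        MetricData=[
            {
                "MetricName": "OrdersProcessed",
                "Dimensions": [{"Name": "Environment", "Value": environment}],
                "Value": float(order_count),
                "Unit": "Count",
            }
        ],
    )

if __name__ == "__main__":
    publish_orders_processed_kpi(42)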
Adopt approaches that improve the flow of changes into production and that achieve refactoring,
fast feedback on quality, and bug fixing. These accelerate beneficial changes entering production,
limit issues deployed, and activate rapid identification and remediation of issues introduced
through deployment activities or discovered in your environments.
Adopt approaches that provide fast feedback on quality and achieve rapid recovery from changes
that do not have desired outcomes. Using these practices mitigates the impact of issues introduced
through the deployment of changes. Plan for unsuccessful changes so that you are able to respond
faster if necessary and test and validate the changes you make. Be aware of planned activities
in your environments so that you can manage the risk of changes impacting planned activities.
Emphasize frequent, small, reversible changes to limit the scope of change. This results in faster
troubleshooting and remediation with the option to roll back a change. It also means you are able
to get the benefit of valuable changes more frequently.
Evaluate the operational readiness of your workload, processes, procedures, and personnel to
understand the operational risks related to your workload. Use a consistent process (including
manual or automated checklists) to know when you are ready to go live with your workload or
OPS 7: How do you know that you are ready to support a workload?
Evaluate the operational readiness of your workload, processes and procedures, and personnel
to understand the operational risks related to your workload.
Operate
Observability allows you to focus on meaningful data and understand your workload's interactions
and output. By concentrating on essential insights and eliminating unnecessary data, you maintain
a straightforward approach to understanding workload performance. It's essential not only
to collect data but also to interpret it correctly. Define clear baselines, set appropriate alert
thresholds, and actively monitor for any deviations. A shift in a key metric, especially when
correlated with other data, can pinpoint specific problem areas. With observability, you're better
equipped to foresee and address potential challenges, ensuring that your workload operates
smoothly and meets business needs.
Prepare and validate procedures for responding to events to minimize their disruption to your
workload.
All of the metrics you collect should be aligned to a business need and the outcomes they support.
Develop scripted responses to well-understood events and automate their performance in
response to recognizing the event.
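The following is a minimal sketch of what such a scripted response might look like as an AWS Lambda-style handler that reboots an affected instance. The event shape (an instance ID under event['detail']['instance-id']) and the choice of remediation are assumptions for illustration only.

import boto3

ec2 = boto3.client("ec2")

def lambda_handler(event, context):
    # Assumed event shape: an EventBridge event carrying the affected instance ID.
    instance_id = event["detail"]["instance-id"]

    # Scripted response to a well-understood event: reboot the instance.
    # A real runbook would add guardrails such as rate limits and approvals.
    ec2.reboot_instances(InstanceIds=[instance_id])

    return {"remediated_instance": instance_id}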
Evolve
Learn, share, and continuously improve to sustain operational excellence. Dedicate work cycles
to making nearly continuous incremental improvements. Perform post-incident analysis of all
customer impacting events. Identify the contributing factors and preventative action to limit or
prevent recurrence. Communicate contributing factors with affected communities as appropriate.
Regularly evaluate and prioritize opportunities for improvement (for example, feature requests,
issue remediation, and compliance requirements), including both the workload and operations
procedures.
Include feedback loops within your procedures to rapidly identify areas for improvement and
capture learnings from running operations.
Share lessons learned across teams to share the benefits of those lessons. Analyze trends within
lessons learned and perform cross-team retrospective analysis of operations metrics to identify
opportunities and methods for improvement. Implement changes intended to bring about
improvement and evaluate the results to determine success.
On AWS, you can export your log data to Amazon S3 or send logs directly to Amazon S3 for long-
term storage. Using AWS Glue, you can discover and prepare your log data in Amazon S3 for
analytics, and store associated metadata in the AWS Glue Data Catalog. Amazon Athena, through
its native integration with AWS Glue, can then be used to analyze your log data, querying it using
standard SQL. Using a business intelligence tool like Amazon QuickSight, you can visualize, explore,
and analyze your data, discovering trends and events of interest that may drive improvement.
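A minimal sketch of the Athena step, assuming a Glue database and table over the exported logs already exist; the database, table, query, and output location below are illustrative:

import time
import boto3

athena = boto3.client("athena")

# Hypothetical names: a Glue database/table over logs exported to Amazon S3.
DATABASE = "example_logs_db"
OUTPUT_LOCATION = "s3://example-athena-results/well-architected/"

def run_error_trend_query() -> list:
    # Standard SQL over log data catalogued in AWS Glue.
    query = """
        SELECT date_trunc('hour', from_iso8601_timestamp(timestamp)) AS hour,
               count(*) AS error_count
        FROM application_logs
        WHERE level = 'ERROR'
        GROUP BY 1
        ORDER BY 1
    """
    execution = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
    )
    query_id = execution["QueryExecutionId"]

    # Poll until the query finishes, then fetch the result rows.
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]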
• Definition
• Best practices
• Resources
Design principles
In the cloud, there are a number of principles that can help you strengthen your workload security:
• Implement a strong identity foundation: Implement the principle of least privilege and
enforce separation of duties with appropriate authorization for each interaction with your AWS
resources. Centralize identity management, and aim to eliminate reliance on long-term static
credentials.
• Maintain traceability: Monitor, alert, and audit actions and changes to your environment in
real time. Integrate log and metric collection with systems to automatically investigate and take
action.
• Apply security at all layers: Apply a defense in depth approach with multiple security controls.
Apply to all layers (for example, edge of network, VPC, load balancing, every instance and
compute service, operating system, application, and code).
• Automate security best practices: Automated software-based security mechanisms improve
your ability to securely scale more rapidly and cost-effectively. Create secure architectures,
including the implementation of controls that are defined and managed as code in version-
controlled templates.
• Protect data in transit and at rest: Classify your data into sensitivity levels and use mechanisms,
such as encryption, tokenization, and access control where appropriate.
• Keep people away from data: Use mechanisms and tools to reduce or eliminate the need for
direct access or manual processing of data. This reduces the risk of mishandling or modification
and human error when handling sensitive data.
• Prepare for security events: Prepare for an incident by having incident management and
investigation policy and processes that align to your organizational requirements. Run incident
response simulations and use tools with automation to increase your speed for detection,
investigation, and recovery.
Definition
There are seven best practice areas for security in the cloud:
Security
The following question focuses on these considerations for security. (For a list of security questions
and best practices, see the Appendix.)
To operate your workload securely, you must apply overarching best practices to every area of
security. Take requirements and processes that you have defined in operational excellence at an
organizational and workload level, and apply them to all areas.
Staying up to date with recommendations from AWS, industry sources, and threat intelligence helps you evolve your threat model and control objectives. Automating security processes, testing, and validation allows you to scale your security operations.
In AWS, segregating different workloads by account, based on their function and compliance or
data sensitivity requirements, is a recommended approach.
Identity and access management are key parts of an information security program, ensuring that
only authorized and authenticated users and components are able to access your resources, and
only in a manner that you intend. For example, you should define principals (that is, accounts,
users, roles, and services that can perform actions in your account), build out policies aligned with
these principals, and implement strong credential management. These privilege-management
elements form the core of authentication and authorization.
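As a minimal sketch of building a policy aligned with a principal, the following boto3 call creates a customer managed IAM policy that grants read-only access to a single S3 prefix. The bucket, prefix, and policy name are hypothetical.

import json
import boto3

iam = boto3.client("iam")

# Least-privilege sketch: read-only access to one hypothetical S3 prefix.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadReportsPrefixOnly",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-workload-bucket/reports/*",
        }
    ],
}

iam.create_policy(
    PolicyName="ExampleReportsReadOnly",  # hypothetical policy name
    PolicyDocument=json.dumps(policy_document),
    Description="Illustrative least-privilege policy scoped to a single prefix.",
)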
In AWS, privilege management is primarily supported by the AWS Identity and Access Management
(IAM) service, which allows you to control user and programmatic access to AWS services and
resources. You should apply granular policies, which assign permissions to a user, group, role, or
resource. You also have the ability to require strong password practices, such as complexity level,
avoiding re-use, and enforcing multi-factor authentication (MFA). You can use federation with your
existing directory service. For workloads that require systems to have access to AWS, IAM allows for
secure access through roles, instance profiles, identity federation, and temporary credentials.
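The sketch below shows one way temporary credentials can be obtained with AWS STS and used for short-lived, scoped access. The role ARN and session settings are illustrative assumptions.

import boto3

sts = boto3.client("sts")

# Assume a hypothetical role to obtain short-lived credentials instead of
# embedding long-term static keys in the workload.
response = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/ExampleWorkloadReadOnly",  # illustrative ARN
    RoleSessionName="well-architected-example",
    DurationSeconds=900,  # short session; credentials expire automatically
)
credentials = response["Credentials"]

# Use the temporary credentials for a scoped client session.
s3 = boto3.client(
    "s3",
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)
print(s3.list_buckets()["Buckets"])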
Capture and analyze events from logs and metrics to gain visibility. Take action on security
events and potential threats to help secure your workload.
Log management is important to a Well-Architected workload for reasons ranging from security
or forensics to regulatory or legal requirements. It is critical that you analyze logs and respond to
them so that you can identify potential security incidents. AWS provides functionality that makes
log management easier to implement by giving you the ability to define a data-retention lifecycle
or define where data will be preserved, archived, or eventually deleted. This makes predictable and
reliable data handling simpler and more cost effective.
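A minimal sketch of such a data-retention lifecycle, assuming a hypothetical central log bucket and prefix: logs transition to archival storage after 90 days and are deleted after a year.

import boto3

s3 = boto3.client("s3")

# Illustrative retention lifecycle for a hypothetical log bucket/prefix:
# archive to Amazon S3 Glacier after 90 days, delete after 365 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-central-logs",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire-app-logs",
                "Filter": {"Prefix": "app-logs/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)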
Infrastructure protection
In AWS, you can implement stateful and stateless packet inspection, either by using AWS-native
technologies or by using partner products and services available through the AWS Marketplace.
You should use Amazon Virtual Private Cloud (Amazon VPC) to create a private, secured, and
scalable environment in which you can define your topology—including gateways, routing tables,
and public and private subnets.
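A minimal boto3 sketch of defining that kind of topology, creating a VPC with one public and one private subnet; the CIDR ranges and Availability Zone are illustrative assumptions.

import boto3

ec2 = boto3.client("ec2")

# Illustrative topology: one VPC, a public subnet with an internet route,
# and a private subnet with no internet route. CIDRs and AZ are assumptions.
vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]

public_subnet = ec2.create_subnet(
    VpcId=vpc_id, CidrBlock="10.0.1.0/24", AvailabilityZone="us-east-1a"
)["Subnet"]["SubnetId"]
private_subnet = ec2.create_subnet(
    VpcId=vpc_id, CidrBlock="10.0.2.0/24", AvailabilityZone="us-east-1a"
)["Subnet"]["SubnetId"]

# Internet gateway and a route table that only the public subnet uses.
igw_id = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)

public_rt = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
ec2.create_route(RouteTableId=public_rt, DestinationCidrBlock="0.0.0.0/0", GatewayId=igw_id)
ec2.associate_route_table(RouteTableId=public_rt, SubnetId=public_subnet)

print(f"VPC {vpc_id}: public {public_subnet}, private {private_subnet}")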
Any workload that has some form of network connectivity, whether it’s the internet or a private
network, requires multiple layers of defense to help protect from external and internal network-
based threats.
• AWS never initiates the movement of data between Regions. Content placed in a Region will
remain in that Region unless you explicitly use a feature or leverage a service that provides that
functionality.
Classification provides a way to categorize data based on criticality and sensitivity, in order to help you determine appropriate protection and retention controls.
Protect your data at rest by implementing multiple controls, to reduce the risk of unauthorized
access or mishandling.
Protect your data in transit by implementing multiple controls to reduce the risk of unauthorized access or loss.
AWS provides multiple means for encrypting data at rest and in transit. We build features into our
services that make it easier to encrypt your data. For example, we have implemented server-side
encryption (SSE) for Amazon S3 to make it easier for you to store your data in an encrypted form.
You can also arrange for the entire HTTPS encryption and decryption process (generally known as
SSL termination) to be handled by Elastic Load Balancing (ELB).
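As a minimal sketch, the following call turns on default server-side encryption for a hypothetical bucket so new objects are encrypted at rest with an AWS KMS key; the bucket name and key alias are assumptions.

import boto3

s3 = boto3.client("s3")

# Default encryption at rest for a hypothetical bucket, using an assumed KMS key alias.
s3.put_bucket_encryption(
    Bucket="example-sensitive-data",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "alias/example-data-key",  # illustrative alias
                },
                "BucketKeyEnabled": True,  # reduce KMS request volume
            }
        ]
    },
)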
Incident response
Even with extremely mature preventive and detective controls, your organization should still
put processes in place to respond to and mitigate the potential impact of security incidents. The
architecture of your workload strongly affects the ability of your teams to operate effectively
during an incident, to isolate or contain systems, and to restore operations to a known good state.
Putting in place the tools and access ahead of a security incident, then routinely practicing incident
The cost and complexity to resolve defects is typically lower the earlier you are in the SDLC. The
easiest way to resolve issues is to not have them in the first place, which is why starting with
a threat model helps you focus on the right outcomes from the design phase. As your AppSec
program matures, you can increase the amount of testing that is performed using automation,
improve the fidelity of feedback to builders, and reduce the time needed for security reviews. All of
these actions improve the quality of the software you build, and increase the speed of delivering
features into production.
These implementation guidelines focus on four areas: organization and culture, security of the
pipeline, security in the pipeline, and dependency management. Each area provides a set of
principles that you can implement and provides an end-to-end view of how you design, develop,
build, deploy, and operate workloads.
In AWS, there are a number of approaches you can use when addressing your application security
program. Some of these approaches rely on technology while others focus on the people and
organizational aspects of your application security program.
SEC 11: How do you incorporate and validate the security properties of applications
throughout the design, development, and deployment lifecycle?
Training people, testing using automation, understanding dependencies, and validating the
security properties of tools and applications help to reduce the likelihood of security issues in
production workloads.
Resources
Refer to the following resources to learn more about our best practices for Security.
Documentation
around or repair the failure. With more sophisticated automation, it’s possible to anticipate and
remediate failures before they occur.
• Test recovery procedures: In an on-premises environment, testing is often conducted to prove
that the workload works in a particular scenario. Testing is not typically used to validate recovery
strategies. In the cloud, you can test how your workload fails, and you can validate your recovery
procedures. You can use automation to simulate different failures or to recreate scenarios that
led to failures before. This approach exposes failure pathways that you can test and fix before a
real failure scenario occurs, thus reducing risk.
• Scale horizontally to increase aggregate workload availability: Replace one large resource
with multiple small resources to reduce the impact of a single failure on the overall workload.
Distribute requests across multiple, smaller resources to verify that they don’t share a common
point of failure.
• Manage change through automation: Changes to your infrastructure should be made using
automation. The changes that must be managed include changes to the automation, which then
can be tracked and reviewed.
Definition
There are four best practice areas for reliability in the cloud:
• Foundations
• Workload architecture
• Change management
• Failure management
To achieve reliability, you must start with the foundations — an environment where Service Quotas
and network topology accommodate the workload. The workload architecture of the distributed
Workloads often exist in multiple environments. These include multiple cloud environments
(both publicly accessible and private) and possibly your existing data center infrastructure. Plans
must include network considerations such as intra- and inter-system connectivity, public IP
address management, private IP address management, and domain name resolution.
Workload architecture
A reliable workload starts with upfront design decisions for both software and infrastructure. Your
architecture choices will impact your workload behavior across all of the Well-Architected pillars.
For reliability, there are specific patterns you must follow.
With AWS, workload developers have their choice of languages and technologies to use. AWS SDKs
take the complexity out of coding by providing language-specific APIs for AWS services. These
SDKs, plus the choice of languages, permits developers to implement the reliability best practices
listed here. Developers can also read about and learn from how Amazon builds and operates
software in The Amazon Builders' Library.
Build highly scalable and reliable workloads using a service-oriented architecture (SOA) or
a microservices architecture. Service-oriented architecture (SOA) is the practice of making
software components reusable via service interfaces. Microservices architecture goes further to
make components smaller and simpler.
Controlled changes are necessary to deploy new functionality, and to verify that the workloads
and the operating environment are running known software and can be patched or replaced in
a predictable manner. If these changes are uncontrolled, it is difficult to predict their effect or to address issues that arise because of them.
When you architect a workload to automatically add and remove resources in response to changes
in demand, this not only increases reliability but also validates that business success doesn't
become a burden. With monitoring in place, your team will be automatically alerted when KPIs
deviate from expected norms. Automatic logging of changes to your environment permits you
to audit and quickly identify actions that might have impacted reliability. Controls on change
management certify that you can enforce the rules that deliver the reliability you need.
Failure management
In any system of reasonable complexity, it is expected that failures will occur. Reliability requires
that your workload be aware of failures as they occur and take action to avoid impact on
availability. Workloads must be able to both withstand failures and automatically repair issues.
With AWS, you can take advantage of automation to react to monitoring data. For example, when a
particular metric crosses a threshold, you can initiate an automated action to remedy the problem.
Also, rather than trying to diagnose and fix a failed resource that is part of your production
environment, you can replace it with a new one and carry out the analysis on the failed resource
out of band. Since the cloud allows you to stand up temporary versions of a whole system at low
cost, you can use automated testing to verify full recovery processes.
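A minimal sketch of initiating an automated action when a metric crosses a threshold: the alarm below notifies an assumed Amazon SNS topic, which could front a remediation runbook, when average CPU on an instance stays above 80% for ten minutes. The topic ARN, instance ID, and threshold are illustrative.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Illustrative alarm: when average CPU stays above 80% for two 5-minute periods,
# notify an assumed SNS topic that fronts an automated remediation runbook.
cloudwatch.put_metric_alarm(
    AlarmName="example-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # illustrative ID
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:example-remediation-topic"],
)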
Back up data, applications, and configuration to meet your requirements for recovery time
objectives (RTO) and recovery point objectives (RPO).
customers, even in the face of sustained problems. Your recovery processes should be as well
exercised as your normal production processes.
Resources
Refer to the following resources to learn more about our best practices for Reliability.
Documentation
• AWS Documentation
Whitepaper
Performance efficiency
The performance efficiency pillar includes the ability to use cloud resources efficiently to meet
performance requirements, and to maintain that efficiency as demand changes and technologies
evolve.
The performance efficiency pillar provides an overview of design principles, best practices, and
questions. You can find prescriptive guidance on implementation in the Performance Efficiency
Pillar whitepaper.
Topics
• Design principles
• Definition
• Best practices
• Resources
Reviewing your choices on a regular basis validates that you are taking advantage of the
continually evolving AWS Cloud. Monitoring verifies that you are aware of any deviance from
expected performance. Make trade-offs in your architecture to improve performance, such as using
compression or caching, or relaxing consistency requirements.
Best practices
Topics
• Architecture selection
• Compute and hardware
• Data management
• Networking and content delivery
• Process and culture
Architecture selection
The optimal solution for a particular workload varies, and solutions often combine multiple
approaches. Well-Architected workloads use multiple solutions and allow different features to
improve performance.
AWS resources are available in many types and configurations, which makes it easier to find an
approach that closely matches your needs. You can also find options that are not easily achievable
with on-premises infrastructure. For example, a managed service such as Amazon DynamoDB
provides a fully managed NoSQL database with single-digit millisecond latency at any scale.
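A minimal usage sketch for such a managed service, writing and reading an item from a hypothetical DynamoDB table; the table name and key schema are assumptions.

import boto3

# Hypothetical table with a partition key named "customer_id".
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("ExampleCustomers")

# Write an item, then read it back by key; DynamoDB handles scaling and replication.
table.put_item(Item={"customer_id": "c-1001", "plan": "standard", "active": True})
item = table.get_item(Key={"customer_id": "c-1001"}).get("Item")
print(item)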
The following question focuses on these considerations for performance efficiency. (For a list of
performance efficiency questions and best practices, see the Appendix.)
PERF 1: How do you select appropriate cloud resources and architecture patterns for your
workload?
Often, multiple approaches are required for more effective performance across a workload.
Well-Architected systems use multiple solutions and features to improve performance.
of access (online, offline, archival), frequency of update (WORM, dynamic), and availability and
durability constraints. Well-Architected workloads use purpose-built data stores which allow
different features to improve performance.
• Object storage provides a scalable, durable platform to make data accessible from any internet
location for user-generated content, active archive, serverless computing, Big Data storage or
backup and recovery. Amazon Simple Storage Service (Amazon S3) is an object storage service
that offers industry-leading scalability, data availability, security, and performance. Amazon S3
is designed for 99.999999999% (11 9's) of durability, and stores data for millions of applications
for companies all around the world.
• Block storage provides highly available, consistent, low-latency block storage for each virtual
host and is analogous to direct-attached storage (DAS) or a Storage Area Network (SAN). Amazon
Elastic Block Store (Amazon EBS) is designed for workloads that require persistent storage
accessible by EC2 instances that helps you tune applications with the right storage capacity,
performance and cost.
• File storage provides access to a shared file system across multiple systems. File storage
solutions like Amazon Elastic File System (Amazon EFS) are ideal for use cases such as large
content repositories, development environments, media stores, or user home directories.
Amazon FSx makes it efficient and cost effective to launch and run popular file systems so
you can leverage the rich feature sets and fast performance of widely used open source and
commercially-licensed file systems.
PERF 3: How do you store, manage, and access data in your workload?
The most efficient storage solution for a system varies based on the kind of access operation
(block, file, or object), patterns of access (random or sequential), required throughput, frequency
of access (online, offline, archival), frequency of update (WORM, dynamic), and availability and
durability constraints. Well-architected systems use multiple storage solutions and turn on
different features to improve performance and use resources efficiently.
key metrics are capturing time-to-first-byte or rendering. Other generally applicable metrics
include thread count, garbage collection rate, and wait states. Business metrics, such as the
aggregate cumulative cost per request, can alert you to ways to drive down costs. Carefully
consider how you plan to interpret metrics. For example, you could choose the maximum or 99th
percentile instead of the average.
• Load generation: You should create a series of test scripts that replicate synthetic or prerecorded
user journeys. These scripts should be idempotent and not coupled, and you might need to
include pre-warming scripts to yield valid results. As much as possible, your test scripts should
replicate the behavior of usage in production. You can use software or software-as-a-service
(SaaS) solutions to generate the load. Consider using AWS Marketplace solutions and Spot
Instances — they can be cost-effective ways to generate the load.
• Performance visibility: Key metrics should be visible to your team, especially metrics against
each build version. This allows you to see any significant positive or negative trend over time.
You should also display metrics on the number of errors or exceptions to make sure you are
testing a working system.
• Visualization: Use visualization techniques that make it clear where performance issues, hot
spots, wait states, or low utilization is occurring. Overlay performance metrics over architecture
diagrams — call graphs or code can help identify issues quickly.
• Regular review process: A poorly performing architecture is usually the result of a nonexistent or broken performance review process. If your architecture is performing poorly, implementing a
performance review process allows you to drive iterative improvement.
The cost optimization pillar provides an overview of design principles, best practices, and
questions. You can find prescriptive guidance on implementation in the Cost Optimization Pillar
whitepaper.
Topics
• Design principles
• Definition
• Best practices
• Resources
Design principles
There are five design principles for cost optimization in the cloud:
• Implement Cloud Financial Management: To achieve financial success and accelerate business
value realization in the cloud, invest in Cloud Financial Management and Cost Optimization.
Your organization should dedicate time and resources to build capability in this new domain
of technology and usage management. Similar to your Security or Operational Excellence
capability, you need to build capability through knowledge building, programs, resources, and
processes to become a cost-efficient organization.
• Adopt a consumption model: Pay only for the computing resources that you require and
increase or decrease usage depending on business requirements, not by using elaborate
forecasting. For example, development and test environments are typically only used for eight
hours a day during the work week. You can stop these resources when they are not in use for a
potential cost savings of 75% (40 hours versus 168 hours).
• Measure overall efficiency: Measure the business output of the workload and the costs
associated with delivering it. Use this measure to know the gains you make from increasing
output and reducing costs.
• Stop spending money on undifferentiated heavy lifting: AWS does the heavy lifting of data
center operations like racking, stacking, and powering servers. It also removes the operational
burden of managing operating systems and applications with managed services. This permits
you to focus on your customers and business projects rather than on IT infrastructure.
• Analyze and attribute expenditure: The cloud makes it simple to accurately identify the
usage and cost of systems, which then permits transparent attribution of IT costs to individual
With the adoption of cloud, technology teams innovate faster due to shortened approval,
procurement, and infrastructure deployment cycles. A new approach to financial management
in the cloud is required to realize business value and financial success. This approach is Cloud
Financial Management, and builds capability across your organization by implementing
organization-wide knowledge building, programs, resources, and processes.
Many organizations are composed of many different units with different priorities. The ability to
align your organization to an agreed set of financial objectives, and provide your organization the
mechanisms to meet them, will create a more efficient organization. A capable organization will
innovate and build faster, be more agile and adjust to any internal or external factors.
In AWS you can use Cost Explorer, and optionally Amazon Athena and Amazon QuickSight
with the Cost and Usage Report (CUR), to provide cost and usage awareness throughout your
organization. AWS Budgets provides proactive notifications for cost and usage. The AWS blogs
provide information on new services and features to verify you keep up to date with new service
releases.
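A minimal sketch of building that cost awareness programmatically with the Cost Explorer API, grouping monthly unblended cost by service; the date range is an illustrative assumption.

import boto3

ce = boto3.client("ce")

# Illustrative date range; group monthly unblended cost by service for awareness reporting.
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-04-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

for period in response["ResultsByTime"]:
    print(period["TimePeriod"]["Start"])
    for group in period["Groups"]:
        amount = group["Metrics"]["UnblendedCost"]["Amount"]
        print(f"  {group['Keys'][0]}: ${float(amount):.2f}")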
The following question focuses on these considerations for cost optimization. (For a list of cost
optimization questions and best practices, see the Appendix.)
Implementing Cloud Financial Management helps organizations realize business value and
financial success as they optimize their cost and usage and scale on AWS.
When building a cost optimization function, use members from across your organization and supplement the team with experts in CFM and cost optimization. Existing team members will understand how the organization
currently functions and how to rapidly implement improvements. Also consider including people
with supplementary or specialist skill sets, such as analytics and project management.
When implementing cost awareness in your organization, improve or build on existing programs
and processes. It is much faster to add to what exists than to build new processes and programs.
This will result in achieving outcomes much faster.
Implement change control and resource management from project inception to end-of-life. This
facilitates shutting down unused resources to reduce waste.
You can use cost allocation tags to categorize and track your AWS usage and costs. When you apply
tags to your AWS resources (such as EC2 instances or S3 buckets), AWS generates a cost and usage
report with your usage and your tags. You can apply tags that represent organization categories
(such as cost centers, workload names, or owners) to organize your costs across multiple services.
Verify that you use the right level of detail and granularity in cost and usage reporting and
monitoring. For high level insights and trends, use daily granularity with AWS Cost Explorer. For
deeper analysis and inspection use hourly granularity in AWS Cost Explorer, or Amazon Athena and
Amazon QuickSight with the Cost and Usage Report (CUR) at an hourly granularity.
Combining tagged resources with entity lifecycle tracking (employees, projects) makes it
possible to identify orphaned resources or projects that are no longer generating value to the
organization and should be decommissioned. You can set up billing alerts to notify you of
predicted overspending.
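One hedged sketch of such an alert, using AWS Budgets to notify a hypothetical email address when forecasted monthly cost exceeds 80% of an assumed 1,000 USD budget:

import boto3

budgets = boto3.client("budgets")
account_id = boto3.client("sts").get_caller_identity()["Account"]

# Illustrative budget: alert when forecasted spend exceeds 80% of 1,000 USD per month.
budgets.create_budget(
    AccountId=account_id,
    Budget={
        "BudgetName": "example-monthly-cost-budget",
        "BudgetLimit": {"Amount": "1000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "FORECASTED",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "finops@example.com"}  # illustrative
            ],
        }
    ],
)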
Cost-effective resources
Using the appropriate instances and resources for your workload is key to cost savings. For
example, a reporting process might take five hours to run on a smaller server but one hour to run
on a larger server that is twice as expensive. Both servers give you the same outcome, but the
smaller server incurs more cost over time.
A well-architected workload uses the most cost-effective resources, which can have a significant
and positive economic impact. You also have the opportunity to use managed services to reduce
costs. For example, rather than maintaining servers to deliver email, you can use a service that
charges on a per-message basis.
AWS offers a variety of flexible and cost-effective pricing options to acquire instances from Amazon
EC2 and other services in a way that more effectively fits your needs. On-Demand Instances
permit you to pay for compute capacity by the hour, with no minimum commitments required.
Savings Plans and Reserved Instances offer savings of up to 75% off On-Demand pricing. With Spot
Instances, you can leverage unused Amazon EC2 capacity and offer savings of up to 90% off On-
By factoring in cost during service selection, and using tools such as Cost Explorer and AWS Trusted
Advisor to regularly review your AWS usage, you can actively monitor your utilization and adjust
your deployments accordingly.
When you move to the cloud, you pay only for what you need. You can supply resources to match the workload demand at the time they’re needed, which decreases the need for costly and wasteful overprovisioning. You can also modify the demand, using a throttle, buffer, or queue to smooth the demand and serve it with fewer resources, resulting in a lower cost, or process it at a later time with a batch service.
In AWS, you can automatically provision resources to match the workload demand. Auto Scaling
using demand or time-based approaches permits you to add and remove resources as needed. If
you can anticipate changes in demand, you can save more money and validate that your resources
match your workload needs. You can use Amazon API Gateway to implement throttling, or Amazon SQS to implement a queue in your workload. These will both permit you to modify the demand on your workload components.
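A minimal sketch of queue-based demand smoothing with Amazon SQS, assuming a hypothetical queue URL: producers enqueue work as it arrives, and a consumer drains it at a rate the downstream resources can sustain.

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-work-queue"  # illustrative

def process(body: str) -> None:
    # Placeholder for the real work performed per message.
    print(f"processing: {body}")

def enqueue(job_payload: str) -> None:
    # Producers absorb demand spikes by buffering work in the queue.
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=job_payload)

def drain_once(batch_size: int = 10) -> int:
    # The consumer processes at its own pace, smoothing demand on downstream resources.
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=batch_size, WaitTimeSeconds=20
    )
    messages = response.get("Messages", [])
    for message in messages:
        process(message["Body"])
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
    return len(messages)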
For a workload that has balanced spend and performance, verify that everything you pay for
is used and avoid significantly underutilizing instances. A skewed utilization metric in either
direction has an adverse impact on your organization, in either operational costs (degraded
performance due to over-utilization), or wasted AWS expenditures (due to over-provisioning).
When designing to modify demand and supply resources, actively think about the patterns of
usage, the time it takes to provision new resources, and the predictability of the demand pattern.
When managing demand, verify you have a correctly sized queue or buffer, and that you are
responding to workload demand in the required amount of time.
As AWS releases new services and features, it's a best practice to review your existing architectural
decisions to verify they continue to be the most cost effective. As your requirements change, be
aggressive in decommissioning resources, entire services, and systems that you no longer require.
Sustainability
The Sustainability pillar focuses on environmental impacts, especially energy consumption
and efficiency, since they are important levers for architects to inform direct action to reduce
resource usage. You can find prescriptive guidance on implementation in the Sustainability Pillar
whitepaper.
Topics
• Design principles
• Definition
• Best practices
• Resources
Design principles
There are six design principles for sustainability in the cloud:
• Understand your impact: Measure the impact of your cloud workload and model the future
impact of your workload. Include all sources of impact, including impacts resulting from
customer use of your products, and impacts resulting from their eventual decommissioning and
retirement. Compare the productive output with the total impact of your cloud workloads by
reviewing the resources and emissions required per unit of work. Use this data to establish key
performance indicators (KPIs), evaluate ways to improve productivity while reducing impact, and
estimate the impact of proposed changes over time.
• Establish sustainability goals: For each cloud workload, establish long-term sustainability
goals such as reducing the compute and storage resources required per transaction. Model the
return on investment of sustainability improvements for existing workloads, and give owners the
resources they must invest in sustainability goals. Plan for growth, and architect your workloads
so that growth results in reduced impact intensity measured against an appropriate unit, such
as per user or per transaction. Goals help you support the wider sustainability goals of your
business or organization, identify regressions, and prioritize areas of potential improvement.
• Maximize utilization: Right-size workloads and implement efficient design to verify high
utilization and maximize the energy efficiency of the underlying hardware. Two hosts running
at 30% utilization are less efficient than one host running at 60% due to baseline power
consumption per host. At the same time, reduce or minimize idle resources, processing, and
storage to reduce the total energy required to power your workload.
Best practices
Topics
• Region selection
• Alignment to demand
• Software and architecture
• Data management
• Hardware and services
• Process and culture
Region selection
The choice of Region for your workload significantly affects its KPIs, including performance, cost,
and carbon footprint. To improve these KPIs, you should choose Regions for your workloads based
on both business requirements and sustainability goals.
The following question focuses on these considerations for sustainability. (For a list of
sustainability questions and best practices, see the Appendix.)
The choice of Region for your workload significantly affects its KPIs, including performance, cost, and carbon footprint. To improve these KPIs, you should choose Regions for your workloads based on both business requirements and sustainability goals.
Alignment to demand
The way users and applications consume your workloads and other resources can help you identify
improvements to meet sustainability goals. Scale infrastructure to continually match demand and
verify that you use only the minimum resources required to support your users. Align service levels
to customer needs. Position resources to limit the network required for users and applications to
consume them. Remove unused assets. Provide your team members with devices that support their
needs and minimize their sustainability impact.
lack of use because of changes in user behavior over time. Revise patterns and architecture to
consolidate under-utilized components to increase overall utilization. Retire components that are
no longer required. Understand the performance of your workload components, and optimize the
components that consume the most resources. Be aware of the devices that your customers use to
access your services, and implement patterns to minimize the need for device upgrades.
SUS 3: How do you take advantage of software and architecture patterns to support your
sustainability goals?
Implement patterns for performing load smoothing and maintaining consistent high utilization of deployed resources to minimize the resources consumed. Components might become idle from lack of use because of changes in user behavior over time. Revise patterns and architecture to consolidate under-utilized components to increase overall utilization. Retire components that are no longer required. Understand the performance of your workload components, and optimize the components that consume the most resources. Be aware of the devices that your customers use to access your services, and implement patterns to minimize the need for device upgrades.
Optimize software and architecture for asynchronous and scheduled jobs: Use efficient software
designs and architectures to minimize the average resources required per unit of work. Implement
mechanisms that result in even utilization of components to reduce resources that are idle between
tasks and minimize the impact of load spikes.
Remove or refactor workload components with low or no use: Monitor workload activity to identify
changes in utilization of individual components over time. Remove components that are unused
and no longer required, and refactor components with little utilization, to limit wasted resources.
Optimize areas of code that consume the most time or resources: Monitor workload activity to
identify application components that consume the most resources. Optimize the code that runs
within these components to minimize resource usage while maximizing performance.
Optimize impact on customer devices and equipment: Understand the devices and equipment
that your customers use to consume your services, their expected lifecycle, and the financial
and sustainability impact of replacing those components. Implement software patterns and
architectures to minimize the need for customers to replace devices and upgrade equipment. For
example, implement new features using code that is backward compatible with earlier hardware
Remove unneeded or redundant data: Duplicate data only when necessary to minimize total
storage consumed. Use backup technologies that deduplicate data at the file and block level. Limit
the use of Redundant Array of Independent Drives (RAID) configurations except where required to
meet SLAs.
Use shared file systems or object storage to access common data: Adopt shared storage and
single sources of truth to avoid data duplication and reduce the total storage requirements
of your workload. Fetch data from shared storage only as needed. Detach unused volumes to release resources.

Minimize data movement across networks: Use shared storage and access data
from Regional data stores to minimize the total networking resources required to support data
movement for your workload.
Back up data only when difficult to recreate: To minimize storage consumption, only back up data
that has business value or is required to satisfy compliance requirements. Examine backup policies
and exclude ephemeral storage that doesn’t provide value in a recovery scenario.
Look for opportunities to reduce workload sustainability impacts by making changes to your
hardware management practices. Minimize the amount of hardware needed to provision and
deploy, and select the most efficient hardware and services for your individual workload.
SUS 5: How do you select and use cloud hardware and services in your architecture to
support your sustainability goals?
Look for opportunities to reduce workload sustainability impacts by making changes to your
hardware management practices. Minimize the amount of hardware needed to provision and
deploy, and select the most efficient hardware and services for your individual workload.
Use the minimum amount of hardware to meet your needs: Using the capabilities of the cloud, you
can make frequent changes to your workload implementations. Update deployed components as
your needs change.
Use instance types with the least impact: Continually monitor the release of new instance types
and take advantage of energy efficiency improvements, including those instance types designed to
support specific workloads such as machine learning training and inference, and video transcoding.
Use managed device farms for testing: Managed device farms spread the sustainability impact
of hardware manufacturing and resource usage across multiple tenants. Managed device farms
offer diverse device types so you can support earlier, less popular hardware, and avoid customer
sustainability impact from unnecessary device upgrades.
Resources
Refer to the following resources to learn more about our best practices for sustainability.
Whitepaper
• Sustainability Pillar
Video
After you have done a review, you should have a list of issues that you can prioritize based on your
business context. You will also want to take into account the impact of those issues on the day-to-
day work of your team. If you address these issues early, you could free up time to work on creating
business value rather than solving recurring problems. As you address issues, you can update your
review to see how the architecture is improving.
While the value of a review is clear after you have done one, you may find that a new team might
be resistant at first. Here are some objections that can be handled through educating the team on
the benefits of a review:
• “We are too busy!” (Often said when the team is getting ready for a significant launch.)
• If you are getting ready for a big launch, you will want it to go smoothly. The review will
permit you to understand any problems you might have missed.
• We recommend that you carry out reviews early in the product lifecycle to uncover risks and
develop a mitigation plan aligned with the feature delivery roadmap.
• “We don’t have time to do anything with the results!” (Often said when there is an immovable
event, such as the Super Bowl, that they are targeting.)
• These events can’t be moved. Do you really want to go into it without knowing the risks in
your architecture? Even if you don’t address all of these issues you can still have playbooks for
handling them if they materialize.
• “We don’t want others to know the secrets of our solution implementation!”
• If you point the team at the questions in the Well-Architected Framework, they will see that
none of the questions reveal any commercial or technical proprietary information.
As you carry out multiple reviews with teams in your organization, you might identify thematic
issues. For example, you might see that a group of teams has clusters of issues in a particular
pillar or topic. You will want to look at all your reviews in a holistic manner, and identify any
mechanisms, training, or principal engineering talks that could help address those thematic issues.
Contributors
The following individuals and organizations contributed to this document:
Document revisions
To be notified about updates to this whitepaper, subscribe to the RSS feed.
Updates for new Framework: Best practices updated with prescriptive guidance and new best practices added. New questions added to the Security and Cost Optimization pillars. (April 10, 2023)
Note
To subscribe to RSS updates, you must have an RSS plugin enabled for the browser that
you are using.
Framework versions:
• 2023-10-03 (current)
• 2023-04-10
• 2022-03-31
Everyone should understand their part in enabling business success. Have shared goals in order to
set priorities for resources. This will maximize the benefits of your efforts.
Best practices
• OPS01-BP01 Evaluate customer needs
• OPS01-BP02 Evaluate internal customer needs
• OPS01-BP03 Evaluate governance requirements
• OPS01-BP04 Evaluate compliance requirements
• OPS01-BP05 Evaluate threat landscape
• OPS01-BP06 Evaluate tradeoffs while managing benefits and risks
Involve key stakeholders, including business, development, and operations teams, to determine
where to focus efforts on external customer needs. This verifies that you have a thorough
understanding of the operations support that is required to achieve your desired business
outcomes.
Desired outcome:
Common anti-patterns:
• You have decided not to have customer support outside of core business hours, but you haven't
reviewed historical support request data. You do not know whether this will have an impact on
your customers.
• You are developing a new feature but have not engaged your customers to find out if it is desired and, if so, in what form, and you have not experimented to validate the need and the method of delivery.
• You have decided to change IP address allocations for your product teams, without consulting
them, to make managing your network easier. You do not know the impact this will have on your
product teams.
• You are implementing a new development tool but have not engaged your internal customers to
find out if it is needed or if it is compatible with their existing practices.
• You are implementing a new monitoring system but have not contacted your internal customers
to find out if they have monitoring or reporting needs that should be considered.
Benefits of establishing this best practice: Evaluating and understanding internal customer needs
informs how you prioritize your efforts to deliver business value.
Implementation guidance
• Understand business needs: Business success is created by shared goals and understanding
across stakeholders including business, development, and operations teams.
• Review business goals, needs, and priorities of internal customers: Engage key stakeholders,
including business, development, and operations teams, to discuss goals, needs, and priorities
of internal customers. This ensures that you have a thorough understanding of the operational
support that is required to achieve business and customer outcomes.
Resources
Governance is the set of policies, rules, or frameworks that a company uses to achieve its business
goals. Governance requirements are generated from within your organization. They can affect the
types of technologies you choose or influence the way you operate your workload. Incorporate
instances. If teams need system access, they are required to use AWS Systems Manager Session
Manager. The cloud operations team regularly updates governance requirements as new services
become available.
Implementation steps
1. Identify the stakeholders for your workload, including any centralized teams.
2. Work with stakeholders to identify governance requirements.
3. Once you’ve generated a list, prioritize the improvement items, and begin implementing them
into your workload.
a. Use services like AWS Config to create governance-as-code and validate that governance
requirements are followed.
b. If you use AWS Organizations, you can use service control policies (SCPs) to implement governance requirements (a sketch follows these steps).
4. Provide documentation that validates the implementation.
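As a sketch of step 3b, a service control policy can be created and attached with the AWS Organizations API. The policy content and the organizational unit ID below are placeholders, and the CloudTrail guardrail is only an example of a governance requirement, not a recommendation.

import json
import boto3

org = boto3.client("organizations")

# Hypothetical guardrail: prevent member accounts from disabling audit logging.
scp = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyStoppingCloudTrail",
            "Effect": "Deny",
            "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
            "Resource": "*",
        }
    ],
}

policy = org.create_policy(
    Name="deny-cloudtrail-tampering",
    Description="Governance requirement: audit logging must stay enabled",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(scp),
)

# Attach the policy to an organizational unit (the OU ID is a placeholder).
org.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId="ou-examplerootid111-exampleouid111",
)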
Level of effort for the implementation plan: Medium. Implementing missing governance
requirements may result in rework of your workload.
Resources
• OPS01-BP04 Evaluate compliance requirements - Compliance is like governance but comes from
outside an organization.
Related documents:
Related videos:
• AWS Management and Governance: Configuration, Compliance, and Audit - AWS Online Tech
Talks
• Your software developers and architects are unaware of the compliance framework that your
organization must adhere to.
• The yearly System and Organization Controls (SOC) 2 Type II audit is happening soon, and you are unable to verify that controls are in place.
• Evaluating and understanding the compliance requirements that apply to your workload will
inform how you prioritize your efforts to deliver business value.
• You choose the right locations and technologies that are congruent with your compliance
framework.
• Designing your workload for auditability helps you to prove you are adhering to your compliance
framework.
Implementation guidance
Implementing this best practice means that you incorporate compliance requirements into your
architecture design process. Your team members are aware of the required compliance framework.
You validate compliance in line with the framework.
Customer example
AnyCompany Retail stores credit card information for customers. Developers on the card storage
team understand that they need to comply with the PCI-DSS framework. They’ve taken steps
to verify that credit card information is stored and accessed securely in line with the PCI-DSS
framework. Every year they work with their security team to validate compliance.
Implementation steps
1. Work with your security and governance teams to determine what industry, regulatory, or
internal compliance frameworks that your workload must adhere to. Incorporate the compliance
frameworks into your workload.
a. Validate continual compliance of AWS resources with services like AWS Config and AWS Security Hub.
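To illustrate step 1a, failing compliance findings can be retrieved from AWS Security Hub for review. This minimal boto3 sketch assumes Security Hub and at least one security standard are already enabled in the account.

import boto3

securityhub = boto3.client("securityhub")

# Pull currently failing, active compliance findings so the team can review them
# against the applicable framework.
paginator = securityhub.get_paginator("get_findings")
pages = paginator.paginate(
    Filters={
        "ComplianceStatus": [{"Value": "FAILED", "Comparison": "EQUALS"}],
        "RecordState": [{"Value": "ACTIVE", "Comparison": "EQUALS"}],
    }
)

for page in pages:
    for finding in page["Findings"]:
        print(finding["Title"], finding.get("Compliance", {}).get("Status"))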
• AWS re:Invent 2020: Achieve compliance as code using AWS Config
• AWS re:Invent 2021 - Cloud compliance, assurance, and auditing
• AWS Summit ATL 2022 - Implementing compliance, assurance, and auditing on AWS (COP202)
Related examples:
Related services:
• AWS Artifact
• AWS Audit Manager
• AWS Config
• AWS Security Hub
Evaluate threats to the business (for example, competition, business risk and liabilities, operational
risks, and information security threats) and maintain current information in a risk registry. Include
the impact of risks when determining where to focus efforts.
AWS customers are eligible for a guided Well-Architected Review of their mission-critical workloads
to measure their architectures against AWS best practices. Enterprise Support customers are
eligible for an Operations Review, designed to help them to identify gaps in their approach to
operating in the cloud.
The cross-team engagement of these reviews helps to establish common understanding of your
workloads and how team roles contribute to success. The needs identified through the review can
help shape your priorities.
• Maintain a threat model: Establish and maintain a threat model identifying potential threats,
planned and in place mitigations, and their priority. Review the probability of threats manifesting
as incidents, the cost to recover from those incidents and the expected harm caused, and the cost
to prevent those incidents. Revise priorities as the contents of the threat model change.
Resources
Related documents:
Related videos:
Competing interests from multiple parties can make it challenging to prioritize efforts, build
capabilities, and deliver outcomes aligned with business strategies. For example, you may be asked
to accelerate speed-to-market for new features over optimizing IT infrastructure costs. This can
put two interested parties in conflict with one another. In these situations, decisions need to be
brought to a higher authority to resolve conflict. Data is required to remove emotional attachment
from the decision-making process.
The same challenge may occur at a tactical level. For example, the choice between using relational
or non-relational database technologies can have a significant impact on the operation of an
application. It's critical to understand the predictable results of various choices.
AWS can help you educate your teams about AWS and its services to increase their understanding
of how their choices can have an impact on your workload. Use the resources provided by AWS
Support (AWS Knowledge Center, AWS Discussion Forums, and AWS Support Center) and AWS
Documentation to educate your teams. For further questions, reach out to AWS Support.
Implementation guidance
Managing benefits and risks should be defined by a governing body that drives the requirements
for key decision-making. You want decisions to be made and prioritized based on how they
benefit the organization, with an understanding of the risks involved. Accurate information is
critical for making organizational decisions. This should be based on solid measurements and defined by common industry practices of cost-benefit analysis. To make these types of decisions,
strike a balance between centralized and decentralized authority. There is always a tradeoff, and
it's important to understand how each choice impacts defined strategies and desired business
outcomes.
Implementation steps
goals. Understanding responsibility, ownership, how decisions are made, and who has authority to
make decisions will help focus efforts and maximize the benefits from your teams.
Best practices
• OPS02-BP01 Resources have identified owners
• OPS02-BP02 Processes and procedures have identified owners
• OPS02-BP03 Operations activities have identified owners responsible for their performance
• OPS02-BP04 Mechanisms exist to manage responsibilities and ownership
• OPS02-BP05 Mechanisms exist to request additions, changes, and exceptions
• OPS02-BP06 Responsibilities between teams are predefined or negotiated
Resources for your workload must have identified owners for change control, troubleshooting,
and other functions. Owners are assigned for workloads, accounts, infrastructure, platforms, and
applications. Ownership is recorded using tools like a central register or metadata attached to
resources. The business value of components informs the processes and procedures applied to
them.
Desired outcome:
Common anti-patterns:
• The alternate contacts for your AWS accounts are not populated.
• Resources lack tags that identify what teams own them.
• You have an ITSM queue without an email mapping.
• Two teams have overlapping ownership of a critical piece of infrastructure.
a. You can use AWS Config rules to enforce that resources have the required ownership tags (a sketch follows these steps).
b. For in-depth guidance on how to build a tagging strategy for your organization, see the AWS Tagging Best Practices whitepaper.
4. Use Amazon Q Business, a conversational assistant that uses generative AI to enhance workforce
productivity, answer questions, and complete tasks based on information in your enterprise
systems.
a. Connect Amazon Q Business to your company's data source. Amazon Q Business offers
prebuilt connectors to over 40 supported data sources, including Amazon Simple Storage
Service (Amazon S3), Microsoft SharePoint, Salesforce, and Atlassian Confluence. For more
information, see Amazon Q Business connectors.
5. For other resources, platforms, and infrastructure, create documentation that identifies
ownership. This should be accessible to all team members.
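The rule referenced in step 3a might look like the following, which registers the AWS Config managed rule REQUIRED_TAGS through boto3. The rule name and the Owner tag key are assumptions for illustration.

import json
import boto3

config = boto3.client("config")

# Managed rule that flags resources missing an "Owner" tag (the tag key is an assumption).
config.put_config_rule(
    ConfigRule={
        "ConfigRuleName": "resources-must-have-owner-tag",
        "Description": "Flag resources that do not carry an Owner tag",
        "Source": {"Owner": "AWS", "SourceIdentifier": "REQUIRED_TAGS"},
        "InputParameters": json.dumps({"tag1Key": "Owner"}),
    }
)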
Level of effort for the implementation plan: Low. Leverage account contact information and tags
to assign ownership of AWS resources. For other resources you can use something as simple as a
table in a wiki to record ownership and contact information, or use an ITSM tool to map ownership.
Resources
Related documents:
Implementation guidance
• Processes and procedures have identified owners who are responsible for their definition.
• Identify the operations activities conducted in support of your workloads. Document these
activities in a discoverable location.
• Uniquely identify the individual or team responsible for the specification of an activity. They
are responsible to verify that it can be successfully performed by an adequately skilled team
member with the correct permissions, access, and tools. If there are issues with performing
that activity, the team members performing it are responsible for providing the detailed
feedback necessary for the activity to be improved.
• Capture ownership in the metadata of the activity artifact, for example in procedures automated through AWS Systems Manager documents or AWS Lambda functions. Capture resource ownership using tags or resource groups, specifying ownership and contact information. Use AWS Organizations to create tagging policies and capture ownership and contact information.
• Over time, these procedures should be evolved to be runnable as code, reducing the need for human intervention.
• For example, consider AWS Lambda functions, AWS CloudFormation templates, or AWS Systems Manager Automation documents (a sketch follows this list).
• Perform version control in appropriate repositories.
• Include suitable resource tagging so owners and documentation can readily be identified.
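As a sketch of evolving a procedure into code, the following registers a minimal AWS Systems Manager Automation runbook with boto3 and records ownership as a tag on the document. The runbook content and names are illustrative only.

import json
import boto3

ssm = boto3.client("ssm")

# Minimal Automation runbook that stops a given EC2 instance; ownership is captured
# as a tag on the document itself.
runbook = {
    "schemaVersion": "0.3",
    "description": "Stop an instance (illustrative procedure-as-code example)",
    "parameters": {"InstanceId": {"type": "String"}},
    "mainSteps": [
        {
            "name": "StopInstance",
            "action": "aws:changeInstanceState",
            "inputs": {"InstanceIds": ["{{ InstanceId }}"], "DesiredState": "stopped"},
        }
    ],
}

ssm.create_document(
    Name="ops-stop-instance",
    DocumentType="Automation",
    DocumentFormat="JSON",
    Content=json.dumps(runbook),
    Tags=[{"Key": "Owner", "Value": "cloud-operations"}],
)

Keeping the runbook definition itself under version control, alongside this registration step, satisfies the version-control and tagging points above.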
Customer example
Resources
Related documents:
Related workshops:
Related videos:
• You understand who is responsible to perform an activity, who to notify when action is needed,
and who performs the action, validates the result, and provides feedback to the owner of the
activity.
• Processes and procedures boost your efforts to operate your workloads.
• New team members become effective more quickly.
• You reduce the time it takes to mitigate incidents.
• Different teams use the same processes and procedures to perform tasks in a consistent manner.
• Teams can scale with repeatable processes.
• Standardized processes and procedures help mitigate the impact of transferring workload responsibilities between teams.
Implementation guidance
To begin to define responsibilities, start with existing documentation, like responsibility matrices,
processes and procedures, roles and responsibilities, and tools and automation. Review and host
discussions on the responsibilities for documented processes. Review with teams to identify misalignments between documented responsibilities and actual processes. Discuss the services offered with each team's internal customers to identify expectation gaps between teams.
Analyze and address the discrepancies. Identify opportunities for improvement, and look for
frequently requested, resource-intensive activities, which are typically strong candidates for
improvement. Explore best practices, patterns, and prescriptive guidance to simplify and
standardize improvements. Record improvement opportunities, and track the improvements to
completion.
Over time, these procedures should be evolved to be run as code, reducing the need for
human intervention. For example, procedures can be initiated as AWS Lambda functions, AWS
CloudFormation templates, or AWS Systems Manager Automation documents. Verify that these
procedures are version-controlled in appropriate repositories, and include suitable resource
tagging so that teams can readily identify owners and documentation. Document the responsibility
for carrying out the activities, and then monitor the automations for successful initiation and
operation, as well as performance of the desired outcomes.
Customer example
Resources
Related documents:
Related videos:
• Roles, responsibilities, and escalation paths are not discoverable, and they are not readily
available when required (for example, in response to an incident).
• When you understand who has responsibility or ownership, you can contact the proper team or
team member to make a request or transition a task.
• To reduce the risk of inaction and unaddressed needs, you have identified a person who has the
authority to assign responsibility or ownership.
• When you clearly define the scope of a responsibility, your team members gain autonomy and
ownership.
• Your responsibilities inform the decisions you make, the actions you take, and your handoff
activities to their proper owners.
• It's easy to identify abandoned responsibilities because you have a clear understanding of what
falls outside of your team's responsibility, which helps you escalate for clarification.
• Teams avoid confusion and tension, and they can more adequately manage their workloads and
resources.
Implementation guidance
Identify team members' roles and responsibilities, and verify that they understand the expectations
of their role. Make this information discoverable so that members of your organization can identify
who they need to contact for specific needs, whether it's a team or individual. As organizations
seek to capitalize on the opportunities to migrate and modernize on AWS, roles and responsibilities
might also change. Keep your teams and their members aware of their responsibilities, and train
them appropriately to carry out their tasks during this change.
Determine the role or team that should receive escalations to identify responsibility and ownership.
This team can engage with various stakeholders to come to a decision. However, they should own
the management of the decision making process.
Provide accessible mechanisms for members of your organization to discover and identify
ownership and responsibility. These mechanisms teach them who to contact for specific needs.
Customer example
Resources
Related documents:
Implementation guidance
To implement this best practice, you need to be able to request changes to processes, procedures,
and resources. The change management process can be lightweight. Document the change
management process.
Customer example
AnyCompany Retail uses a responsibility assignment (RACI) matrix to identify who owns changes
for processes, procedures, and resources. They have a documented change management process
that’s lightweight and easy to follow. Using the RACI matrix and the process, anyone can submit
change requests.
Implementation steps
1. Identify the processes, procedures, and resources for your workload and the owners for each.
Document them in your knowledge management system.
a. If you have not implemented OPS02-BP01 Resources have identified owners, OPS02-BP02
Processes and procedures have identified owners, or OPS02-BP03 Operations activities have
identified owners responsible for their performance, start with those first.
2. Work with stakeholders in your organization to develop a change management process.
The process should cover additions, changes, and exceptions for resources, processes, and
procedures.
a. You can use AWS Systems Manager Change Manager as a change management platform for
workload resources.
3. Document the change management process in your knowledge management system.
Level of effort for the implementation plan: Medium. Developing a change management process
requires alignment with multiple stakeholders across your organization.
Resources
• OPS02-BP01 Resources have identified owners - Resources need identified owners before you
build a change management process.
• The operations team needs assistance from the development team, but there is no agreed-upon response time. The request is stuck in the backlog.
Implementation guidance
Implementing this best practice means that there is no ambiguity about how teams work with
each other. Formal agreements codify how teams work together or support each other. Inter-team
communication channels are documented.
Customer example
AnyCompany Retail’s SRE team has a service level agreement with their development team.
Whenever the development team makes a request in their ticketing system, they can expect
a response within fifteen minutes. If there is a site outage, the SRE team takes the lead in the
investigation with support from the development team.
Implementation steps
1. Working with stakeholders across your organization, develop agreements between teams based
on processes and procedures.
a. If a process or procedure is shared between two teams, develop a runbook on how the teams
will work together.
b. If there are dependencies between teams, agree to a response SLA for requests.
Level of effort for the implementation plan: Medium. If there are no existing agreements
between teams, it can take effort to come to agreement with stakeholders across your
organization.
Common anti-patterns:
• There is a mandate for workload owners to migrate workloads to AWS without a clear sponsor
and plan for cloud operations. This results in teams not consciously collaborating to improve
and mature their operational capabilities. A lack of operational best practice standards overwhelms teams with operator toil, on-call burden, and technical debt, which constrains innovation.
• A new organization-wide goal has been set to adopt an emerging technology without providing a leadership sponsor or strategy. Teams interpret goals differently, which causes confusion
on where to focus efforts, why they matter, and how to measure impact. Consequently, the
organization loses momentum in adopting the technology.
Benefits of establishing this best practice: When executive sponsorship clearly communicates and
shares vision, direction, and goals, team members know what is expected of them. Individuals and
teams begin to intensely focus effort in the same direction to accomplish defined objectives when
leaders are actively engaged. As a result, the organization maximizes its ability to succeed. When
you evaluate success, you can better identify barriers to success so that they can be addressed
through intervention by the executive sponsor.
Implementation guidance
• At every phase of the cloud journey (migration, adoption, or optimization), success requires
active involvement at the highest level of leadership with a designated executive sponsor. The
executive sponsor aligns the team's mindset, skillsets, and ways of working to the defined
strategy.
• Explain the why: Bring clarity and explain the reasoning behind the vision and strategy.
• Set expectations: Define and publish goals for your organizations, including how progress and
success are measured.
c. Communicate the vision consistently to all teams and individuals responsible for parts of the
strategy.
4. Develop communication planning matrices that specify what message needs to be delivered to
specified leaders, managers, and individual contributors. Specify the person or team that should
deliver this message.
c. Accept feedback on the effectiveness of communications, and adjust the communications and
plan accordingly.
5. Actively engage each initiative from a leadership perspective to verify that all impacted teams
understand the outcomes they are accountable to achieve.
6. At every status meeting, executive sponsors should look for blockers, inspect established
metrics, anecdotes, or feedback from the teams, and measure progress towards objectives.
Resources
Related documents:
is no process to track such improvements. The organization continues to be plagued with failed
deployments impacting customers and causing further negative sentiment.
• In order to stay compliant, your infosec team oversees a long-established process to rotate
shared SSH keys regularly on behalf of operators connecting to their Amazon EC2 Linux
instances. It takes several days for the infosec teams to complete rotating keys, and you are
blocked from connecting to those instances. No one inside or outside of infosec suggests using
other options on AWS to achieve the same result.
Benefits of establishing this best practice: By decentralizing authority and empowering your teams to make key decisions, you can address issues more quickly and with increasing success rates. In addition, teams begin to realize a sense of ownership, and failures are acceptable. Experimentation becomes a cultural mainstay. Managers and directors do not feel as though every aspect of their work is micromanaged.
Implementation guidance
Escalation should be done early and often so that risks can be identified and prevented from
causing incidents. Leadership does not reprimand individuals for escalating an issue.
Desired outcome: Individuals throughout the organization are comfortable to escalate problems
to their immediate and higher levels of leadership. Leadership has deliberately and consciously
established expectations that their teams should feel safe to escalate any issue. A mechanism
exists to escalate issues at each level within the organization. When employees escalate to their
manager, they jointly decide the level of impact and whether the issue should be escalated. In
order to initiate an escalation, employees are required to include a recommended work plan to
address the issue. If direct management does not take timely action, employees are encouraged to
take issues to the highest level of leadership if they feel strongly that the risks to the organization
warrant the escalation.
Common anti-patterns:
• Executive leaders do not ask enough probing questions during your cloud transformation
program status meeting to find where issues and blockers are occurring. Only good news is
presented as status. The CIO has made it clear that she only likes to hear good news, as any
challenges brought up make the CEO think that the program is failing.
• You are a cloud operations engineer and you notice that the new knowledge management
system is not being widely adopted by application teams. The company invested one year and
several million dollars to implement this new knowledge management system, but people
are still authoring their runbooks locally and sharing them on an organizational cloud share,
making it difficult to find knowledge pertinent to supported workloads. You try to bring this
to leadership's attention, because consistent use of this system can enhance operational
efficiency. When you bring this to the director who lead the implementation of the knowledge
management system, she reprimands you because it calls the investment into question.
• The infosec team responsible for hardening compute resources has decided to put a process
in place that requires performing the scans necessary to ensure that EC2 instances are fully
secured before the compute team releases the resource for use. This has created a time delay of
an additional week for resources to be deployed, which breaks their SLA. The compute team is
afraid to escalate this to the VP over cloud because this makes the VP of information security
look bad.
b. Protect employees who escalate. Have a policy that protects team members from retribution if they escalate around a non-responsive decision maker or stakeholder. Have mechanisms in
place to identify if this is occurring and respond appropriately.
6. Leadership should periodically reemphasize the policies, standards, mechanisms, and the desire
for open escalation and continuous feedback loops without retribution.
Resources
Related documents:
• How do you foster a culture of continuous improvement and learning from Andon and escalation
systems?
• AWS DevOps Guidance | Establish clear escalation paths and encourage constructive
disagreement
Related videos:
• Toyota Production System: Stopping Production, a Button, and an Andon Electric Board
Related examples:
informed of this strategic change and thus, they are not ready with enough skilled capacity to
support a greater number of workloads lifted and shifted into AWS.
• Your organization is well-informed on new or changed strategies, and they act accordingly with
strong motivation to help each other achieve the overall objectives and metrics set by leadership.
• Mechanisms exist and are used to provide timely notice to team members of known risks and
planned events.
• New ways of working (including changes to people or the organization, processes, or
technology), along with required skills, are more effectively adopted by the organization, and
your organization realizes business benefits more quickly.
• Team members have the necessary context of the communications being received, and they can
be more effective in their jobs.
Implementation guidance
To implement this best practice, you must work with stakeholders across your organization
to agree to communication standards. Publicize those standards to your organization. For any
significant IT transitions, an established planning team can more successfully manage the impact
of change to its people than an organization that ignores this practice. Managing change can be more challenging in larger organizations, because it's critical to establish strong buy-in for a new strategy across all individual contributors. In the absence of such a transition planning team,
leadership holds 100% of the responsibility for effective communications. When establishing
a transition planning team, assign team members to work with all organizational leadership to
define and manage effective communications at every level.
Customer example
AnyCompany Retail signed up for AWS Enterprise Support and depends on other third-
party providers for its cloud operations. The company uses chat and chatops as their main
communication medium for operational activities. Alerts and other information populate specific
channels. When someone must act, they clearly state the desired outcome, and in many cases, they
receive a runbook or playbook to use. They schedule major changes to production systems with a
change calendar.
a. You can use AWS Systems Manager Documents to build playbooks and runbooks for alerts.
16.Mechanisms are in place to provide notification of risks or planned events in a clear and
actionable way with enough notice to allow appropriate responses. Use email lists or chat
channels to send notifications ahead of planned events.
a. AWS Chatbot can be used to send alerts and respond to events within your organizations
messaging platform.
17.Provide an accessible source of information where planned events can be discovered. Provide
notifications of planned events from the same system.
a. AWS Systems Manager Change Calendar can be used to create change windows when
changes can occur. This provides team members notice when they can make changes safely.
a. You can subscribe to AWS Security Bulletins to receive notifications of vulnerabilities on AWS.
19.Seek diverse opinions and perspectives: Encourage contributions from everyone. Give
communication opportunities to under-represented groups. Rotate roles and responsibilities in
meetings.
a. Expand roles and responsibilities: Provide opportunities for team members to take on roles
that they might not otherwise. They can gain experience and perspective from the role and
from interactions with new team members with whom they might not otherwise interact.
They can also bring their experience and perspective to the new role and team members
they interact with. As perspective increases, identify emergent business opportunities or new
opportunities for improvement. Rotate common tasks between members within a team that
others typically perform to understand the demands and impact of performing them.
b. Provide a safe and welcoming environment: Establish policy and controls that protect the
mental and physical safety of team members within your organization. Team members should
be able to interact without fear of reprisal. When team members feel safe and welcome, they
are more likely to be engaged and productive. The more diverse your organization, the better
your understanding can be of the people you support, including your customers. When your
team members are comfortable, feel free to speak, and are confident they are heard, they
are more likely to share valuable insights (for example, marketing opportunities, accessibility
needs, unserved market segments, and unacknowledged risks in your environment).
c. Encourage team members to participate fully: Provide the resources necessary for your
employees to participate fully in all work related activities. Team members that face daily
Experimentation is a catalyst for turning new ideas into products and features. It accelerates
learning and keeps team members interested and engaged. Team members are encouraged
to experiment often to drive innovation. Even when an undesired result occurs, there is value
in knowing what not to do. Team members are not punished for successful experiments with
undesired results.
Desired outcome:
Common anti-patterns:
• You want to run an A/B test but there is no mechanism to run the experiment. You deploy a UI
change without the ability to test it. It results in a negative customer experience.
• Your company only has a stage and production environment. There is no sandbox environment
to experiment with new features or products so you must experiment within the production
environment.
Implementation guidance
Customer example
Related videos:
• AWS On Air San Fran Summit 2022 ft. AWS AppConfig Feature Flags integration with Jira
• AWS re:Invent 2022 - A deployment is not a release: Control your launches w/feature flags
(BOA305-R)
• Set Up a Multi-Account AWS Environment that Uses Best Practices for AWS Organizations
Related examples:
Related services:
• AWS AppConfig
OPS03-BP06 Team members are encouraged to maintain and grow their skill sets
Teams must grow their skill sets to adopt new technologies, and to support changes in demand
and responsibilities in support of your workloads. Growth of skills in new technologies is frequently
a source of team member satisfaction and supports innovation. Support your team members'
pursuit and maintenance of industry certifications that validate and acknowledge their growing
skills. Cross train to promote knowledge transfer and reduce the risk of significant impact when
you lose skilled and experienced team members with institutional knowledge. Provide dedicated
structured time for learning.
Implementation guidance
To adopt new technologies, fuel innovation, and keep pace with changes in demand and
responsibilities to support your workloads, continually invest in the professional growth of your
teams.
Implementation steps
1. Use structured cloud advocacy programs: AWS Skills Guild provides consultative training to increase cloud skill confidence and ignite a culture of continuous learning.
2. Provide resources for education: Provide dedicated, structured time and access to training
materials and lab resources, and support participation in conferences and professional
organizations that provide opportunities for learning from both educators and peers. Provide
your junior team members with access to senior team members as mentors, or allow the junior
team members to shadow their seniors' work and be exposed to their methods and skills.
Encourage learning about content not directly related to work in order to have a broader
perspective.
3. Encourage use of expert technical resources: Leverage resources such as AWS re:Post to get
access to curated knowledge and a vibrant community.
4. Build and maintain an up-to-date knowledge repository: Use knowledge sharing platforms
such as wikis and runbooks. Create your own reusable expert knowledge source with AWS
re:Post Private to streamline collaboration, improve productivity, and accelerate employee
onboarding.
5. Team education and cross-team engagement: Plan for the continuing education needs of your
team members. Provide opportunities for team members to join other teams (temporarily or
permanently) to share skills and best practices benefiting your entire organization.
6. Support pursuit and maintenance of industry certifications: Support your team members in
the acquisition and maintenance of industry certifications that validate what they have learned
and acknowledge their accomplishments.
Resources
• You have appropriately staffed your team to gain the skillsets needed for them to operate
workloads in AWS in accordance with your migration plan. As your team has scaled itself up
during the course of your migration project, they have gained proficiency in the core AWS
technologies that the business plans to use when migrating or modernizing their applications.
• You have carefully aligned your staffing plan to make efficient use of resources by leveraging
automation and workflow. A smaller team can now manage more infrastructure on behalf of the
application development teams.
• With shifting operational priorities, any resource staffing constraints are proactively identified to
protect the success of business initiatives.
• Operational metrics that report operational toil (such as on-call fatigue or excessive paging) are
reviewed to verify that staff are not overwhelmed.
Common anti-patterns:
• Your staff have not ramped up on AWS skills as you close in on your multi-year cloud migration
plan, which risks support of the workloads and lowers employee morale.
• Your entire IT organization is shifting into agile ways of working. The business is prioritizing
the product portfolio and setting metrics for what features need to be developed first. Your
agile process does not require teams to assign story points to their work plans. As a result, it is impossible to know the level of capacity required for the next set of work, or whether you have the right skills assigned to the work.
• You are having an AWS partner migrate your workloads, and you don't have a support transition
plan for your teams once the partner completes the migration project. Your teams struggle to
efficiently and effectively support the workloads.
Benefits of establishing this best practice: You have appropriately-skilled team members
available in your organization to support the workloads. Resource allocation can adapt to shifting
priorities without impacting performance. The result is teams being proficient at supporting
workloads while maximizing time to focus on innovating for customers, which in turn raises
employee satisfaction.
Implementation guidance
Resource planning for your cloud migration should occur at an organizational level that aligns to
your migration plan, as well as the desired operating model being implemented to support your
Prepare
Questions
• OPS 4. How do you implement observability in your workload?
• OPS 5. How do you reduce defects, ease remediation, and improve flow into production?
• OPS 6. How do you mitigate deployment risks?
• OPS 7. How do you know that you are ready to support a workload?
Implement observability in your workload so that you can understand its state and make data-
driven decisions based on business requirements.
Best practices
• OPS04-BP01 Identify key performance indicators
• OPS04-BP02 Implement application telemetry
• OPS04-BP03 Implement user experience telemetry
• OPS04-BP04 Implement dependency telemetry
• OPS04-BP05 Implement distributed tracing
Implementing observability in your workload starts with understanding its state and making
data-driven decisions based on business requirements. One of the most effective ways to ensure
alignment between monitoring activities and business objectives is by defining and monitoring key
performance indicators (KPIs).
Desired outcome: Efficient observability practices that are tightly aligned with business objectives,
ensuring that monitoring efforts are always in service of tangible business outcomes.
Common anti-patterns:
Resources
Related documents:
Related videos:
Related examples:
Application telemetry serves as the foundation for observability of your workload. It's crucial
to emit telemetry that offers actionable insights into the state of your application and the
achievement of both technical and business outcomes. From troubleshooting to measuring the
impact of a new feature or ensuring alignment with business key performance indicators (KPIs),
application telemetry informs the way you build, operate, and evolve your workload.
Metrics, logs, and traces form the three primary pillars of observability. These serve as diagnostic
tools that describe the state of your application. Over time, they assist in creating baselines and
identifying anomalies. However, to ensure alignment between monitoring activities and business
objectives, it's pivotal to define and monitor KPIs. Business KPIs often make it easier to identify
issues compared to technical metrics alone.
resources, enhancing your understanding of how your workload operates. In tandem, AWS X-Ray
lets you trace, analyze, and debug your applications, giving you a deep understanding of your
workload's behavior. With features like service maps, latency distributions, and trace timelines,
AWS X-Ray provides insights into your workload's performance and the bottlenecks affecting it.
Implementation steps
1. Identify what data to collect: Ascertain the essential metrics, logs, and traces that would offer
substantial insights into your workload's health, performance, and behavior.
2. Deploy the CloudWatch agent: The CloudWatch agent is instrumental in procuring system
and application metrics and logs from your workload and its underlying infrastructure. The
CloudWatch agent can also be used to collect OpenTelemetry or X-Ray traces and send them to
X-Ray.
3. Implement anomaly detection for logs and metrics: Use CloudWatch Logs anomaly detection
and CloudWatch Metrics anomaly detection to automatically identify unusual activities in
your application's operations. These tools use machine learning algorithms to detect and alert
on anomalies, which enhances your monitoring capabilities and speeds up response time to
potential disruptions or security threats. Set up these features to proactively manage application
health and security.
4. Secure sensitive log data: Use Amazon CloudWatch Logs data protection to mask sensitive
information within your logs. This feature helps maintain privacy and compliance through
automatic detection and masking of sensitive data before it is accessed. Implement data
masking to securely handle and protect sensitive details such as personally identifiable
information (PII).
5. Define and monitor business KPIs: Establish custom metrics that align with your business outcomes (a sketch follows these steps).
6. Instrument your application with AWS X-Ray: In addition to deploying the CloudWatch agent,
it's crucial to instrument your application to emit trace data. This process can provide further
insights into your workload's behavior and performance.
7. Standardize data collection across your application: Standardize data collection practices
across your entire application. Uniformity aids in correlating and analyzing data, providing a
comprehensive view of your application's behavior.
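The custom metric referenced in step 5 can be as simple as publishing a business KPI to Amazon CloudWatch with boto3. The namespace, metric name, and dimension below are assumptions for illustration.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish a business KPI datapoint; an alarm or dashboard can then track it alongside
# technical metrics.
cloudwatch.put_metric_data(
    Namespace="AnyCompanyRetail/Business",
    MetricData=[
        {
            "MetricName": "OrdersPlaced",
            "Dimensions": [{"Name": "Channel", "Value": "web"}],
            "Value": 1.0,
            "Unit": "Count",
        }
    ],
)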
Gaining deep insights into customer experiences and interactions with your application is crucial.
Real user monitoring (RUM) and synthetic transactions serve as powerful tools for this purpose.
RUM provides data about real user interactions granting an unfiltered perspective of user
satisfaction, while synthetic transactions simulate user interactions, helping in detecting potential
issues even before they impact real users.
Desired outcome: A holistic view of the customer experience, proactive detection of issues, and
optimization of user interactions to deliver seamless digital experiences.
Common anti-patterns:
Benefits of establishing this best practice:
• Proactive issue detection: Identify and address potential issues before they impact real users.
• Optimized user experience: Continuous feedback from RUM aids in refining and enhancing the
overall user experience.
Resources
Related documents:
Related videos:
• Optimize applications through end user insights with Amazon CloudWatch RUM
• AWS on Air ft. Real-User Monitoring for Amazon CloudWatch
Related examples:
Dependency telemetry is essential for monitoring the health and performance of the external
services and components your workload relies on. It provides valuable insights into reachability,
timeouts, and other critical events related to dependencies such as DNS, databases, or third-
party APIs. When you instrument your application to emit metrics, logs, and traces about these
dependencies, you gain a clearer understanding of potential bottlenecks, performance issues, or
failures that might impact your workload.
Desired outcome: Ensure that the dependencies your workload relies on are performing as
expected, allowing you to proactively address issues and ensure optimal workload performance.
external databases, third-party APIs, network connectivity routes to other environments, and
DNS services. The first step towards effective dependency telemetry is being comprehensive in
understanding what those dependencies are.
2. Develop a monitoring strategy: Once you have a clear picture of your external dependencies,
architect a monitoring strategy tailored to them. This involves understanding the criticality of
each dependency, its expected behavior, and any associated service-level agreements or targets
(SLA or SLTs). Set up proactive alerts to notify you of status changes or performance deviations.
3. Use network monitoring: Use Internet Monitor and Network Monitor, which provide
comprehensive insights into global internet and network conditions. These tools help you
understand and respond to outages, disruptions, or performance degradations that affect your
external dependencies.
4. Stay informed with AWS Health Dashboard: It provides alerts and remediation guidance when
AWS is experiencing events that may impact your services.
a. Monitor AWS Health events with Amazon EventBridge rules (a sketch follows these steps), or integrate programmatically with the AWS Health API to automate actions when you receive AWS Health events. These can be
general actions, such as sending all planned lifecycle event messages to a chat interface, or
specific actions, such as the initiation of a workflow in an IT service management tool.
b. If you use AWS Organizations, aggregate AWS Health events across accounts.
5. Instrument your application with AWS X-Ray: AWS X-Ray provides insights into how
applications and their underlying dependencies are performing. By tracing requests from start
to end, you can identify bottlenecks or failures in the external services or components your
application relies on.
6. Use Amazon DevOps Guru: This machine learning-driven service identifies operational issues,
predicts when critical issues might occur, and recommends specific actions to take. It's invaluable
for gaining insights into dependencies and ensuring they're not the source of operational
problems.
7. Monitor regularly: Continually monitor metrics and logs related to external dependencies. Set
up alerts for unexpected behavior or degraded performance.
8. Validate after changes: Whenever there's an update or change in any of the external
dependencies, validate their performance and check their alignment with your application's
requirements.
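The EventBridge rule mentioned in step 4a can be created with a few API calls. In this boto3 sketch, the rule name and the SNS topic ARN are placeholders; the topic is assumed to already exist, to allow EventBridge to publish to it, and to feed your chat or ticketing integration.

import json
import boto3

events = boto3.client("events")

# Route all AWS Health events in this account and Region to an SNS topic.
events.put_rule(
    Name="aws-health-to-chat",
    EventPattern=json.dumps({"source": ["aws.health"]}),
    State="ENABLED",
)
events.put_targets(
    Rule="aws-health-to-chat",
    Targets=[
        {
            "Id": "notify-ops",
            "Arn": "arn:aws:sns:us-east-1:123456789012:ops-notifications",  # placeholder
        }
    ],
)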
Desired outcome: Achieve a holistic view of requests flowing through your distributed system,
allowing for precise debugging, optimized performance, and improved user experiences.
Common anti-patterns:
• Inconsistent instrumentation: Not all services in a distributed system are instrumented for
tracing.
• Ignoring latency: Only focusing on errors and not considering the latency or gradual
performance degradations.
Benefits of establishing this best practice:
• Comprehensive system overview: Visualizing the entire path of requests, from entry to exit.
• Enhanced debugging: Quickly identifying where failures or performance issues occur.
• Improved user experience: Monitoring and optimizing based on actual user data, ensuring the
system meets real-world demands.
Implementation guidance
Begin by identifying all of the elements of your workload that require instrumentation. Once all
components are accounted for, leverage tools such as AWS X-Ray and OpenTelemetry to gather
trace data for analysis with tools like X-Ray and Amazon CloudWatch ServiceLens Map. Engage
in regular reviews with developers, and supplement these discussions with tools like Amazon
DevOps Guru, X-Ray Analytics and X-Ray Insights to help uncover deeper findings. Establish alerts
from trace data to notify when outcomes, as defined in the workload monitoring plan, are at risk.
Implementation steps
1. Adopt AWS X-Ray: Integrate X-Ray into your application to gain insights into its behavior,
understand its performance, and pinpoint bottlenecks. Utilize X-Ray Insights for automatic trace
analysis.
2. Instrument your services: Verify that every service, from an AWS Lambda function to an EC2
instance, sends trace data. The more services you instrument, the clearer the end-to-end view.
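To illustrate step 2, a Python service can emit trace data with the AWS X-Ray SDK. This sketch assumes the X-Ray daemon (or the CloudWatch agent with traces enabled) is running and reachable; the service and segment names are illustrative.

import boto3
from aws_xray_sdk.core import xray_recorder, patch_all

patch_all()  # auto-instrument supported libraries such as boto3 and requests

xray_recorder.configure(service="checkout-service")  # service name is an assumption

# Outside of Lambda or a patched web framework, open a segment explicitly.
segment = xray_recorder.begin_segment("list-buckets")
try:
    boto3.client("s3").list_buckets()  # this downstream call is recorded as a subsegment
finally:
    xray_recorder.end_segment()

In Lambda or behind a framework middleware, the segment is created for you and only the patching and configuration lines are needed.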
Related examples:
OPS 5. How do you reduce defects, ease remediation, and improve flow into
production?
Adopt approaches that improve the flow of changes into production and that allow for refactoring, fast feedback on quality, and bug fixing. These accelerate beneficial changes entering production, limit
issues deployed, and achieve rapid identification and remediation of issues introduced through
deployment activities.
Best practices
• OPS05-BP01 Use version control
• OPS05-BP02 Test and validate changes
• OPS05-BP03 Use configuration management systems
• OPS05-BP04 Use build and deployment management systems
• OPS05-BP05 Perform patch management
• OPS05-BP06 Share design standards
• OPS05-BP07 Implement practices to improve code quality
• OPS05-BP08 Use multiple environments
• OPS05-BP09 Make frequent, small, reversible changes
• OPS05-BP10 Fully automate integration and deployment
Many AWS services offer version control capabilities. Use a revision or source control system
such as AWS CodeCommit to manage code and other artifacts, such as version-controlled AWS
CloudFormation templates of your infrastructure.
Desired outcome: Your teams collaborate on code. When merged, the code is consistent and no
changes are lost. Errors are easily reverted through correct versioning.
Common anti-patterns:
Every change deployed must be tested to avoid errors in production. This best practice is focused
on testing changes from version control to artifact build. Besides application code changes, testing
should include infrastructure, configuration, security controls, and operations procedures. Testing
takes many forms, from unit tests to software component analysis (SCA). Moving tests further to the left in the software integration and delivery process results in higher certainty of artifact quality.
Your organization must develop testing standards for all software artifacts. Automated tests
reduce toil and avoid manual test errors. Manual tests may be necessary in some cases. Developers
must have access to automated test results to create feedback loops that improve software quality.
Desired outcome: Your software changes are tested before they are delivered. Developers have
access to test results and validations. Your organization has a testing standard that applies to all
software changes.
Common anti-patterns:
• You deploy a new software change without any tests. It fails to run in production, which leads to
an outage.
• New security groups are deployed with AWS CloudFormation without being tested in a pre-
production environment. The security groups make your app unreachable for your customers.
• A method is modified but there are no unit tests. The software fails when it is deployed to
production.
Benefits of establishing this best practice: The change fail rate of software deployments is reduced. Software quality is improved. Developers have increased awareness of the viability of their code. Security policies can be rolled out with confidence to support your organization's compliance. Infrastructure changes such as automatic scaling policy updates are tested in advance to meet traffic needs.
Implementation guidance
Testing is done on all changes, from application code to infrastructure, as part of your continuous
integration practice. Test results are published so that developers have fast feedback. Your
organization has a testing standard that all changes must pass.
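As an illustration, a team's testing standard might require unit tests that run on every commit in the CI pipeline. The function and tests below are a hypothetical pytest sketch; a full standard would also cover integration tests, SCA, and infrastructure tests.

```python
# Hypothetical example of a unit test suite that runs in CI on every change.
# The discount logic and thresholds are illustrative only.
import pytest

def apply_discount(price: float, percent: float) -> float:
    if not 0 <= percent <= 100:
        raise ValueError('percent must be between 0 and 100')
    return round(price * (1 - percent / 100), 2)

def test_apply_discount_happy_path():
    assert apply_discount(100.0, 15) == 85.0

def test_apply_discount_rejects_invalid_percent():
    with pytest.raises(ValueError):
        apply_discount(100.0, 150)
```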
Related documents:
Related videos:
Desired outcome: You configure, validate, and deploy as part of your continuous integration,
continuous delivery (CI/CD) pipeline. You monitor to validate configurations are correct. This
minimizes any impact to end users and customers.
Common anti-patterns:
• You manually update the web server configuration across your fleet and a number of servers
become unresponsive due to update errors.
• You manually update your application server fleet over the course of many hours. The
inconsistency in configuration during the change causes unexpected behaviors.
• Someone has updated your security groups and your web servers are no longer accessible. Without knowledge of what was changed, you spend significant time investigating the issue, extending your time to recovery.
• You push a pre-production configuration into production through CI/CD without validation. You
expose users and customers to incorrect data and services.
Benefits of establishing this best practice: Adopting configuration management systems reduces the level of effort to make and track changes, and the frequency of errors caused by manual procedures. Configuration management systems provide assurances with regard to governance, compliance, and regulatory requirements.
Implementation guidance
Configuration management systems are used to track and implement changes to application and
environment configurations. Configuration management systems are also used to reduce errors
caused by manual processes, make configuration changes repeatable and auditable, and reduce the
level of effort.
On AWS, you can use AWS Config to continually monitor your AWS resource configurations
across accounts and Regions. It helps you to track their configuration history, understand how a
configuration change would affect other resources, and audit them against expected or desired
configurations using AWS Config Rules and AWS Config Conformance Packs.
For dynamic configurations in your applications running on Amazon EC2 instances, AWS Lambda,
containers, mobile applications, or IoT devices, you can use AWS AppConfig to configure, validate,
deploy, and monitor them across your environments.
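As a sketch, an application running on Amazon EC2, Lambda, or in a container might poll AWS AppConfig for its dynamic configuration rather than hard-coding it. The application, environment, and profile identifiers below are placeholders.

```python
# Minimal sketch: retrieve a dynamic configuration from AWS AppConfig.
# Application, environment, and profile identifiers are placeholders.
import json
import boto3

appconfig = boto3.client('appconfigdata')

# Start a configuration session, then poll for the latest configuration.
session = appconfig.start_configuration_session(
    ApplicationIdentifier='example-app',
    EnvironmentIdentifier='production',
    ConfigurationProfileIdentifier='feature-flags',
)
token = session['InitialConfigurationToken']

response = appconfig.get_latest_configuration(ConfigurationToken=token)
config = response['Configuration'].read()  # empty if unchanged since the last poll
if config:
    flags = json.loads(config)
    print('Loaded configuration:', flags)
```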
Related videos:
• AWS re:Invent 2022 - Proactive governance and compliance for AWS workloads
Use build and deployment management systems. These systems reduce errors caused by manual
processes and reduce the level of effort to deploy changes.
In AWS, you can build continuous integration/continuous deployment (CI/CD) pipelines using
services such as AWS Developer Tools (for example, AWS CodeCommit, AWS CodeBuild, AWS
CodePipeline, AWS CodeDeploy, and AWS CodeStar).
Desired outcome: Your build and deployment management systems support your organization's continuous integration and continuous delivery (CI/CD) system, which provides capabilities for automating safe rollouts with the correct configurations.
Common anti-patterns:
• After compiling your code on your development system, you copy the executable onto your production systems and it fails to start. The local log files indicate that it has failed due to missing dependencies.
• You successfully build your application with new features in your development environment and
provide the code to quality assurance (QA). It fails QA because it is missing static assets.
• On Friday, after much effort, you successfully built your application manually in your
development environment including your newly coded features. On Monday, you are unable to
repeat the steps that allowed you to successfully build your application.
• You perform the tests you have created for your new release. Then you spend the next week
setting up a test environment and performing all the existing integration tests followed by
the performance tests. The new code has an unacceptable performance impact and must be
redeveloped and then retested.
Benefits of establishing this best practice: By providing mechanisms to manage build and deployment activities, you reduce the level of effort to perform repetitive tasks and free your team
Resources
Related documents:
Related videos:
• AWS re:Invent 2022 - AWS Well-Architected best practices for DevOps on AWS
Perform patch management to gain features, address issues, and remain compliant with governance. Automate patch management to reduce errors caused by manual processes, to scale, and to reduce the level of effort to patch.
Patch and vulnerability management are part of your benefit and risk management activities. It is
preferable to have immutable infrastructures and deploy workloads in verified known good states.
Where that is not viable, patching in place is the remaining option.
Amazon EC2 Image Builder provides pipelines to update machine images. As part of patch management, consider updating Amazon Machine Images (AMIs) with an AMI image pipeline or container images with a Docker image pipeline, while AWS Lambda provides patterns for custom runtimes and additional libraries to remove vulnerabilities.
You should manage updates to Amazon Machine Images for Linux or Windows Server images using
Amazon EC2 Image Builder. You can use Amazon Elastic Container Registry (Amazon ECR) with
Implementation guidance
Patch systems to remediate issues, to gain desired features or capabilities, and to remain compliant
with governance policy and vendor support requirements. In immutable systems, deploy with the
appropriate patch set to achieve the desired result. Automate the patch management mechanism
to reduce the elapsed time to patch, to avoid errors caused by manual processes, and lower the
level of effort to patch.
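For example, AWS Systems Manager Patch Manager can apply a patch baseline across a fleet. The sketch below targets instances by a hypothetical PatchGroup tag and runs the AWS-RunPatchBaseline document; in practice you would usually schedule this through a maintenance window.

```python
# Minimal sketch: trigger a patch run with Systems Manager Run Command.
# Assumes instances are managed by SSM; the PatchGroup=web tag is illustrative.
import boto3

ssm = boto3.client('ssm')

response = ssm.send_command(
    Targets=[{'Key': 'tag:PatchGroup', 'Values': ['web']}],
    DocumentName='AWS-RunPatchBaseline',
    Parameters={'Operation': ['Install']},  # use 'Scan' to report without installing
    MaxConcurrency='25%',   # patch a quarter of the fleet at a time
    MaxErrors='5%',         # stop if the error rate exceeds this threshold
)
print('Command ID:', response['Command']['CommandId'])
```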
Implementation steps
2. Choose a recipe:
5. Review settings.
Common anti-patterns:
• Two development teams have each created a user authentication service. Your users must maintain a separate set of credentials for each part of the system they want to access.
• Each team manages their own infrastructure. A new compliance requirement forces a change to
your infrastructure and each team implements it in a different way.
Benefits of establishing this best practice: Using shared standards supports the adoption of best
practices and maximizes the benefits of development efforts. Documenting and updating design
standards keeps your organization up-to-date with best practices and security and compliance
requirements.
Implementation guidance
Share existing best practices, design standards, checklists, operating procedures, guidance, and
governance requirements across teams. Have procedures to request changes, additions, and
exceptions to design standards to support improvement and innovation. Make sure teams are aware of
published content. Have a mechanism to keep design standards up-to-date as new best practices
emerge.
Customer example
AnyCompany Retail has a cross-functional architecture team that creates software architecture
patterns. This team builds the architecture with compliance and governance built in. Teams that
adopt these shared standards get the benefits of having compliance and governance built in. They
can quickly build on top of the design standard. The architecture team meets quarterly to evaluate
architecture patterns and update them if necessary.
Implementation steps
1. Identify a cross-functional team that owns developing and updating design standards. This team
should work with stakeholders across your organization to develop design standards, operating
procedures, checklists, guidance, and governance requirements. Document the design standards
and share them within your organization.
a. AWS Service Catalog can be used to create portfolios representing design standards using
infrastructure as code. You can share portfolios across accounts.
2. Have a mechanism in place to keep design standards up-to-date as new best practices are
identified.
Related examples:
Related services:
Implement practices to improve code quality and minimize defects. Some examples include test-
driven development, code reviews, standards adoption, and pair programming. Incorporate these
practices into your continuous integration and delivery process.
Desired outcome: Your organization uses best practices like code reviews or pair programming to
improve code quality. Developers and operators adopt code quality best practices as part of the
software development lifecycle.
Common anti-patterns:
• You commit code to the main branch of your application without a code review. The change
automatically deploys to production and causes an outage.
• A new application is developed without any unit, end-to-end, or integration tests. There is no
way to test the application before deployment.
• Your teams make manual changes in production to address defects. Changes do not go through
testing or code reviews and are not captured or logged through continuous integration and
delivery processes.
Benefits of establishing this best practice: By adopting practices to improve code quality, you can
help minimize issues introduced to production. Code quality facilitates the use of best practices like
pair programming, code reviews, and implementation of AI productivity tools.
Related documents:
Related videos:
Implementation guidance
Use multiple environments, and provide developers with sandbox environments that have minimized controls to aid in experimentation. Provide individual development environments to help people work in parallel, increasing development agility. Implement progressively more rigorous controls in the environments approaching production, which gives developers freedom to innovate in earlier environments with reduced risk. Use infrastructure as code and configuration management systems to deploy environments that are configured consistently with the controls present in production, so that systems operate as expected when deployed. When environments are not in use, turn them off to avoid costs associated with idle resources (for example, development systems on evenings and weekends). Deploy production-equivalent environments when load testing to obtain valid results.
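As a sketch of turning off idle environments, the script below stops development instances identified by a hypothetical Environment=dev tag; it could run on an evening schedule, for example from an EventBridge-scheduled Lambda function.

```python
# Minimal sketch: stop tagged development instances outside working hours.
# The Environment=dev tag is an illustrative convention, not an AWS default.
import boto3

ec2 = boto3.client('ec2')

reservations = ec2.describe_instances(
    Filters=[
        {'Name': 'tag:Environment', 'Values': ['dev']},
        {'Name': 'instance-state-name', 'Values': ['running']},
    ]
)['Reservations']

instance_ids = [
    instance['InstanceId']
    for reservation in reservations
    for instance in reservation['Instances']
]

if instance_ids:
    ec2.stop_instances(InstanceIds=instance_ids)
    print('Stopping:', instance_ids)
```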
Resources
Related documents:
Frequent, small, and reversible changes reduce the scope and impact of a change. Use them in conjunction with change management systems, configuration management systems, and build and delivery systems. This results in more effective troubleshooting and faster remediation, with the option to roll back changes.
Common anti-patterns:
• You deploy a new version of your application quarterly, with a change window during which a core service is turned off.
• You frequently make changes to your database schema without tracking changes in your
management systems.
• You perform manual in-place updates, overwriting existing installations and configurations, and
have no clear roll-back plan.
Processes are repeatable and are standardized across teams. Developers are free to focus on
development and code pushes, increasing productivity.
Common anti-patterns:
• On Friday, you finish authoring the new code for your feature branch. On Monday, after running
your code quality test scripts and each of your unit tests scripts, you check in your code for the
next scheduled release.
• You are assigned to code a fix for a critical issue impacting a large number of customers in
production. After testing the fix, you commit your code and email change management to
request approval to deploy it to production.
• As a developer, you log into the AWS Management Console to create a new development
environment using non-standard methods and systems.
Benefits of establishing this best practice: By implementing automated build and deployment management systems, you reduce errors caused by manual processes and reduce the effort to deploy changes, helping your team members focus on delivering business value. You increase the speed of delivery as you promote through to production.
Implementation guidance
You use build and deployment management systems to track and implement change, to reduce
errors caused by manual processes, and reduce the level of effort. Fully automate the integration
and deployment pipeline from code check-in through build, testing, deployment, and validation.
This reduces lead time, encourages increased frequency of change, reduces the level of effort,
increases the speed to market, results in increased productivity, and increases the security of your
code as you promote through to production.
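The sketch below shows one way to start a release and inspect stage status programmatically with AWS CodePipeline. The pipeline name is a placeholder, and in a fully automated setup executions are normally started by a source change rather than by hand.

```python
# Minimal sketch: start a CodePipeline execution and report stage status.
# 'example-pipeline' is a placeholder name.
import boto3

codepipeline = boto3.client('codepipeline')

execution = codepipeline.start_pipeline_execution(name='example-pipeline')
print('Started execution:', execution['pipelineExecutionId'])

state = codepipeline.get_pipeline_state(name='example-pipeline')
for stage in state['stageStates']:
    status = stage.get('latestExecution', {}).get('status', 'UNKNOWN')
    print(f"{stage['stageName']}: {status}")
```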
Resources
Related documents:
• You performed a deployment and your application has become unstable, but there appear to be active users on the system. You have to decide whether to roll back the change and impact the active users, or wait to roll back the change knowing the users may be impacted regardless.
• After making a routine change, your new environments are accessible, but one of your subnets has become unreachable. You have to decide whether to roll back everything or try to fix the inaccessible subnet. While you are making that determination, the subnet remains unreachable.
• Your systems are not architected in a way that allows them to be updated with smaller releases.
As a result, you have difficulty in reversing those bulk changes during a failed deployment.
• You do not use infrastructure as code (IaC) and you made manual updates to your infrastructure
that resulted in an undesired configuration. You are unable to effectively track and revert the
manual changes.
• Because you have not measured increased frequency of your deployments, your team is not
incentivized to reduce the size of their changes and improve their rollback plans for each change,
leading to more risk and increased failure rates.
• You do not measure the total duration of an outage caused by unsuccessful changes. Your team
is unable to prioritize and improve its deployment process and recovery plan effectiveness.
Benefits of establishing this best practice: Having a plan to recover from unsuccessful changes
minimizes the mean time to recover (MTTR) and reduces your business impact.
Implementation guidance
A consistent, documented policy and practice adopted by release teams allows an organization
to plan what should happen if unsuccessful changes occur. The policy should allow for fixing
forward in specific circumstances. In either situation, a fix forward or rollback plan should be well
documented and tested before deployment to live production so that the time it takes to revert a
change is minimized.
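One hedged illustration of a documented rollback plan: keep the last known good infrastructure template versioned in source control or Amazon S3, and redeploy it if success criteria are not met. The stack name and template URL below are placeholders, and your plan might equally rely on a deployment service's built-in rollback.

```python
# Minimal sketch: roll back a stack by redeploying the last known good template.
# Stack name and template URL are placeholders for versioned artifacts you control.
import boto3

cloudformation = boto3.client('cloudformation')

cloudformation.update_stack(
    StackName='example-workload',
    TemplateURL='https://s3.amazonaws.com/example-bucket/templates/workload-v41.yaml',
    Capabilities=['CAPABILITY_NAMED_IAM'],
)

# Wait for the rollback deployment to finish before declaring recovery complete.
waiter = cloudformation.get_waiter('stack_update_complete')
waiter.wait(StackName='example-workload')
print('Rollback deployment complete')
```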
Implementation steps
1. Document the policies that require teams to have effective plans to reverse changes within a
specified period.
a. Policies should specify when a fix-forward situation is allowed.
b. Require a documented rollback plan to be accessible by all involved.
Test release procedures in pre-production by using the same deployment configuration, security
controls, steps, and procedures as in production. Validate that all deployed steps are completed
as expected, such as inspecting files, configurations, and services. Further test all changes with
functional, integration, and load tests, along with any monitoring such as health checks. By doing
these tests, you can identify deployment issues early with an opportunity to plan and mitigate
them prior to production.
You can create temporary parallel environments for testing every change. Automate the deployment of the test environments using infrastructure as code (IaC) to help reduce the amount of work involved and ensure stability, consistency, and faster feature delivery.
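For example, a temporary test environment can be created from the same IaC template used in production, exercised, and then deleted. The stack and template names below are placeholders; AWS CloudFormation, the AWS CDK, or similar tools all support this pattern.

```python
# Minimal sketch: create a temporary test stack, run tests, then tear it down.
# Stack name and template URL are placeholders.
import boto3

cloudformation = boto3.client('cloudformation')
stack_name = 'example-workload-test-pr-123'

cloudformation.create_stack(
    StackName=stack_name,
    TemplateURL='https://s3.amazonaws.com/example-bucket/templates/workload.yaml',
    Capabilities=['CAPABILITY_NAMED_IAM'],
)
cloudformation.get_waiter('stack_create_complete').wait(StackName=stack_name)

try:
    # Placeholder for functional, integration, and load tests against the stack.
    print('Run tests against', stack_name)
finally:
    cloudformation.delete_stack(StackName=stack_name)
    cloudformation.get_waiter('stack_delete_complete').wait(StackName=stack_name)
```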
Desired outcome: Your organization adopts a test-driven development culture that includes
testing deployments. This ensures teams are focused on delivering business value rather than
managing releases. Teams are engaged early upon identification of deployment risks to determine
the appropriate course of mitigation.
Common anti-patterns:
• During production releases, untested deployments cause frequent issues that require
troubleshooting and escalation.
• Your release contains infrastructure as code (IaC) that updates existing resources. You are unsure
if the IaC runs successfully or causes impact to the resources.
• You deploy a new feature to your application. It doesn't work as intended and there is no
visibility until it gets reported by impacted users.
• You update your certificates. You accidentally install the certificates to the wrong components,
which goes undetected and impacts website visitors because a secure connection to the website
can't be established.
Resources
Related documents:
Related videos:
Related examples:
Safe production roll-outs control the flow of beneficial changes with an aim to minimize any
perceived impact for customers from those changes. The safety controls provide inspection
mechanisms to validate desired outcomes and limit the scope of impact from any defects
introduced by the changes or from deployment failures. Safe roll-outs may include strategies such
as feature-flags, one-box, rolling (canary releases), immutable, traffic splitting, and blue/green
deployments.
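As an illustrative sketch, a blue/green roll-out with AWS CodeDeploy can be started from a pipeline or script. The application, deployment group, and revision location below are placeholders; the deployment group itself defines the blue/green and traffic-shifting behavior.

```python
# Minimal sketch: start a CodeDeploy deployment of a revision stored in Amazon S3.
# Application, deployment group, bucket, and key names are placeholders.
import boto3

codedeploy = boto3.client('codedeploy')

deployment = codedeploy.create_deployment(
    applicationName='example-app',
    deploymentGroupName='example-blue-green-group',
    revision={
        'revisionType': 'S3',
        's3Location': {
            'bucket': 'example-artifacts',
            'key': 'releases/example-app-1.4.2.zip',
            'bundleType': 'zip',
        },
    },
    description='Safe roll-out of release 1.4.2',
)
print('Deployment ID:', deployment['deploymentId'])
```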
Implementation steps
1. Use an approval workflow to initiate the sequence of production roll-out steps upon promotion to production.
2. Use an automated deployment system such as AWS CodeDeploy. AWS CodeDeploy deployment
options include in-place deployments for EC2/On-Premises and blue/green deployments for
EC2/On-Premises, AWS Lambda, and Amazon ECS (see the preceding workflow diagram).
a. Where applicable, integrate AWS CodeDeploy with other AWS services or integrate AWS CodeDeploy with partner products and services.
3. Use blue/green deployments for databases such as Amazon Aurora and Amazon RDS.
4. Monitor deployments using Amazon CloudWatch, AWS CloudTrail, and Amazon Simple
Notification Service (Amazon SNS) event notifications.
To increase the speed, reliability, and confidence of your deployment process, have a strategy
for automated testing and rollback capabilities in pre-production and production environments.
Automate testing when deploying to production to simulate human and system interactions
that verify the changes being deployed. Automate rollback to quickly revert to a previous known good state. The rollback should be initiated automatically on pre-defined conditions, such as when the desired outcome of your change is not achieved or when the automated test fails.
Automating these two activities improves your success rate for your deployments, minimizes
recovery time, and reduces the potential impact to the business.
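As a sketch of wiring rollback criteria to monitoring, an AWS CodeDeploy deployment group can be configured to stop and roll back automatically when a CloudWatch alarm fires or the deployment fails. The application, deployment group, and alarm names below are placeholders.

```python
# Minimal sketch: configure automatic rollback on failure or on an alarm breach.
# Application, deployment group, and alarm names are placeholders.
import boto3

codedeploy = boto3.client('codedeploy')

codedeploy.update_deployment_group(
    applicationName='example-app',
    currentDeploymentGroupName='example-blue-green-group',
    alarmConfiguration={
        'enabled': True,
        'alarms': [{'name': 'example-app-5xx-rate'}],
    },
    autoRollbackConfiguration={
        'enabled': True,
        'events': ['DEPLOYMENT_FAILURE', 'DEPLOYMENT_STOP_ON_ALARM'],
    },
)
print('Automatic rollback configured')
```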
Desired outcome: Your automated tests and rollback strategies are integrated into your
continuous integration, continuous delivery (CI/CD) pipeline. Your monitoring is able to validate
against your success criteria and initiate automatic rollback upon failure. This minimizes any
impact to end users and customers. For example, when all testing outcomes have been satisfied,
you promote your code into the production environment where automated regression testing is
initiated, leveraging the same test cases. If regression test results do not match expectations, then
automated rollback is initiated in the pipeline workflow.
Common anti-patterns:
• Your systems are not architected in a way that allows them to be updated with smaller releases.
As a result, you have difficulty in reversing those bulk changes during a failed deployment.
• Your deployment process consists of a series of manual steps. After you deploy changes to your
workload, you start post-deployment testing. After testing, you realize that your workload is
inoperable and customers are disconnected. You then begin rolling back to the previous version.
All of these manual steps delay overall system recovery and cause a prolonged impact to your
customers.
• You spent time developing automated test cases for functionality that is not frequently used in
your application, minimizing the return on investment in your automated testing capability.
• Your release comprises application, infrastructure, patch, and configuration updates that are independent from one another. However, you have a single CI/CD pipeline that delivers all changes at once. A failure in one component forces you to revert all changes, making your rollback complex and inefficient.
• Your team completes the coding work in sprint one and begins sprint two work, but your plan
did not include testing until sprint three. As a result, automated tests revealed defects from
3. Decide which test cases you wish to automate and which should be performed manually. These can be defined based on the business-value priority of the feature being tested. Align all team members to this plan and verify accountability for performing manual tests.
a. Apply automated testing capabilities to specific test cases that make sense for automation,
such as repeatable or frequently run cases, those that require repetitive tasks, or those that
are required across multiple configurations.
b. Define test automation scripts as well as the success criteria in the automation tool so
continued workflow automation can be initiated when specific cases fail.
4. Prioritize test automation to drive consistent results with thorough test case development where
complexity and human interaction have a higher risk of failure.
5. Integrate your automated testing and rollback tools into your CI/CD pipeline.
a. Develop clear success criteria for your changes.
b. Monitor and observe to detect these criteria and automatically reverse changes when specific
rollback criteria are met.
6. Perform different types of automated production testing, such as:
a. A/B testing to show results in comparison to the current version between two user testing groups.
b. Canary testing that allows you to roll out your change to a subset of users before releasing it
to all.
c. Feature-flag testing which allows a single feature of the new version at a time to be flagged
on and off from outside the application so that each new feature can be validated one at a
time.
7. Monitor the operational aspects of the application, transactions, and interactions with other
applications and components. Develop reports to show success of changes by workload so that
you can identify what parts of the automation and workflow can be further optimized.
a. Develop test result reports that help you make quick decisions on whether or not rollback
procedures should be invoked.
b. Implement a strategy that allows for automated rollback based upon pre-defined failure
conditions that result from one or more of your test methods.
8. Develop your automated test cases to allow for reusability across future repeatable changes.
Have a mechanism to validate that you have the appropriate number of trained personnel to
support the workload. They must be trained on the platform and services that make up your
workload. Provide them with the knowledge necessary to operate the workload. You must have
enough trained personnel to support the normal operation of the workload and troubleshoot any
incidents that occur. Have enough personnel so that you can rotate during on-call and vacations to
avoid burnout.
Desired outcome:
• There are enough trained personnel to support the workload at times when the workload is
available.
• You provide training for your personnel on the software and services that make up your
workload.
Common anti-patterns:
• Deploying a workload without team members trained to operate the platform and services in
use.
• Not having enough personnel to support on-call rotations or personnel taking time off.
Implementation guidance
Validate that there are sufficient trained personnel to support the workload. Verify that you have
enough team members to cover normal operational activities, including on-call rotations.
Customer example
An operational readiness review (ORR) is a review and inspection process that uses a checklist of requirements. An ORR is a self-service experience that teams use to certify their workloads. ORRs include best practices from lessons learned from our years of building software.
Run ORRs before a workload launches to general availability and then throughout the software
development lifecycle. Running the ORR before launch increases your ability to operate the
workload safely. Periodically re-run your ORR on the workload to catch any drift from best
practices. You can have ORR checklists for new service launches and ORRs for periodic reviews. This helps you stay up to date on new best practices that arise and incorporate lessons learned from post-incident analysis. As your use of the cloud matures, you can build ORR requirements into
your architecture as defaults.
Desired outcome: You have an ORR checklist with best practices for your organization. ORRs are
conducted before workloads launch. ORRs are run periodically over the course of the workload
lifecycle.
Common anti-patterns:
4. Identify one workload to conduct the ORR on. A pre-launch workload or an internal workload is
ideal.
5. Run through the ORR checklist and take note of any discoveries made. A discovery might be acceptable if a mitigation is in place. For any discovery that lacks a mitigation, add it to your backlog of items and implement it before launch.
6. Continue to add best practices and requirements to your ORR checklist over time.
AWS Support customers with Enterprise Support can request the Operational Readiness Review
Workshop from their Technical Account Manager. The workshop is an interactive working
backwards session to develop your own ORR checklist.
Level of effort for the implementation plan: High. Adopting an ORR practice in your organization
requires executive sponsorship and stakeholder buy-in. Build and update the checklist with inputs
from across your organization.
Resources
As your organization matures, begin automating runbooks. Start with runbooks that are short and
frequently used. Use scripting languages to automate steps or make steps easier to perform. As
you automate the first few runbooks, you'll dedicate time to automating more complex runbooks.
Over time, most of your runbooks should be automated in some way.
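For example, a frequently used runbook step such as restarting an impaired instance can be delegated to an AWS Systems Manager Automation runbook. The instance ID below is a placeholder, and AWS-RestartEC2Instance is one of the AWS-managed runbooks.

```python
# Minimal sketch: run an automated runbook step with Systems Manager Automation.
# The instance ID is a placeholder.
import boto3

ssm = boto3.client('ssm')

execution = ssm.start_automation_execution(
    DocumentName='AWS-RestartEC2Instance',       # AWS-managed automation runbook
    Parameters={'InstanceId': ['i-0123456789abcdef0']},
)
execution_id = execution['AutomationExecutionId']

status = ssm.get_automation_execution(AutomationExecutionId=execution_id)
print('Automation status:', status['AutomationExecution']['AutomationExecutionStatus'])
```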
Desired outcome: Your team has a collection of step-by-step guides for performing workload
tasks. The runbooks contain the desired outcome, necessary tools and permissions, and
instructions for error handling. They are stored in a central location (version control system) and
updated frequently. For example, your runbooks provide capabilities for your teams to monitor,
communicate, and respond to AWS Health events for critical accounts during application alarms,
operational issues, and planned lifecycle events.
Common anti-patterns:
Implementation guidance
Runbooks can take several forms depending on the maturity level of your organization. At a
minimum, they should consist of a step-by-step text document. The desired outcome should
be clearly indicated. Clearly document necessary special permissions or tools. Provide detailed
guidance on error handling and escalations in case something goes wrong. List the runbook
owner and publish it in a central location. Once your runbook is documented, validate it by having
someone else on your team run it. As procedures evolve, update your runbooks in accordance with
your change management process.
5. Give the runbook to a team member. Have them use the runbook to validate the steps. If
something is missing or needs clarity, update the runbook.
6. Publish the runbook to your internal documentation store. Once published, tell your team and
other stakeholders.
7. Over time, you'll build a library of runbooks. As that library grows, start working to automate
runbooks.
Level of effort for the implementation plan: Low. The minimum standard for a runbook is a step-
by-step text guide. Automating runbooks can increase the implementation effort.
Resources
Related documents:
Related videos:
• AWS re:Invent 2019: DIY guide to runbooks, incident reports, and incident response
• How to automate IT Operations on AWS | Amazon Web Services
• Integrate Scripts into AWS Systems Manager
Related examples:
Desired outcome: Your organization has playbooks for common incidents. The playbooks are
stored in a central location and available to your team members. Playbooks are updated frequently.
For any known root causes, companion runbooks are built.
Common anti-patterns:
• New team members learn how to investigate issues through trial and error.
• Best practices for investigating issues are not shared across teams.
Benefits of establishing this best practice:
• Different team members can use the same playbook to identify a root cause in a consistent manner.
• Known root causes can have runbooks developed for them, speeding up recovery time.
Implementation guidance
How you build and use playbooks depends on the maturity of your organization. If you are new
to the cloud, build playbooks in text form in a central document repository. As your organization
matures, playbooks can become semi-automated with scripting languages like Python. These
scripts can be run inside a Jupyter notebook to speed up discovery. Advanced organizations have
fully automated playbooks for common issues that are auto-remediated with runbooks.
Start building your playbooks by listing common incidents that happen to your workload. To start, choose playbooks for incidents that are low risk and where the root cause has been narrowed down to a few issues. After you have playbooks for simpler scenarios, move on to the higher-risk scenarios or scenarios where the root cause is not well known.
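As a sketch of a semi-automated playbook step, the snippet below (which could live in a Jupyter notebook) pulls recent change events from AWS CloudTrail to answer the common investigative question of what changed before the incident. The lookback window and event name filter are illustrative.

```python
# Minimal sketch: a playbook step that lists recent security group changes.
# The lookback window and event name are illustrative choices.
from datetime import datetime, timedelta, timezone
import boto3

cloudtrail = boto3.client('cloudtrail')
now = datetime.now(timezone.utc)

events = cloudtrail.lookup_events(
    LookupAttributes=[{'AttributeKey': 'EventName',
                       'AttributeValue': 'AuthorizeSecurityGroupIngress'}],
    StartTime=now - timedelta(hours=2),
    EndTime=now,
)['Events']

for event in events:
    print(event['EventTime'], event.get('Username', 'unknown'), event['EventName'])
```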
2. Identify a common issue that requires investigation. This should be a scenario where the root
cause is limited to a few issues and resolution is low risk.
3. Using the Markdown template, fill in the Playbook Name section and the fields under Playbook
Info.
4. Fill in the troubleshooting steps. Be as clear as possible on what actions to perform or what
areas you should investigate.
5. Give a team member the playbook and have them go through it to validate it. If there's anything
missing or something isn't clear, update the playbook.
6. Publish your playbook in your document repository and inform your team and any stakeholders.
7. This playbook library will grow as you add more playbooks. Once you have several playbooks,
start automating them using tools like AWS Systems Manager Automations to keep automation
and playbooks in sync.
Level of effort for the implementation plan: Low. Your playbooks should be text documents
stored in a central location. More mature organizations will move towards automating playbooks.
Resources
Related documents:
Related videos:
• Making changes to your production environment that are out of compliance with governance
requirements.
• Deploying a new version of your workload without establishing a baseline for resource
utilization.
Implementation guidance
Use pre-mortems to develop processes for unsuccessful changes. Document your processes for
unsuccessful changes. Ensure that all changes comply with governance. Evaluate the benefits and
risks to deploying changes to your workload.
Customer example
AnyCompany Retail regularly conducts pre-mortems to validate their processes for unsuccessful
changes. They document their processes in a shared Wiki and update it frequently. All changes
comply with governance requirements.
Implementation steps
1. Make informed decisions when deploying changes to your workload. Establish and review
criteria for a successful deployment. Develop scenarios or criteria that would initiate a rollback
of a change. Weigh the benefits of deploying changes against the risks of an unsuccessful
change.
3. Use pre-mortems to plan for unsuccessful changes and document mitigation strategies. Run a
table-top exercise to model an unsuccessful change and validate roll-back procedures.
Level of effort for the implementation plan: Moderate. Implementing a practice of pre-mortems requires coordination and effort from stakeholders across your organization.
• A developer that was the primary point of contact for a software vendor left the company.
You are not able to reach the vendor support directly. You must spend time researching and
navigating generic contact systems, increasing the time required to respond when needed.
• A production outage occurs with a software vendor. There is no documentation on how to file a
support case.
Benefits of establishing this best practice:
• With the appropriate support level, you are able to get a response in the time frame necessary to meet service-level needs.
• As a supported customer you can escalate if there are production issues.
• Software and services vendors can assist in troubleshooting during an incident.
Implementation guidance
Enable support plans for any software and services vendors that your production workload relies
on. Set up appropriate support plans to meet service-level needs. For AWS customers, this means
activating AWS Business Support or greater on any accounts where you have production workloads.
Meet with support vendors on a regular cadence to get updates about support offerings, processes,
and contacts. Document how to request support from software and services vendors, including
how to escalate if there is an outage. Implement mechanisms to keep support contacts up to date.
Customer example
At AnyCompany Retail, all commercial software and services dependencies have support plans.
For example, they have AWS Enterprise Support activated on all accounts with production
workloads. Any developer can raise a support case when there is an issue. There is a wiki page with
information on how to request support, whom to notify, and best practices for expediting a case.
Implementation steps
1. Work with stakeholders in your organization to identify software and services vendors that your
workload relies on. Document these dependencies.
2. Determine service-level needs for your workload. Select a support plan that aligns with them.
3. For commercial software and services, establish a support plan with the vendors.
Ensure optimal workload health by leveraging observability. Utilize relevant metrics, logs, and
traces to gain a comprehensive view of your workload's performance and address issues efficiently.
Best practices
• OPS08-BP01 Analyze workload metrics
• OPS08-BP02 Analyze workload logs
• OPS08-BP03 Analyze workload traces
• OPS08-BP04 Create actionable alerts
• OPS08-BP05 Create dashboards
After implementing application telemetry, regularly analyze the collected metrics. While latency,
requests, errors, and capacity (or quotas) provide insights into system performance, it's vital to
prioritize the review of business outcome metrics. This ensures you're making data-driven decisions
aligned with your business objectives.
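For example, metrics can be pulled programmatically for regular analysis alongside business outcome metrics. The sketch below retrieves a p99 latency series for a hypothetical Application Load Balancer; the dimension value is a placeholder.

```python
# Minimal sketch: retrieve p99 latency for analysis with CloudWatch GetMetricData.
# The load balancer dimension value is a placeholder.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client('cloudwatch')
now = datetime.now(timezone.utc)

result = cloudwatch.get_metric_data(
    MetricDataQueries=[{
        'Id': 'p99_latency',
        'MetricStat': {
            'Metric': {
                'Namespace': 'AWS/ApplicationELB',
                'MetricName': 'TargetResponseTime',
                'Dimensions': [{'Name': 'LoadBalancer',
                                'Value': 'app/example-alb/0123456789abcdef'}],
            },
            'Period': 300,
            'Stat': 'p99',
        },
    }],
    StartTime=now - timedelta(hours=3),
    EndTime=now,
)
series = result['MetricDataResults'][0]
print(list(zip(series['Timestamps'], series['Values'])))
```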
Desired outcome: Accurate insights into workload performance that drive data-informed decisions,
ensuring alignment with business objectives.
Common anti-patterns:
Resources
Related documents:
Related videos:
Related examples:
Regularly analyzing workload logs is essential for gaining a deeper understanding of the
operational aspects of your application. By efficiently sifting through, visualizing, and interpreting
log data, you can continually optimize application performance and security.
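As a sketch, CloudWatch Logs Insights can be queried programmatically to surface the most recent application errors. The log group name and query string below are illustrative.

```python
# Minimal sketch: run a CloudWatch Logs Insights query for recent errors.
# Log group name and query string are illustrative.
import time
from datetime import datetime, timedelta, timezone
import boto3

logs = boto3.client('logs')
now = datetime.now(timezone.utc)

query_id = logs.start_query(
    logGroupName='/example/application',
    startTime=int((now - timedelta(hours=1)).timestamp()),
    endTime=int(now.timestamp()),
    queryString='fields @timestamp, @message | filter @message like /ERROR/ '
                '| sort @timestamp desc | limit 20',
)['queryId']

response = logs.get_query_results(queryId=query_id)
while response['status'] in ('Scheduled', 'Running'):
    time.sleep(1)
    response = logs.get_query_results(queryId=query_id)

for row in response['results']:
    print({field['field']: field['value'] for field in row})
```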
Desired outcome: Rich insights into application behavior and operations derived from thorough
log analysis, ensuring proactive issue detection and mitigation.
Common anti-patterns:
4. Monitor logs in real-time with Live Tail: Use Amazon CloudWatch Logs Live Tail to view log
data in real-time. You can actively monitor your application's operational activities as they occur,
which provides immediate visibility into system performance and potential issues.
5. Leverage Contributor Insights: Use CloudWatch Contributor Insights to identify top talkers in
high cardinality dimensions like IP addresses or user-agents.
6. Implement CloudWatch Logs metric filters: Configure CloudWatch Logs metric filters to
convert log data into actionable metrics. This allows you to set alarms or further analyze
patterns.
7. Implement CloudWatch cross-account observability: Monitor and troubleshoot applications
that span multiple accounts within a Region.
8. Regular review and refinement: Periodically review your log analysis strategies to capture all
relevant information and continually optimize application performance.
Resources
Related documents:
Related videos:
Related examples:
Implementation steps
The following steps offer a structured approach to effectively implementing trace data analysis
using AWS services:
1. Integrate AWS X-Ray: Ensure X-Ray is integrated with your applications to capture trace data.
2. Analyze X-Ray metrics: Delve into metrics derived from X-Ray traces, such as latency, request
rates, fault rates, and response time distributions, using the service map to monitor application
health.
3. Use ServiceLens: Leverage the ServiceLens map for enhanced observability of your services and
applications. This allows for integrated viewing of traces, metrics, logs, alarms, and other health
information.
4. Enable X-Ray Insights:
a. Turn on X-Ray Insights for automated anomaly detection in traces.
b. Examine insights to pinpoint patterns and ascertain root causes, such as increased fault rates
or latencies.
c. Consult the insights timeline for a chronological analysis of detected issues.
5. Use X-Ray Analytics: X-Ray Analytics allows you to thoroughly explore trace data, pinpoint
patterns, and extract insights.
6. Use groups in X-Ray: Create groups in X-Ray to filter traces based on criteria such as high
latency, allowing for more targeted analysis.
7. Incorporate Amazon DevOps Guru: Engage Amazon DevOps Guru to benefit from machine
learning models pinpointing operational anomalies in traces.
8. Use CloudWatch Synthetics: Use CloudWatch Synthetics to create canaries for continually
monitoring your endpoints and workflows. These canaries can integrate with X-Ray to provide
trace data for in-depth analysis of the applications being tested.
9. Use Real User Monitoring (RUM): With AWS X-Ray and CloudWatch RUM, you can analyze and
debug the request path starting from end users of your application through downstream AWS
managed services. This helps you identify latency trends and errors that impact your end users.
10.Correlate with logs: Correlate trace data with related logs within the X-Ray trace view for
a granular perspective on application behavior. This allows you to view log events directly
associated with traced transactions.
11.Implement CloudWatch cross-account observability: Monitor and troubleshoot applications
that span multiple accounts within a Region.
Common anti-patterns:
Implementation guidance
To create an effective alerting mechanism, it's vital to use metrics, logs, and trace data that flag
when outcomes based on KPIs are at risk or anomalies are detected.
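For example, a CloudWatch alarm can be tied to an anomaly detection band rather than a static threshold, as described in the steps that follow, so alerts fire only when a KPI-related metric moves outside its expected range. The metric, dimension, and SNS topic below are placeholders.

```python
# Minimal sketch: alarm on an anomaly detection band for a latency metric.
# Metric dimensions and the SNS topic ARN are placeholders.
import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='example-latency-anomaly',
    ComparisonOperator='GreaterThanUpperThreshold',
    EvaluationPeriods=3,
    ThresholdMetricId='band',
    Metrics=[
        {
            'Id': 'latency',
            'MetricStat': {
                'Metric': {
                    'Namespace': 'AWS/ApplicationELB',
                    'MetricName': 'TargetResponseTime',
                    'Dimensions': [{'Name': 'LoadBalancer',
                                    'Value': 'app/example-alb/0123456789abcdef'}],
                },
                'Period': 300,
                'Stat': 'p99',
            },
            'ReturnData': True,
        },
        {
            'Id': 'band',
            'Expression': 'ANOMALY_DETECTION_BAND(latency, 2)',
            'ReturnData': True,
        },
    ],
    AlarmActions=['arn:aws:sns:us-east-1:111122223333:example-ops-alerts'],
)
```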
Implementation steps
1. Determine key performance indicators (KPIs): Identify your application's KPIs. Alerts should be
tied to these KPIs to reflect the business impact accurately.
2. Implement anomaly detection:
• Use Amazon CloudWatch anomaly detection: Set up Amazon CloudWatch anomaly detection
to automatically detect unusual patterns, which helps you only generate alerts for genuine
anomalies.
• Use AWS X-Ray Insights:
a. Set up X-Ray Insights to detect anomalies in trace data.
b. Configure notifications for X-Ray Insights to be alerted on detected issues.
• Integrate with Amazon DevOps Guru:
a. Leverage Amazon DevOps Guru for its machine learning capabilities in detecting
operational anomalies with existing data.
b. Navigate to the notification settings in DevOps Guru to set up anomaly alerts.
Related videos:
Related examples:
• Alarms, incident management, and remediation in the cloud with Amazon CloudWatch
• Tutorial: Creating an Amazon EventBridge rule that sends notifications to AWS Chatbot
• One Observability Workshop
Dashboards are the human-centric view into the telemetry data of your workloads. While they
provide a vital visual interface, they should not replace alerting mechanisms, but complement
them. When crafted with care, not only can they offer rapid insights into system health and
performance, but they can also present stakeholders with real-time information on business
outcomes and the impact of issues.
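As a sketch, dashboards can be created and updated as code so they stay versioned and consistent across workloads. The single widget below charts a placeholder business metric; real dashboards typically combine business and system metrics.

```python
# Minimal sketch: create or update a CloudWatch dashboard from code.
# Dashboard name, region, and the charted metric are placeholders.
import json
import boto3

cloudwatch = boto3.client('cloudwatch')

dashboard_body = {
    'widgets': [{
        'type': 'metric',
        'x': 0, 'y': 0, 'width': 12, 'height': 6,
        'properties': {
            'title': 'Orders completed (business outcome)',
            'metrics': [['ExampleApp', 'OrdersCompleted']],
            'stat': 'Sum',
            'period': 300,
            'region': 'us-east-1',
        },
    }]
}

cloudwatch.put_dashboard(
    DashboardName='example-workload-overview',
    DashboardBody=json.dumps(dashboard_body),
)
```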
Desired outcome:
Clear, actionable insights into system and business health using visual representations.
Common anti-patterns:
significance of the represented metrics, and can also contain links to other dashboards and
troubleshooting tools.
3. Create dashboard variables: Incorporate dashboard variables where appropriate to allow for
dynamic and flexible dashboard views.
4. Create metrics widgets: Add metric widgets to visualize various metrics your application emits,
tailoring these widgets to effectively represent system health and business outcomes.
5. Logs Insights queries: Utilize CloudWatch Logs Insights to derive actionable metrics from your logs and display these insights on your dashboard.
6. Set up alarms: Integrate CloudWatch Alarms into your dashboard for a quick view of any metrics
breaching their thresholds.
7. Use Contributor Insights: Incorporate CloudWatch Contributor Insights to analyze high-
cardinality fields and get a clearer understanding of your resource's top contributors.
8. Design custom widgets: For specific needs not met by standard widgets, consider creating
custom widgets. These can pull from various data sources or represent data in unique ways.
9. Use AWS Health Dashboard: Use AWS Health Dashboard to get deeper insights into your
account health, events, and upcoming changes that might affect your services and resources.
You can also get a centralized view for health events in your AWS Organizations or build your
own custom dashboards (for more detail, see Related examples).
10.Iterate and refine: As your application evolves, regularly revisit your dashboard to ensure its
relevance.
Resources
Related documents:
• Time spent working issues with or without a standardized operating procedure (SOP)
• Amount of time spent recovering from a failed code push
• Call volume
Common anti-patterns:
• Deployment deadlines are missed because developers are pulled away to perform
troubleshooting tasks. Development teams argue for more personnel, but cannot quantify how
many they need because the time taken away cannot be measured.
• A Tier 1 desk was set up to handle user calls. Over time, more workloads were added, but no
headcount was allocated to the Tier 1 desk. Customer satisfaction suffers as call times increase
and issues go longer without resolution, but management sees no indicators of such, preventing
any action.
• A problematic workload has been handed off to a separate operations team for upkeep. Unlike
other workloads, this new one was not supplied with proper documentation and runbooks. As
such, teams spend longer troubleshooting and addressing failures. However, there are no metrics
documenting this, which makes accountability difficult.
Benefits of establishing this best practice: Where workload monitoring shows the state of your applications and services, monitoring operations teams gives workload owners insight into changes among the consumers of those workloads, such as shifting business needs. Measure the effectiveness of these teams and evaluate them against business goals by creating metrics that reflect the state of operations. Metrics can highlight support issues or identify when drifts occur away from a service level target.
Implementation guidance
Schedule time with business leaders and stakeholders to determine the overall goals of the service. Determine what the tasks of various operations teams should be and what challenges they might face. Using these, brainstorm key performance indicators (KPIs) that might reflect these operations goals. These might be customer satisfaction, time from feature conception to deployment, average issue resolution time, and others.
Working from KPIs, identify the metrics and sources of data that might reflect these goals best. Customer satisfaction may be a combination of various metrics such as call wait or response
• A workload goes down, leaving a service unavailable. Call volumes spike as users request to
know what's going on. Managers add to the volume requesting to know who's working an issue.
Various operations teams duplicate efforts in trying to investigate.
• A desire for a new capability leads to several personnel being reassigned to an engineering
effort. No backfill is provided, and issue resolution times spike. This information is not captured,
and only after several weeks and dissatisfied user feedback does leadership become aware of the
issue.
Benefits of establishing this best practice: During operational events where the business is impacted, much time and energy can be wasted querying information from various teams attempting to understand the situation. By establishing widely-disseminated status pages and dashboards, stakeholders can quickly obtain information such as whether or not an issue was detected, who has the lead on the issue, or when a return to normal operations may be expected. This frees team members from spending too much time communicating status to others, giving them more time to address issues.
In addition, dashboards and reports can provide insights to decision-makers and stakeholders to
see how operations teams are able to respond to business needs and how their resources are being
allocated. This is crucial for determining if adequate resources are in place to support the business.
Implementation guidance
Build dashboards that show the current key metrics for your ops teams, and make them readily
accessible both to operations leaders and management.
Build status pages that can be updated quickly to show when an incident or event is unfolding,
who has ownership and who is coordinating the response. Share any steps or workarounds that
users should consider on this page, and disseminate the location widely. Encourage users to check
this location first when confronted with an unknown issue.
Collect and provide reports that show the health of operations over time, and distribute this to
leaders and decision makers to illustrate the work of operations along with challenges and needs.
Share between teams the metrics and reports that best reflect goals and KPIs, and highlight where they have been influential in driving change. Dedicate time to these activities to elevate the importance of operations inside of and between teams.
Benefits of establishing this best practice: In some organizations, it can become a challenge
to allocate the same time and attention that is afforded to service delivery and new products or
offerings. When this occurs, the line of business can suffer as the level of service expected slowly
deteriorates. This is because operations does not change and evolve with the growing business,
and can soon be left behind. Without regular review into the insights operations collects, the risk
to the business may become visible only when it's too late. By allocating time to review metrics
and procedures both among the operations staff and with leadership, the crucial role operations
plays remains visible, and risks can be identified long before they reach critical levels. Operations
teams get better insight into impending business changes and initiatives, allowing for proactive
efforts to be undertaken. Leadership visibility into operations metrics showcases the role that these
teams play in customer satisfaction, both internal and external, and let them better weigh choices
for priorities, or ensure that operations has the time and resources to change and evolve with new
business and workload initiatives.
Implementation guidance
Dedicate time to review operations metrics between stakeholders and operations teams and review report data. Place these reports in the context of the organization's goals and objectives to determine if they're being met. Identify sources of ambiguity where goals are not clear, or where there may be conflicts between what is asked for and what is given.
Identify where time, people, and tools can aid in operations outcomes. Determine which KPIs this
would impact and what targets for success should be. Revisit regularly to ensure operations is
resourced sufficiently to support the line of business.
Resources
Related documents:
• Amazon Athena
• Amazon CloudWatch metrics and dimensions reference
• Amazon QuickSight
• AWS Glue
• AWS Glue Data Catalog
• Collect metrics and logs from Amazon EC2 instances and on-premises servers with the Amazon
CloudWatch Agent
Implementation guidance
Implementing this best practice means you are tracking workload events. You have processes to
handle incidents and problems. The processes are documented, shared, and updated frequently.
Problems are identified, prioritized, and fixed.
Implementation steps
Events
1. Monitor events:
• Implement and use workload observability.
• Monitor actions taken by a user, role, or an AWS service, which are recorded as events in AWS CloudTrail.
• Respond to operational changes in your applications in real time with Amazon EventBridge.
• Continually assess, monitor, and record resource configuration changes with AWS Config.
2. Create processes:
• Develop a process to assess which events are significant and require monitoring. This involves
setting thresholds and parameters for normal and abnormal activities.
1. Identify problems:
• Use data from previous incidents to identify recurring patterns that may indicate deeper
systemic issues.
• Leverage tools like AWS CloudTrail and Amazon CloudWatch to analyze trends and uncover
underlying problems.
• Engage cross-functional teams, including operations, development, and business units, to gain
diverse perspectives on the root causes.
• Incorporate root cause analysis (RCA) techniques to investigate and understand the underlying
causes of incidents.
• Update operational policies, procedures, and infrastructure based on findings to prevent
recurrence.
3. Continue to improve:
• Regularly review and revise problem management processes and tools to align with evolving
business and technology landscapes.
• Share insights and best practices across the organization to build a more resilient and efficient
operational environment.
• Enterprise Support customers can access specialized programs like AWS Countdown for
support during critical events.
Resources
• Amazon EventBridge
Establishing a clear and defined process for each alert in your system is essential for effective and
efficient incident management. This practice ensures that every alert leads to a specific, actionable
response, improving the reliability and responsiveness of your operations.
Desired outcome: Every alert initiates a specific, well-defined response plan. Where possible,
responses are automated, with clear ownership and a defined escalation path. Alerts are linked
to an up-to-date knowledge base so that any operator can respond consistently and effectively.
Responses are quick and uniform across the board, enhancing operational efficiency and reliability.
Common anti-patterns:
• Alerts have no predefined response process, leading to makeshift and delayed resolutions.
• Alerts are inconsistently handled due to lack of clear ownership and responsibility.
Implementation guidance
Having a process per alert involves establishing a clear response plan for each alert, automating
responses where possible, and continually refining these processes based on operational feedback
and evolving requirements.
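For example, related alarms can be grouped into a composite alarm (as described in the steps that follow) so that a single, well-defined response process is attached to one meaningful alert. The alarm names, runbook reference, and SNS topic below are placeholders.

```python
# Minimal sketch: group related alarms into one composite alarm with a single
# response action. Alarm names and the SNS topic ARN are placeholders.
import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_composite_alarm(
    AlarmName='example-checkout-degraded',
    AlarmRule='ALARM("example-checkout-5xx") OR ALARM("example-checkout-latency")',
    AlarmDescription='Checkout is degraded; follow runbook RB-042 (placeholder).',
    AlarmActions=['arn:aws:sns:us-east-1:111122223333:example-incident-topic'],
)
```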
1. Use composite alarms: Create composite alarms in CloudWatch to group related alarms,
reducing noise and allowing for more meaningful responses.
2. Integrate Amazon CloudWatch alarms with Incident Manager: Configure CloudWatch alarms to automatically create incidents in AWS Systems Manager Incident Manager.
Responding promptly to operational events is critical, but not all events are equal. When you prioritize based on business impact, you prioritize addressing events with the potential for significant consequences, such as risks to safety, financial loss, regulatory violations, or damage to reputation.
Desired outcome: Responses to operational events are prioritized based on potential impact to
business operations and objectives. This makes the responses efficient and effective.
Common anti-patterns:
• Every event is treated with the same level of urgency, leading to confusion and delays in
addressing critical issues.
• You fail to distinguish between high and low impact events, leading to misallocation of
resources.
• Events are prioritized based on the order they are reported, rather than their impact on business
outcomes.
Benefits of establishing this best practice:
• Ensures critical business functions receive attention first, minimizing potential damage.
Implementation guidance
When faced with multiple operational events, a structured approach to prioritization based on
impact and urgency is essential. This approach helps you make informed decisions, direct efforts
where they're needed most, and mitigate the risk to business continuity.
• Make the matrix accessible and understood by all team members responsible for operational
event responses.
• The following example matrix displays incident severity according to urgency and impact:
4. Train and communicate: Train response teams on the prioritization matrix and the importance
of following it during an event. Communicate the prioritization process to all stakeholders to set
clear expectations.
5. Integrate with incident response:
• Incorporate the prioritization matrix into your incident response plans and tools.
• Automate the classification and prioritization of events where possible to speed up response
times.
• Enterprise Support customers can leverage AWS Incident Detection and Response, which
provides 24x7 proactive monitoring and incident management for production workloads.
6. Review and adapt: Regularly review the effectiveness of the prioritization process and make
adjustments based on feedback and changes in the business environment.
Resources
Related documents:
Implementation guidance
Creating a comprehensive communication plan for service-impacting events involves multiple
facets, from choosing the right channels to crafting the message and tone. The plan should be
adaptable and scalable, and it should cater to different outage scenarios.
Implementation steps
• Designate a communications manager responsible for coordinating all external and internal
communications.
• Include the support manager to provide consistent communication through support tickets.
2. Identify communication channels: Select channels like workplace chat, email, SMS, social
media, in-app notifications, and status pages. These channels should be resilient and able to
operate independently during service impacting events.
• Develop templates for various service impairment scenarios, emphasizing simplicity and
essential details. Include information about the service impairment, expected resolution time,
and impact.
• Use Amazon Pinpoint to alert customers using push notifications, in-app notifications, emails,
text messages, voice messages, and messages over custom channels.
• Use Amazon Simple Notification Service (Amazon SNS) to alert subscribers programmatically or
through email, mobile push notifications, and text messages (see the sketch after these steps).
• Communicate status through dashboards by sharing an Amazon CloudWatch dashboard
publicly.
• Post on social media platforms for public updates and community engagement.
4. Coordinate internal communication: Implement internal protocols using tools like AWS
Chatbot for team coordination and communication. Use CloudWatch dashboards to
communicate status.
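As a minimal sketch of the Amazon SNS option above, the following Python (boto3) example publishes a status update to a topic that stakeholders subscribe to; the topic ARN and message text are placeholders.

import boto3

sns = boto3.client("sns")

# Publish a status update to an SNS topic that customers or internal teams
# subscribe to (topic ARN and message text are placeholders).
sns.publish(
    TopicArn="arn:aws:sns:us-east-1:111122223333:service-status-updates",
    Subject="Service impairment update",
    Message=(
        "We are investigating elevated error rates in the checkout service. "
        "Next update in 30 minutes."
    ),
)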
Use dashboards as a strategic tool to convey real-time operational status and key metrics
to different audiences, including internal technical teams, leadership, and customers. These
dashboards offer a centralized, visual representation of system health and business performance,
enhancing transparency and decision-making efficiency.
Desired outcome:
• Your dashboards provide a comprehensive view of the system and business metrics relevant to
different stakeholders.
• Stakeholders can proactively access operational information, reducing the need for frequent
status requests.
Common anti-patterns:
• Engineers joining an incident management call require status updates to get up to speed.
• Relying on manual reporting for management, which leads to delays and potential inaccuracies.
• Operations teams are frequently interrupted for status updates during incidents.
Benefits of establishing this best practice:
• Reduces operational inefficiencies by minimizing manual reporting and frequent status inquiries.
• Increases transparency and trust through real-time visibility into system performance and
business metrics.
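One way to publish such an operational dashboard programmatically is shown in the following Python (boto3) sketch; the dashboard name, metric, and load balancer dimension are illustrative assumptions, not values from this document.

import json
import boto3

cloudwatch = boto3.client("cloudwatch")

# Define a small dashboard that surfaces a key operational metric for
# stakeholders (metric, region, and names are illustrative).
dashboard_body = {
    "widgets": [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Checkout errors (5xx)",
                "metrics": [["AWS/ApplicationELB", "HTTPCode_Target_5XX_Count",
                             "LoadBalancer", "app/checkout/0123456789abcdef"]],
                "stat": "Sum",
                "period": 300,
                "region": "us-east-1",
            },
        }
    ]
}

cloudwatch.put_dashboard(
    DashboardName="checkout-operational-status",
    DashboardBody=json.dumps(dashboard_body),
)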
Resources
Related documents:
Related examples:
Automating event responses is key for fast, consistent, and error-free operational handling. Create
streamlined processes and use tools to automatically manage and respond to events, minimizing
manual interventions and enhancing operational effectiveness.
Desired outcome:
Common anti-patterns:
Resources
Related documents:
Related videos:
Related examples:
• You give improvement opportunities equal priority to features in your software development
process.
Common anti-patterns:
• You have not conducted an architecture review on your workload since it was deployed several
years ago.
• You give improvement opportunities a lower priority than new features, so they stay in the
backlog.
• There is no standard for implementing modifications to best practices for the organization.
Implementation guidance
Frequently conduct an architectural review of your workload. Use internal and external best
practices, evaluate your workload, and identify improvement opportunities. Prioritize improvement
opportunities and incorporate them into your software development cadence.
Implementation steps
Desired outcome:
• You have established incident management processes that include post-incident analysis.
• You have observability plans in place to collect data on events.
• With this data, you understand and collect metrics that support your post-incident analysis
process.
• You learn from incidents to improve future outcomes.
Common anti-patterns:
• You administer an application server. Approximately every 23 hours and 55 minutes all
your active sessions are terminated. You have tried to identify what is going wrong on your
application server. You suspect it could instead be a network issue but are unable to get
cooperation from the network team as they are too busy to support you. You lack a predefined
process to follow to get support and collect the information necessary to determine what is
going on.
• You have had data loss within your workload. This is the first time it has happened and the cause
is not obvious. You decide it is not important because you can recreate the data. Data loss starts
occurring with greater frequency, impacting your customers. This also places additional operational
burden on you as you restore the missing data.
Benefits of establishing this best practice:
• You have a predefined process to determine the components, conditions, actions, and events
that contributed to an incident, which helps you identify opportunities for improvement.
• You use data from post-incident analysis to make improvements.
Implementation guidance
Use a process to determine contributing factors. Review all customer-impacting incidents. Have a
process to identify and document the contributing factors of an incident so that you can develop
mitigations to limit or prevent recurrence, as well as procedures for prompt and effective
responses. Communicate incident root causes as appropriate, and tailor the communication to your
target audience. Share learnings openly within your organization.
They also validate investments made in improvements. These feedback loops are the foundation
for continuously improving your workload.
Feedback loops fall into two categories: immediate feedback and retrospective analysis. Immediate
feedback is gathered through review of the performance and outcomes from operations activities.
This feedback comes from team members, customers, or the automated output of the activity.
Immediate feedback is received from things like A/B testing and shipping new features, and it is
essential to failing fast.
Retrospective analysis is performed regularly to capture feedback from the review of operational
outcomes and metrics over time. These retrospectives happen at the end of a sprint, on a cadence,
or after major releases or events. This type of feedback loop validates investments in operations or
your workload. It helps you measure success and validates your strategy.
Desired outcome: You use immediate feedback and retrospective analysis to drive improvements.
There is a mechanism to capture user and team member feedback. Retrospective analysis is used to
identify trends that drive improvements.
Common anti-patterns:
• You launch a new feature but have no way of receiving customer feedback on it.
• After investing in operations improvements, you don’t conduct a retrospective to validate them.
• Feedback loops lead to proposed action items but they aren’t included in the software
development process.
Benefits of establishing this best practice:
• You can work backwards from the customer to drive new features.
• Add the actions to your software development process and communicate status updates to
stakeholders as you make the improvements.
Level of effort for the implementation plan: Medium. To implement this best practice, you need
a way to take in immediate feedback and analyze it. Also, you need to establish a retrospective
analysis process.
Resources
• OPS01-BP01 Evaluate customer needs: Feedback loops are a mechanism to gather external
customer needs.
• OPS01-BP02 Evaluate internal customer needs: Internal stakeholders can use feedback loops to
communicate needs and requirements.
• OPS11-BP02 Perform post-incident analysis: Post-incident analyses are an important form of
retrospective analysis conducted after incidents.
• OPS11-BP07 Perform operations metrics reviews: Operations metrics reviews identify trends and
areas for improvement.
Related documents:
Related videos:
• New team members are onboarded faster because documentation is up to date and searchable.
Implementation guidance
Customer example
AnyCompany Retail hosts an internal Wiki where all knowledge is stored. Team members are
encouraged to add to the knowledge base as they go about their daily duties. On a quarterly basis,
a cross-functional team evaluates which pages are least updated and determines if they should be
archived or updated.
Implementation steps
1. Start with identifying the content management system where knowledge will be stored. Get
agreement from stakeholders across your organization.
a. If you don’t have an existing content management system, consider running a self-hosted wiki
or using a version control repository as a starting point.
2. Develop runbooks for adding, updating, and archiving information. Educate your team on these
processes.
3. Identify what knowledge should be stored in the content management system. Start with daily
activities (runbooks and playbooks) that team members perform. Work with stakeholders to
prioritize what knowledge is added.
4. On a periodic basis, work with stakeholders to identify out-of-date information and archive it or
bring it up to date.
• You collect data from across your environment but do not correlate events and activities.
• You collect detailed data from across your estate, and it drives high Amazon CloudWatch and
AWS CloudTrail activity and cost. However, you do not use this data meaningfully.
• You do not account for business outcomes when defining drivers for improvement.
Implementation guidance
• Understand drivers for improvement: You should only make changes to a system when they
support a desired outcome.
• Desired capabilities: Evaluate desired features and capabilities when evaluating opportunities
for improvement.
• Unacceptable issues: Evaluate unacceptable issues, bugs, and vulnerabilities when evaluating
opportunities for improvement. Track rightsizing options, and seek optimization opportunities.
• AWS Compliance
Review your analysis results and responses with cross-functional teams and business owners. Use
these reviews to establish common understanding, identify additional impacts, and determine
courses of action. Adjust responses as appropriate.
Desired outcomes:
• You review insights with business owners on a regular basis. Business owners provide additional
context to newly-gained insights.
• You review insights and request feedback from technical peers, and you share your learnings
across teams.
• You publish data and insights for other technical and business teams to review. Other
departments factor your learnings into new practices.
• You summarize and review new insights with senior leaders. Senior leaders use these insights to
define strategy.
Common anti-patterns:
• You release a new feature. This feature changes some of your customer behaviors. Your
observability does not take these changes into account. You do not quantify the benefits of
these changes.
• You push a new update and neglect refreshing your CDN. The CDN cache is no longer compatible
with the latest release. You measure the percentage of requests with errors. All of your users
report HTTP 400 errors when communicating with backend servers. You investigate the client
errors and find that you measured the wrong dimension, which wasted your time.
• Your service-level agreement stipulates 99.9% uptime, and your recovery point objective is
four hours. The service owner maintains that the system is zero downtime. You implement an
expensive and complex replication solution, which wastes time and money.
Benefits of establishing this best practice:
• When you validate insights with business owners and subject matter experts, you establish
common understanding and more effectively guide improvement.
• You discover hidden issues and factor them into future decisions.
• Your focus moves from technical outcomes to business outcomes.
• Your maintenance window interrupts a significant retail promotion. The business remains
unaware that there is a standard maintenance window that could be delayed if there are other
business impacting events.
• You suffered an extended outage because you commonly use an outdated library in your
organization. You have since migrated to a supported library. The other teams in your
organization do not know that they are at risk.
• You do not regularly review attainment of customer SLAs. You are trending to not meet your
customer SLAs. There are financial penalties related to not meeting your customer SLAs.
Benefits of establishing this best practice:
• When you meet regularly to review operations metrics, events, and incidents, you maintain
common understanding across teams.
• Your team meets routinely to review metrics and incidents, which positions you to take action on
risks and recognize customer SLAs.
• You share lessons learned, which provides data for prioritization and targeted improvements for
business outcomes.
Implementation guidance
• Regularly perform retrospective analysis of operations metrics with cross-team participants from
different areas of the business.
• Engage stakeholders, including the business, development, and operations teams, to validate
your findings from immediate feedback and retrospective analysis and share lessons learned.
• Use their insights to identify opportunities for improvement and potential courses of action.
Resources
Benefits of establishing this best practice: Share lessons learned to support improvement and to
maximize the benefits of experience.
Implementation guidance
• Document and share lessons learned: Have procedures to document the lessons learned from
the running of operations activities and retrospective analysis so that they can be used by other
teams.
• Share learnings: Have procedures to share lessons learned and associated artifacts across teams.
For example, share updated procedures, guidance, governance, and best practices through an
accessible wiki. Share scripts, code, and libraries through a common repository.
Resources
Related documents:
Related videos:
Resources
Related videos:
• AWS re:Invent 2023 - Improve application resilience with AWS Fault Injection Service
Security
The Security pillar encompasses the ability to protect data, systems, and assets to take
advantage of cloud technologies to improve your security. You can find prescriptive guidance on
implementation in the Security Pillar whitepaper.
Security foundations
Question
• SEC 1. How do you securely operate your workload?
To operate your workload securely, you must apply overarching best practices to every area of
security. Take requirements and processes that you have defined in operational excellence at an
Implementation guidance
AWS accounts provide a security isolation boundary between workloads or resources that operate
at different sensitivity levels. AWS provides tools to manage your cloud workloads at scale through
a multi-account strategy to leverage this isolation boundary. For guidance on the concepts,
patterns, and implementation of a multi-account strategy on AWS, see Organizing Your AWS
Environment Using Multiple Accounts.
When you have multiple AWS accounts under central management, your accounts should be
organized into a hierarchy defined by layers of organizational units (OUs). Security controls
can then be organized and applied to the OUs and member accounts, establishing consistent
preventative controls on member accounts in the organization. The security controls are inherited,
allowing you to filter permissions available to member accounts located at lower levels of an OU
hierarchy. A good design takes advantage of this inheritance to reduce the number and complexity
of security policies required to achieve the desired security controls for each member account.
AWS Organizations and AWS Control Tower are two services that you can use to implement and
manage this multi-account structure in your AWS environment. AWS Organizations allows you to
organize accounts into a hierarchy defined by one or more layers of OUs, with each OU containing
a number of member accounts. Service control policies (SCPs) allow the organization administrator
to establish granular preventative controls on member accounts, and AWS Config can be used to
establish proactive and detective controls on member accounts. Many AWS services integrate with
AWS Organizations to provide delegated administrative controls and to perform service-specific
tasks across all member accounts in the organization.
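As an illustrative sketch (not a recommended baseline), the following Python (boto3) example creates a simple SCP and attaches it to an OU; the policy statements, policy name, and OU ID are placeholders.

import json
import boto3

organizations = boto3.client("organizations")

# Example preventative control: deny leaving the organization and deny
# disabling CloudTrail in member accounts (policy content is illustrative).
scp = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": [
                "organizations:LeaveOrganization",
                "cloudtrail:StopLogging",
                "cloudtrail:DeleteTrail",
            ],
            "Resource": "*",
        }
    ],
}

response = organizations.create_policy(
    Name="baseline-preventative-controls",
    Description="Baseline guardrails applied to workload OUs",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(scp),
)

# Attach the SCP to an OU so it is inherited by member accounts (placeholder ID).
organizations.attach_policy(
    PolicyId=response["Policy"]["PolicySummary"]["Id"],
    TargetId="ou-examplerootid-exampleouid",
)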
Layered on top of AWS Organizations, AWS Control Tower provides a one-click best practices setup
for a multi-account AWS environment with a landing zone. The landing zone is the entry point
to the multi-account environment established by Control Tower. Control Tower provides several
benefits over AWS Organizations. Three benefits that provide improved account governance are:
• Integrated mandatory security controls that are automatically applied to accounts admitted into
the organization.
• Optional controls that can be turned on or off for a given set of OUs.
• Use CloudFormation StackSets to provision resources across multiple AWS accounts and regions
• Organizations FAQ
• AWS Organizations terminology and concepts
• Best Practices for Service Control Policies in an AWS Organizations Multi-Account Environment
• AWS Account Management Reference Guide
• Organizing Your AWS Environment Using Multiple Accounts
Related videos:
Related workshops:
The root user is the most privileged user in an AWS account, with full administrative access to
all resources within the account, and in some cases cannot be constrained by security policies.
Deactivating programmatic access to the root user, establishing appropriate controls for the root
user, and avoiding routine use of the root user helps reduce the risk of inadvertent exposure of the
root credentials and subsequent compromise of the cloud environment.
Desired outcome: Securing the root user helps reduce the chance that accidental or intentional
damage can occur through the misuse of root user credentials. Establishing detective controls can
also alert the appropriate personnel when actions are taken using the root user.
Common anti-patterns:
• Using the root user for tasks other than the few that require root user credentials.
• Neglecting to test contingency plans on a regular basis to verify the functioning of critical
infrastructure, processes, and personnel during an emergency.
management account’s root user can differ from your member account root users, and you can
place preventative security controls on your member account root users.
Implementation steps
The following implementation steps are recommended to establish controls for the root user.
Where applicable, recommendations are cross-referenced to CIS AWS Foundations benchmark
version 1.4.0. In addition to these steps, consult AWS best practice guidelines for securing your
AWS account and resources.
Preventative controls
• Determine who in the organization should have access to the root user credentials.
• Use a two-person rule so that no one individual has access to all necessary credentials and MFA
to obtain root user access.
• Verify that the organization, and not a single individual, maintains control over the phone
number and email alias associated with the account (which are used for password reset and
MFA reset flow).
• Use root user only by exception (CIS 1.7).
• The AWS root user must not be used for everyday tasks, even administrative ones. Only log
in as the root user to perform AWS tasks that require root user credentials. All other actions
should be performed by other users assuming appropriate roles.
• Periodically check that access to the root user is functioning so that procedures are tested prior
to an emergency situation requiring the use of the root user credentials.
• Periodically check that the email address associated with the account and those listed under
Alternate Contacts work. Monitor these email inboxes for security notifications you might receive
from <[email protected]>. Also ensure any phone numbers associated with the account are
working.
• Prepare incident response procedures to respond to root account misuse. Refer to the AWS
Security Incident Response Guide and the best practices in the Incident Response section of the
Security Pillar whitepaper for more information on building an incident response strategy for
your AWS account.
Resources
Related documents:
• The implementation of controls does not strongly align to your control objectives in a
measurable way
• You do not use automation to report on the effectiveness of your controls
Implementation guidance
There are many common cybersecurity frameworks that can form the basis for your security
control objectives. Consider the regulatory requirements, market expectations, and industry
standards for your business to determine which frameworks best support your needs. Examples
include AICPA SOC 2, HITRUST, PCI-DSS, ISO 27001, and NIST SP 800-53.
For the control objectives you identify, understand how AWS services you consume help you to
achieve those objectives. Use AWS Artifact to find documentation and reports aligned to your
target frameworks that describe the scope of responsibility covered by AWS and guidance for the
remaining scope that is your responsibility. For further guidance on how specific services align to
various framework control statements, see AWS Customer Compliance Guides.
As you define the controls that achieve your objectives, codify enforcement using preventative
controls, and automate mitigations using detective controls. Help prevent non-compliant resource
configurations and actions across your AWS organization using service control policies (SCPs).
Implement rules in AWS Config to monitor and report on non-compliant resources, then switch the
rules to an enforcement model once you are confident in their behavior. To deploy sets of pre-defined
and managed rules that align to your cybersecurity frameworks, evaluate the use of AWS Security
Hub standards as your first option. The AWS Foundational Security Best Practices (FSBP) standard
and the CIS AWS Foundations Benchmark are good starting points with controls that align to
many objectives that are shared across multiple standard frameworks. Where Security Hub does
not intrinsically have the control detections desired, it can be complemented using AWS Config
conformance packs.
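For example, a single managed AWS Config rule can be deployed as a detective control with a few lines of Python (boto3); the rule shown here is an illustrative choice, not a prescribed control set.

import boto3

config = boto3.client("config")

# Deploy an AWS managed rule as a detective control; once you are confident
# in its behavior, pair it with remediation or preventative enforcement.
config.put_config_rule(
    ConfigRule={
        "ConfigRuleName": "s3-bucket-public-read-prohibited",
        "Description": "Detects S3 buckets that allow public read access.",
        "Source": {
            "Owner": "AWS",
            "SourceIdentifier": "S3_BUCKET_PUBLIC_READ_PROHIBITED",
        },
    }
)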
Use APN Partner Bundles recommended by the AWS Global Security and Compliance Acceleration
(GSCA) team to get assistance from security advisors, consulting agencies, evidence collection and
reporting systems, auditors, and other complementary services when required.
Implementation steps
1. Evaluate common cybersecurity frameworks, and align your control objectives to the ones
chosen.
you identify new threats. You take mitigating action against these threats. You adopt AWS services
that automatically update with the latest threat intelligence.
Common anti-patterns:
• Not having a reliable and repeatable mechanism to stay informed of the latest threat
intelligence.
• Maintaining a manual inventory of your technology portfolio, workloads, and dependencies that
require human review for potential vulnerabilities and exposures.
• Not having mechanisms in place to update your workloads and dependencies to the latest
versions available that provide known threat mitigations.
Benefits of establishing this best practice: Using threat intelligence sources to stay up to date
reduces the risk of missing out on important changes to the threat landscape that can impact
your business. Having automation in place to scan, detect, and remediate where potential
vulnerabilities or exposures exist in your workloads and their dependencies can help you mitigate
risks quickly and predictably, compared to manual alternatives. This helps control time and costs
related to vulnerability mitigation.
Implementation guidance
Review trusted threat intelligence publications to stay on top of the threat landscape. Consult
the MITRE ATT&CK knowledge base for documentation on known adversarial tactics, techniques,
and procedures (TTPs). Review MITRE's Common Vulnerabilities and Exposures (CVE) list to
stay informed on known vulnerabilities in products you rely on. Understand critical risks to web
applications with the Open Worldwide Application Security Project (OWASP)'s popular OWASP Top
10 project.
Stay up to date on AWS security events and recommended remediation steps with AWS Security
Bulletins for CVEs.
To reduce your overall effort and overhead of staying up to date, consider using AWS services that
automatically incorporate new threat intelligence over time. For example, Amazon GuardDuty
stays up to date with industry threat intelligence for detecting anomalous behaviors and threat
signatures within your accounts. Amazon Inspector automatically keeps a database of the CVEs
it uses for its continuous scanning features up to date. Both AWS WAF and AWS Shield Advanced
provide managed rule groups that are updated automatically as new threats emerge.
• Not including security management tasks in the total cost of ownership of hosting
technologies on virtual machines when compared to managed service options.
Benefits of establishing this best practice: Using managed services can reduce your overall burden
of managing operational security controls, which can reduce your security risks and total cost of
ownership. Time that would otherwise be spent on certain security tasks can be reinvested into
tasks that provide more value to your business. Managed services can also reduce the scope of your
compliance requirements by shifting some control requirements to AWS.
Implementation guidance
There are multiple ways you can integrate the components of your workload on AWS. Installing
and running technologies on Amazon EC2 instances often requires you to take on the largest share
of the overall security responsibility. To help reduce the burden of operating certain controls,
identify AWS managed services that reduce the scope of your side of the shared responsibility
model and understand how you can use them in your existing architecture. Examples include
using the Amazon Relational Database Service (Amazon RDS) for deploying databases, Amazon
Elastic Kubernetes Service (Amazon EKS) or Amazon Elastic Container Service (Amazon ECS)
for orchestrating containers, or using serverless options. When building new applications, think
through which services can help reduce time and cost when it comes to implementing and
managing security controls.
Compliance requirements can also be a factor when selecting services. Managed services can shift
the compliance of some requirements to AWS. Discuss with your compliance team about their
degree of comfort with auditing the aspects of services you operate and manage and accepting
control statements in relevant AWS audit reports. You can provide the audit artifacts found in AWS
Artifact to your auditors or regulators as evidence of AWS security controls. You can also use the
responsibility guidance provided by some of the AWS audit artifacts to design your architecture,
along with the AWS Customer Compliance Guides. This guidance helps determine the additional
security controls you should put in place in order to support the specific use cases of your system.
When using managed services, be familiar with the process of updating their resources to
newer versions (for example, updating the version of a database managed by Amazon RDS, or a
programming language runtime for an AWS Lambda function). While the managed service may
perform this operation for you, configuring the timing of the update and understanding the impact
Related videos:
• How do I migrate to an Amazon RDS or Aurora MySQL DB instance using AWS DMS?
• AWS re:Invent 2023 - Manage resource lifecycle events at scale with AWS Health
Apply modern DevOps practices as you develop and deploy security controls that are standard
across your AWS environments. Define standard security controls and configurations using
Infrastructure as Code (IaC) templates, capture changes in a version control system, test changes as
part of a CI/CD pipeline, and automate the deployment of changes to your AWS environments.
Desired outcome: Standardized security controls are captured in IaC templates and committed
to a version control system. CI/CD pipelines are in place to detect changes and to automate
testing and deployment to your AWS environments. Guardrails are in place to detect and alert on
misconfigurations in templates before proceeding to deployment. Workloads are deployed into
environments where standard controls are in place. Teams have access to deploy approved service
configurations through a self-service mechanism. Secure backup and recovery strategies are in
place for control configurations, scripts, and related data.
Common anti-patterns:
• Making changes to your standard security controls manually, through a web console or
command-line interface.
• Relying on individual workload teams to manually implement the controls a central team
defines.
• Relying on a central security team to deploy workload-level controls at the request of a workload
team.
• Allowing the same individuals or teams to develop, test, and deploy security control automation
scripts without proper separation of duties or checks and balances.
Benefits of establishing this best practice: Using templates to define your standard security
controls allows you to track and compare changes over time using a version control system. Using
automation to test and deploy changes creates standardization and predictability, increasing
the chances of a successful deployment and reducing manual repetitive tasks. Providing a self-
serve mechanism for workload teams to deploy approved services and configurations reduces the
2. Create CI/CD pipelines to test and deploy your templates. Define tests to check for
misconfigurations and that templates adhere to your company standards.
3. Build a catalog of standardized templates for workload teams to deploy AWS accounts and
services according to your requirements.
4. Implement secure backup and recovery strategies for your control configurations, scripts, and
related data.
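The following is a minimal sketch of capturing a standard control as code, here using the AWS CDK in Python (an assumption for illustration; CloudFormation, Terraform, or other IaC tooling can serve the same purpose). It defines a baseline Amazon S3 bucket configuration with encryption, TLS enforcement, and public access blocked. Committing a template like this to version control and deploying it through a pipeline provides the change history, testing, and repeatability described above.

import aws_cdk as cdk
from aws_cdk import aws_s3 as s3
from constructs import Construct

class BaselineStorageStack(cdk.Stack):
    """Standardized, security-approved S3 bucket configuration (illustrative)."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        s3.Bucket(
            self,
            "BaselineBucket",
            encryption=s3.BucketEncryption.S3_MANAGED,       # encrypt at rest
            block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
            enforce_ssl=True,                                # require TLS in transit
            versioned=True,                                  # support recovery
        )

app = cdk.App()
BaselineStorageStack(app, "BaselineStorageStack")
app.synth()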
Resources
Related documents:
Related examples:
• Automate account creation, and resource provisioning using Service Catalog, AWS Organizations,
and AWS Lambda
• Strengthen the DevOps pipeline and protect data with AWS Secrets Manager, AWS KMS, and
AWS Certificate Manager
Related tools:
Perform threat modeling to identify and maintain an up-to-date register of potential threats
and associated mitigations for your workload. Prioritize your threats and adapt your security
Implementation steps
There are many different ways to perform threat modeling. Much like programming languages,
there are advantages and disadvantages to each, and you should choose the way that works best
for you. One approach is to start with Shostack’s 4 Question Frame for Threat Modeling, which
poses open-ended questions to provide structure to your threat modeling exercise:
What are we working on?
The purpose of this question is to help you understand and agree upon the system you are
building and the details about that system that are relevant to security. Creating a model or
diagram is the most popular way to answer this question, as it helps you to visualize what
you are building, for example, using a data flow diagram. Writing down assumptions and
important details about your system also helps you define what is in scope. This allows everyone
contributing to the threat model to focus on the same thing, and avoid time-consuming detours
into out-of-scope topics (including out of date versions of your system). For example, if you are
building a web application, it is probably not worth your time threat modeling the operating
system trusted boot sequence for browser clients, as you have no ability to affect this with your
design.
What can go wrong?
This is where you identify threats to your system. Threats are accidental or intentional actions or
events that have unwanted impacts and could affect the security of your system. Without a clear
understanding of what could go wrong, you have no way of doing anything about it.
There is no canonical list of what can go wrong. Creating this list requires brainstorming and
collaboration between all of the individuals within your team and relevant personas involved in
the threat modeling exercise. You can aid your brainstorming by using a model for identifying
threats, such as STRIDE, which suggests different categories to evaluate: Spoofing, Tampering,
Repudiation, Information Disclosure, Denial of Service, and Elevation of privilege. In addition,
you might want to aid the brainstorming by reviewing existing lists and research for inspiration,
including the OWASP Top 10, HiTrust Threat Catalog, and your organization’s own threat
catalog.
Threat Composer
To aid and guide you in performing threat modeling, consider using the Threat Composer tool,
which aims to reduce your time-to-value when threat modeling. The tool helps you do the
following:
• Write useful threat statements aligned to threat grammar that work in a natural non-linear
workflow
• Generate a human-readable threat model
• Generate a machine-readable threat model to allow you to treat threat models as code
• Help you to quickly identify areas of quality and coverage improvement using the Insights
Dashboard
For further reference, visit Threat Composer and switch to the system-defined Example
Workspace.
Resources
Related documents:
Related videos:
Related training:
You can subscribe to an AWS Daily Feature Updates topic using Amazon Simple Notification Service
(Amazon SNS) for a comprehensive daily summary of updates. Some security services, such as
Amazon GuardDuty and AWS Security Hub, provide their own SNS topics to stay informed about
new standards, findings, and other updates for those particular services.
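Subscribing to such a topic is a short Python (boto3) call; the topic ARN and email address below are placeholders, so use the ARN published in the AWS documentation for the topic you want.

import boto3

# Subscribe an email address to an SNS topic that publishes service updates.
# The topic ARN below is a placeholder; use the ARN published in the AWS
# documentation for the Daily Feature Updates topic or a service-specific topic.
sns = boto3.client("sns", region_name="us-east-1")
sns.subscribe(
    TopicArn="arn:aws:sns:us-east-1:123456789012:aws-new-feature-updates",
    Protocol="email",
    Endpoint="[email protected]",
)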
New services and features are also announced and described in detail during conferences, events,
and webinars conducted around the globe each year. Of particular note are the annual AWS
re:Inforce security conference and the more general AWS re:Invent conference. The previously
mentioned AWS news channels share these conference announcements about security and other
services, and you can view deep dive educational breakout sessions online at the AWS Events
channel on YouTube.
You can also ask your AWS account team about the latest security service updates and
recommendations. You can reach out to your team through the Sales Support form if you do not
have their direct contact information. Similarly, if you subscribed to AWS Enterprise Support, you
will receive weekly updates from your Technical Account Manager (TAM) and can schedule a regular
review meeting with them.
Implementation steps
1. Subscribe to the various blogs and bulletins with your favorite RSS reader, or subscribe to the
Daily Feature Updates SNS topic.
2. Evaluate which AWS events to attend to learn first-hand about new features and services.
3. Set up meetings with your AWS account team for any questions about updating security services
and features.
4. Consider subscribing to Enterprise Support to have regular consultations with a Technical
Account Manager (TAM).
Resources
• PERF01-BP01 Learn about and understand available cloud services and features
• COST01-BP07 Keep up-to-date with new service releases
inadvertently disclosed or are easily guessed. Use strong sign-in mechanisms to reduce these risks
by requiring MFA and strong password policies.
Desired outcome: Reduce the risks of unintended access to credentials in AWS by using strong
sign-in mechanisms for AWS Identity and Access Management (IAM) users, the AWS account
root user, AWS IAM Identity Center (successor to AWS Single Sign-On), and third-party identity
providers. This means requiring MFA, enforcing strong password policies, and detecting anomalous
login behavior.
Common anti-patterns:
• Not enforcing a strong password policy for your identities including complex passwords and
MFA.
• Sharing the same credentials among different users.
• Not using detective controls for suspicious sign-ins.
Implementation guidance
There are many ways for human identities to sign in to AWS. It is an AWS best practice to rely on a
centralized identity provider using federation (direct federation or using AWS IAM Identity Center)
when authenticating to AWS. In that case, you should establish a secure sign-in process with your
identity provider or Microsoft Active Directory.
When you first open an AWS account, you begin with an AWS account root user. You should only
use the account root user to set up access for your users (and for tasks that require the root user).
It’s important to turn on MFA for the account root user immediately after opening your AWS
account and to secure the root user using the AWS best practice guide.
If you create users in AWS IAM Identity Center, then secure the sign-in process in that service. For
consumer identities, you can use Amazon Cognito user pools and secure the sign-in process in that
service, or by using one of the identity providers that Amazon Cognito user pools supports.
If you are using AWS Identity and Access Management (IAM) users, you would secure the sign-in
process using IAM.
Regardless of the sign-in method, it’s critical to enforce a strong sign-in policy.
Implementation steps
• Create an IAM policy to enforce MFA sign-in so that users are allowed to manage their own
passwords and MFA devices.
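The following Python snippet sketches the deny-without-MFA portion of such a policy. It is a partial, illustrative statement; the complete pattern documented by AWS also includes allow statements so users can manage their own password and MFA device.

import json

# Partial sketch: deny most actions when the request was not MFA-authenticated.
# The complete pattern also allows users to manage their own password and MFA
# device so they can bootstrap MFA; see the IAM documentation for the full policy.
deny_without_mfa = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyAllExceptSelfServiceWithoutMFA",
            "Effect": "Deny",
            "NotAction": [
                "iam:ChangePassword",
                "iam:CreateVirtualMFADevice",
                "iam:EnableMFADevice",
                "iam:ListMFADevices",
                "iam:ResyncMFADevice",
                "sts:GetSessionToken",
            ],
            "Resource": "*",
            "Condition": {
                "BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}
            },
        }
    ],
}

print(json.dumps(deny_without_mfa, indent=2))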
Resources
Related documents:
Related videos:
When doing any type of authentication, it’s best to use temporary credentials instead of long-term
credentials to reduce or eliminate risks, such as credentials being inadvertently disclosed, shared,
or stolen.
Desired outcome: To reduce the risk of long-term credentials, use temporary credentials wherever
possible for both human and machine identities. Long-term credentials create many risks,
for example, they can be uploaded in code to public GitHub repositories. By using temporary
credentials, you significantly reduce the chances of credentials becoming compromised.
done either with direct federation to each AWS account or using AWS IAM Identity Center and
the identity provider of your choice. Federation provides a number of advantages over using IAM
users in addition to eliminating long-term credentials. Your users can also request temporary
credentials from the command line for direct federation or by using IAM Identity Center. This
means that there are few use cases that require IAM users or long-term credentials for your
users.
• When granting third parties, such as software as a service (SaaS) providers, access to resources in
your AWS account, you can use cross-account roles and resource-based policies.
• If you need to grant applications for consumers or customers access to your AWS resources, you
can use Amazon Cognito identity pools or Amazon Cognito user pools to provide temporary
credentials. The permissions for the credentials are configured through IAM roles. You can also
define a separate IAM role with limited permissions for guest users who are not authenticated.
For machine identities, you might need to use long-term credentials. In these cases, you should
require workloads to use temporary credentials with IAM roles to access AWS.
• For Amazon Elastic Compute Cloud (Amazon EC2), you can use roles for Amazon EC2.
• AWS Lambda allows you to configure a Lambda execution role to grant the service permissions
to perform AWS actions using temporary credentials. There are many other similar models for
AWS services to grant temporary credentials using IAM roles.
• For IoT devices, you can use the AWS IoT Core credential provider to request temporary
credentials.
• For on-premises systems or systems that run outside of AWS that need access to AWS resources,
you can use IAM Roles Anywhere.
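Where a role is available, requesting temporary credentials is a single call. The following Python (boto3) sketch assumes a role and builds a session from the short-lived credentials it returns; the role ARN and session name are placeholders.

import boto3

sts = boto3.client("sts")

# Exchange the caller's identity for short-lived credentials scoped to a role
# (role ARN is a placeholder; the credentials expire automatically).
response = sts.assume_role(
    RoleArn="arn:aws:iam::111122223333:role/deployment-role",
    RoleSessionName="ci-deployment",
    DurationSeconds=3600,
)

credentials = response["Credentials"]
workload_session = boto3.Session(
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)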
There are scenarios where temporary credentials are not an option and you might need to use
long-term credentials. In these situations, audit and rotate credentials periodically and rotate
access keys regularly for use cases that require long-term credentials. Some examples that might
require long-term credentials include WordPress plugins and third-party AWS clients. In situations
where you must use long-term credentials, or for credentials other than AWS access keys, such
as database logins, you can use a service that is designed to handle the management of secrets,
such as AWS Secrets Manager. Secrets Manager makes it simple to manage, rotate, and securely
store encrypted secrets using supported services. For more information about rotating long-term
credentials, see rotating access keys.
• Reducing the number of long-term credentials required by replacing them with short-term
credentials when possible.
• Establishing secure storage and automated rotation of remaining long-term credentials.
• Auditing access to secrets that exist in the workload.
• Continual monitoring to verify that no secrets are embedded in source code during the
development process.
• Reduce the likelihood of credentials being inadvertently disclosed.
Common anti-patterns:
Implementation guidance
In the past, credentials used to authenticate to databases and third-party APIs, tokens, and other
secrets might have been embedded in source code or in environment files.
mechanisms to store these credentials securely, automatically rotate them, and audit their usage.
The best way to approach secrets management is to follow the guidance of remove, replace, and
rotate. The most secure credential is one that you do not have to store, manage, or handle. There
• Application and database credentials (passwords – plain text string) – Rotate: Store credentials
in AWS Secrets Manager and establish automated rotation if possible.
• Amazon RDS and Aurora admin database credentials (passwords – plain text string) – Replace:
Use the Secrets Manager integration with Amazon RDS or Amazon Aurora. In addition, some RDS
database types can use IAM roles instead of passwords for some use cases (for more detail, see
IAM database authentication).
• API tokens and keys (secret tokens – plain text string) – Rotate: Store in AWS Secrets Manager
and establish automated rotation if possible.
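For credentials that must remain long-term, such as the database passwords above, the following Python (boto3) sketch retrieves them at runtime from AWS Secrets Manager instead of embedding them; the secret name and JSON keys are placeholders.

import json
import boto3

secretsmanager = boto3.client("secretsmanager")

# Fetch the database credential at runtime instead of embedding it in code or
# configuration files (secret name is a placeholder).
secret = secretsmanager.get_secret_value(SecretId="prod/orders/db-credentials")
credentials = json.loads(secret["SecretString"])

username = credentials["username"]
password = credentials["password"]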
A common anti-pattern is embedding IAM access keys inside source code, configuration files,
or mobile apps. When an IAM access key is required to communicate with an AWS service, use
temporary (short-term) security credentials. These short-term credentials can be provided through
IAM roles for EC2 instances, execution roles for Lambda functions, Cognito IAM roles for mobile
user access, and IoT Core policies for IoT devices. When interfacing with third parties, prefer
a. Consider using a tool such as git-secrets to prevent committing new secrets to your source
code repository.
5. Monitor Secrets Manager activity for indications of unexpected usage, inappropriate secret
access, or attempts to delete secrets.
6. Reduce human exposure to credentials. Restrict access to read, write, and modify credentials to
an IAM role dedicated for this purpose, and only provide access to assume the role to a small
subset of operational users.
Resources
Related documents:
Related videos:
Related workshops:
Implementation guidance
Workforce users like employees and contractors in your organization may require access to AWS
using the AWS Management Console or AWS Command Line Interface (AWS CLI) to perform
their job functions. You can grant AWS access to your workforce users by federating from your
centralized identity provider to AWS at two levels: direct federation to each AWS account or
federating to multiple accounts in your AWS organization.
• To federate your workforce users directly with each AWS account, you can use a centralized
identity provider to federate to AWS Identity and Access Management in that account. The
flexibility of IAM allows you to enable a separate SAML 2.0 or OpenID Connect (OIDC)
Identity Provider for each AWS account and use federated user attributes for access control.
Your workforce users will use their web browser to sign in to the identity provider by providing
their credentials (such as passwords and MFA token codes). The identity provider issues a SAML
assertion to their browser that is submitted to the AWS Management Console sign in URL to
allow the user to single sign-on to the AWS Management Console by assuming an IAM Role. Your
users can also obtain temporary AWS API credentials for use in the AWS CLI or AWS SDKs from
AWS STS by assuming the IAM role using a SAML assertion from the identity provider.
• To federate your workforce users with multiple accounts in your AWS organization, you can use
AWS IAM Identity Center to centrally manage access for your workforce users to AWS accounts
and applications. You enable Identity Center for your organization and configure your identity
source. IAM Identity Center provides a default identity source directory which you can use to
manage your users and groups. Alternatively, you can choose an external identity source by
connecting to your external identity provider using SAML 2.0 and automatically provisioning
users and groups using SCIM, or connecting to your Microsoft AD Directory using AWS Directory
Service. Once an identity source is configured, you can assign access to users and groups to
AWS accounts by defining least-privilege policies in your permission sets. Your workforce users
can authenticate through your central identity provider to sign in to the AWS access portal and
use single sign-on to the AWS accounts and cloud applications assigned to them. Your users can
configure the AWS CLI v2 to authenticate with Identity Center and get credentials to run AWS CLI
commands. Identity Center also allows single sign-on access to AWS applications such as Amazon
SageMaker Studio and AWS IoT SiteWise Monitor portals.
After you follow the preceding guidance, your workforce users will no longer need to use IAM users
and groups for normal operations when managing workloads on AWS. Instead, your users and
Resources
Related documents:
• How to use customer managed policies in IAM Identity Center for advanced use cases
Related videos:
• AWS re:Inforce 2022 - AWS Identity and Access Management (IAM) deep dive
• AWS re:Invent 2022 - Simplify your existing workforce access with IAM Identity Center
Related examples:
• Workshop: Using AWS IAM Identity Center to achieve strong identity management
Related tools:
can include, but are not limited to, IAM users, AWS IAM Identity Center users, Active Directory
users, or users in a different upstream identity provider. For example, remove people that leave
the organization, and remove cross-account roles that are no longer required. Have a process
in place to periodically audit permissions to the services accessed by an IAM entity. This helps
you identify the policies you need to modify to remove any unused permissions. Use credential
reports and AWS Identity and Access Management Access Analyzer to audit IAM credentials and
permissions. You can use Amazon CloudWatch to set up alarms for specific API calls called within
your AWS environment. Amazon GuardDuty can also alert you to unexpected activity, which
might indicate overly permissive access or unintended access to IAM credentials.
• Rotate credentials regularly: When you are unable to use temporary credentials, rotate long-
term IAM access keys regularly (maximum every 90 days). If an access key is unintentionally
disclosed without your knowledge, this limits how long the credentials can be used to access
your resources. For information about rotating access keys for IAM users, see Rotating access
keys.
• Review IAM permissions: To improve the security of your AWS account, regularly review and
monitor each of your IAM policies. Verify that policies adhere to the principle of least privilege.
• Consider automating IAM resource creation and updates: IAM Identity Center automates many
IAM tasks, such as role and policy management. Alternatively, AWS CloudFormation can be used
to automate the deployment of IAM resources, including roles and policies, to reduce the chance
of human error because the templates can be verified and version controlled.
• Use IAM Roles Anywhere to replace IAM users for machine identities: IAM Roles Anywhere
allows you to use roles in areas that you traditionally could not, such as on-premises servers. IAM
Roles Anywhere uses a trusted X.509 certificate to authenticate to AWS and receive temporary
credentials. Using IAM Roles Anywhere avoids the need to rotate these credentials, as long-term
credentials are no longer stored in your on-premises environment. Please note that you will need
to monitor and rotate the X.509 certificate as it approaches expiration.
Resources
Related documents:
• Defining groups at too granular a level, creating duplication and confusion about membership.
• Using groups with duplicate permissions across subsets of resources when attributes can be used
instead.
• Not managing groups, attributes, and memberships through a standardized identity provider
integrated with your AWS environments.
Implementation guidance
AWS permissions are defined in documents called policies that are associated to a principal, such
as a user, group, role, or resource. For your workforce, this allows you to define groups based on
the function your users perform for your organization, rather than based on the resources being
accessed. For example, a WebAppDeveloper group may have a policy attached for configuring a
service such as Amazon CloudFront within a development account. An AutomationDeveloper
group may have some CloudFront permissions in common with the WebAppDeveloper group.
These permissions can be captured in a separate policy and associated to both groups, rather than
having users from both functions belong to a CloudFrontAccess group.
In addition to groups, you can use attributes to further scope access. For example, you may have a
Project attribute for users in your WebAppDeveloper group to scope access to resources specific
to their project. Using this technique removes the need to have different groups for application
developers working on different projects if their permissions are otherwise the same. The way
you refer to attributes in permission policies is based on their source, whether they are defined as
part of your federation protocol (such as SAML, OIDC, or SCIM), as custom SAML assertions, or set
within IAM Identity Center.
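The following Python snippet sketches an attribute-based policy statement of this kind; the service actions and tag key are illustrative assumptions, and the pattern simply matches the principal's Project attribute to the resource's Project tag.

import json

# Illustrative ABAC statement: allow actions only on resources whose Project
# tag matches the Project attribute (principal tag) of the signed-in user.
abac_statement = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["ec2:StartInstances", "ec2:StopInstances"],
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/Project": "${aws:PrincipalTag/Project}"
                }
            },
        }
    ],
}

print(json.dumps(abac_statement, indent=2))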
Implementation steps
Manage permissions to control access to people and machine identities that require access to AWS
and your workload. Permissions control who can access what, and under what conditions.
Best practices
Common anti-patterns:
set permissions at scale, rather than defining permissions for individual users. For example, you
can allow a group of developers access to manage only resources for their project. This way, if a
developer leaves the project, the developer’s access is automatically revoked without changing the
underlying access policies.
Desired outcome: Users should only have the permissions required to do their job. Users should
only be given access to production environments to perform a specific task within a limited
time period, and access should be revoked once that task is complete. Permissions should be
revoked when no longer needed, including when a user moves onto a different project or job
function. Administrator privileges should be given only to a small group of trusted administrators.
Permissions should be reviewed regularly to avoid permission creep. Machine or system accounts
should be given the smallest set of permissions needed to complete their tasks.
Common anti-patterns:
Implementation guidance
The principle of least privilege states that identities should only be permitted to perform the
smallest set of actions necessary to fulfill a specific task. This balances usability, efficiency, and
security. Operating under this principle helps limit unintended access and helps track who has
access to what resources. IAM users and roles have no permissions by default. The root user has full
access by default and should be tightly controlled, monitored, and used only for tasks that require
root access.
IAM policies are used to explicitly grant permissions to IAM roles or specific resources. For example,
identity-based policies can be attached to IAM groups, while S3 buckets can be controlled by
resource-based policies.
When creating an IAM policy, you can specify the service actions, resources, and conditions that
must be true for AWS to allow or deny access. AWS supports a variety of conditions to help you
scope down access. For example, by using the PrincipalOrgID condition key, you can deny
actions if the requestor isn’t a part of your AWS Organization.
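As a brief sketch, the following Python snippet builds a resource-based policy (for example, for an Amazon S3 bucket) that uses the PrincipalOrgID condition key to deny requests from principals outside your organization; the bucket name and organization ID are placeholders.

import json

# Illustrative S3 bucket policy: deny any request made by a principal that is
# not part of the organization (organization ID and bucket name are placeholders).
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::example-bucket",
                "arn:aws:s3:::example-bucket/*",
            ],
            "Condition": {
                "StringNotEquals": {"aws:PrincipalOrgID": "o-exampleorgid"}
            },
        }
    ],
}

print(json.dumps(bucket_policy, indent=2))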
environments. Using these tags, you can restrict developers to the development environment.
By combining tagging and permissions policies, you can achieve fine-grained resource access
without needing to define complicated, custom policies for every job function.
• Use service control policies for AWS Organizations. Service control policies centrally control
the maximum available permissions for member accounts in your organization. Importantly,
service control policies allow you to restrict root user permissions in member accounts. Also
consider using AWS Control Tower, which provides prescriptive managed controls that enrich
AWS Organizations. You can also define your own controls within Control Tower.
• Establish a user lifecycle policy for your organization: User lifecycle policies define tasks to
perform when users are onboarded onto AWS, change job role or scope, or no longer need access
to AWS. Permission reviews should be done during each step of a user’s lifecycle to verify that
permissions are properly restrictive and to avoid permissions creep.
• Establish a regular schedule to review permissions and remove any unneeded permissions:
You should regularly review user access to verify that users do not have overly permissive access.
AWS Config and IAM Access Analyzer can help when auditing user permissions.
• Establish a job role matrix: A job role matrix visualizes the various roles and access levels
required within your AWS footprint. Using a job role matrix, you can define and separate
permissions based on user responsibilities within your organization. Use groups instead of
applying permissions directly to individual users or roles.
Resources
Related documents:
• You have defined and documented the failure modes that count as an emergency: consider
your normal circumstances and the systems your users depend on to manage their workloads.
Consider how each of these dependencies can fail and cause an emergency situation. You may
find the questions and best practices in the Reliability pillar useful to identify failure modes and
architect more resilient systems to minimize the likelihood of failures.
• You have documented the steps that must be followed to confirm a failure as an emergency. For
example, you can require your identity administrators to check the status of your primary and
standby identity providers and, if both are unavailable, declare an emergency event for identity
provider failure.
• You have defined an emergency access process specific to each type of emergency or failure
mode. Being specific can reduce the temptation on the part of your users to overuse a
general process for all types of emergencies. Your emergency access processes should describe the
circumstances under which each process is to be used and, conversely, the situations where the
process should not be used, pointing instead to alternate processes that may apply.
• Your processes are well-documented with detailed instructions and playbooks that can be
followed quickly and efficiently. Remember that an emergency event can be a stressful time for
your users and they may be under extreme time pressure, so design your process to be as simple
as possible.
Common anti-patterns:
• You do not have well-documented and well-tested emergency access processes. Your users are
unprepared for an emergency and follow improvised processes when an emergency event arises.
• Your emergency access processes depend on the same systems (such as a centralized identity
provider) as your normal access mechanisms. This means that the failure of such a system may
impact both your normal and emergency access mechanisms and impair your ability to recover
from the failure.
• Your emergency access processes are used in non-emergency situations. For example, your users
frequently misuse emergency access processes as they find it easier to make changes directly
than submit changes through a pipeline.
• Your emergency access processes do not generate sufficient logs to audit the processes, or the
logs are not monitored to alert for potential misuse of the processes.
• Tie each use of an emergency access process to a corresponding emergency event tracked in your incident management system. Having a uniform system for emergency access allows you to track those requests in a single system, analyze usage trends, and improve your processes.
• Verify that your emergency access processes can only be initiated by authorized users and
require approvals from the user's peers or management as appropriate. The approval process
should operate effectively both inside and outside business hours. Define how approval requests fall back to secondary approvers if the primary approvers are unavailable, and how they escalate up your management chain until approved.
• Verify that the process generates detailed audit logs and events for both successful and failed
attempts to gain emergency access. Monitor both the request process and the emergency
access mechanism to detect misuse or unauthorized accesses. Correlate activity with ongoing
emergency events from your incident management system and alert when actions happen
outside of expected time periods. For example, you should monitor and alert on activity in the
emergency access AWS account, as it should never be used in normal operations.
• Test emergency access processes periodically to verify that the steps are clear and grant the
correct level of access quickly and efficiently. Your emergency access processes should be tested
as part of incident response simulations (SEC10-BP07) and disaster recovery tests (REL13-BP03).
In the unlikely event that your centralized identity provider is unavailable, your workforce users
can't federate to AWS accounts or manage their workloads. In this emergency event, you can
provide an emergency access process for a small set of administrators to access AWS accounts to
perform critical tasks that cannot wait until your centralized identity providers are back online.
For example, your identity provider is unavailable for 4 hours and during that period you need
to modify the upper limits of an Amazon EC2 Auto Scaling group in a Production account to
handle an unexpected spike in customer traffic. Your emergency administrators should follow the
emergency access process to gain access to the specific production AWS account and make the
necessary changes.
To allow your workforce users to federate to AWS accounts, you can configure IAM Identity Center with an external identity provider or create an IAM identity provider (SEC02-BP04). Typically, you configure these by importing a SAML metadata XML document provided by your identity provider. The metadata XML document includes an X.509 certificate corresponding to a private key that the identity provider uses to sign its SAML assertions.
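The following sketch shows one way to import and later refresh that SAML metadata with boto3. The file path and provider name are assumptions; if you use IAM Identity Center rather than an IAM identity provider, the equivalent update is performed through the Identity Center external identity provider settings instead.

```python
import boto3

iam = boto3.client("iam")

# Read the metadata XML exported from your identity provider (path is assumed).
with open("idp-metadata.xml") as f:
    metadata = f.read()

# Create an IAM SAML identity provider from the metadata document.
resp = iam.create_saml_provider(
    SAMLMetadataDocument=metadata,
    Name="workforce-idp",  # hypothetical provider name
)

# Later, if the identity provider rotates its signing certificate, import the
# refreshed metadata to update the existing provider.
iam.update_saml_provider(
    SAMLMetadataDocument=metadata,
    SAMLProviderArn=resp["SAMLProviderArn"],
)
```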
In such an emergency event, you can provide your identity administrators access to AWS to fix the
federation issues. For example, your identity administrator uses the emergency access process to
sign in to the emergency access AWS account, switch to a role in the Identity Center administrator account, and update the external identity provider configuration by importing the latest SAML metadata XML document from your identity provider to re-enable federation. Once federation
is fixed, your workforce users continue to use the normal operating process to federate into their
workload accounts.
You can follow the approaches detailed in the previous Failure Mode 1 to create an emergency
access process. You can grant least-privilege permissions to your identity administrators to access
only the Identity Center administrator account and perform actions on Identity Center in that
account.
In the unlikely event of an IAM Identity Center or AWS Region disruption, we recommend that
you set up a configuration that you can use to provide temporary access to the AWS Management
Console.
The emergency access process uses direct federation from your identity provider to IAM in an
emergency account. For detail on the process and design considerations, see Set up emergency
access to the AWS Management Console.
Implementation steps
• Create IAM roles corresponding to the emergency operations groups in the emergency access
account.
Resources
Related documents:
• Enabling SAML 2.0 federated users to access the AWS Management Console
Related videos:
• AWS re:Invent 2022 - Simplify your existing workforce access with IAM Identity Center
• AWS re:Inforce 2022 - AWS Identity and Access Management (IAM) deep dive
Related examples:
As your teams determine what access is required, remove unneeded permissions and establish
review processes to achieve least privilege permissions. Continually monitor and remove unused
identities and permissions for both human and machine access.
• Determine an acceptable timeframe and usage policy for IAM users and roles: Use the
last accessed timestamp to identify unused users and roles and remove them. Review service
and action last accessed information to identify and scope permissions for specific users and
roles. For example, you can use last accessed information to identify the specific Amazon
S3 actions that your application role requires and restrict the role’s access to only those
actions. Last accessed information is available in the AWS Management Console and programmatically, so you can incorporate it into your infrastructure workflows and automated tools (a sketch follows this list).
• Consider logging data events in AWS CloudTrail: By default, CloudTrail does not log data
events such as Amazon S3 object-level activity (for example, GetObject and DeleteObject)
or Amazon DynamoDB table activities (for example, PutItem and DeleteItem). Consider logging these data events to determine which users and roles need access to specific Amazon S3 objects or DynamoDB table items.
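Both of the practices above can be scripted. The following sketch uses IAM last accessed information to flag services a role has never used, and adds Amazon S3 data events to an existing CloudTrail trail. The role ARN, trail name, and bucket ARN are placeholders.

```python
import time
import boto3

iam = boto3.client("iam")
cloudtrail = boto3.client("cloudtrail")

# 1) Flag services that a role has never used, based on last accessed data.
role_arn = "arn:aws:iam::111122223333:role/application-role"  # assumed role
job_id = iam.generate_service_last_accessed_details(Arn=role_arn)["JobId"]

details = iam.get_service_last_accessed_details(JobId=job_id)
while details["JobStatus"] == "IN_PROGRESS":
    time.sleep(2)
    details = iam.get_service_last_accessed_details(JobId=job_id)

for svc in details["ServicesLastAccessed"]:
    if "LastAuthenticated" not in svc:
        print(f"Candidate permission to remove: {svc['ServiceNamespace']}")

# 2) Log S3 object-level data events on an existing trail to see which
#    objects users and roles actually access.
cloudtrail.put_event_selectors(
    TrailName="management-events-trail",  # assumed trail name
    EventSelectors=[
        {
            "ReadWriteType": "All",
            "IncludeManagementEvents": True,
            "DataResources": [
                {"Type": "AWS::S3::Object", "Values": ["arn:aws:s3:::example-data-bucket/"]}
            ],
        }
    ],
)
```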
Resources
Related documents:
Related videos:
• AWS re:Inforce 2022 - AWS Identity and Access Management (IAM) deep dive
additional layers are applied. This helps you grant access based on the principle of least privilege,
reducing the risk of unintended access due to policy misconfiguration.
The first step to establish permission guardrails is to isolate your workloads and environments into
separate AWS accounts. Principals from one account cannot access resources in another account
without explicit permission to do so, even when both accounts are in the same AWS organization
or under the same organizational unit (OU). You can use OUs to group accounts you want to
administer as a single unit.
The next step is to reduce the maximum set of permissions that you can grant to principals within
the member accounts of your organization. You can use service control policies (SCPs) for this
purpose, which you can apply to either an OU or an account. SCPs can enforce common access controls, such as restricting access to specific AWS Regions, preventing the deletion of resources, or disabling potentially risky service actions. SCPs that you apply to the root of your
organization only affect its member accounts, not the management account. SCPs only govern the
principals within your organization. Your SCPs don't govern principals outside your organization
that are accessing your resources.
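As a sketch of how such a guardrail can be codified, the following uses boto3 to create and attach a service control policy that denies actions outside approved Regions. The Region list, policy name, and OU ID are assumptions; adjust the NotAction exemptions to the global services your organization relies on.

```python
import json
import boto3

org = boto3.client("organizations")

# Example SCP: deny actions outside approved Regions, exempting global services.
scp = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyOutsideApprovedRegions",
            "Effect": "Deny",
            "NotAction": ["iam:*", "organizations:*", "sts:*", "support:*"],
            "Resource": "*",
            "Condition": {
                "StringNotEquals": {"aws:RequestedRegion": ["eu-west-1", "eu-central-1"]}
            },
        }
    ],
}

policy = org.create_policy(
    Content=json.dumps(scp),
    Description="Restrict member accounts to approved Regions",
    Name="approved-regions-scp",
    Type="SERVICE_CONTROL_POLICY",
)

org.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId="ou-examplerootid-exampleouid",  # assumed OU ID
)
```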
A further step is to use IAM resource policies to scope the available actions that you can take on
the resources they govern, along with any conditions that the acting principal must meet. This
can be as broad as allowing all actions so long as the principal is part of your organization (using
the aws:PrincipalOrgID condition key), or as granular as only allowing specific actions by a specific IAM
role. You can take a similar approach with conditions in IAM role trust policies. If a resource or role
trust policy explicitly names a principal in the same account as the role or resource it governs, that
principal does not need an attached IAM policy that grants the same permissions. If the principal is
in a different account from the resource, then the principal does need an attached IAM policy that
grants those permissions.
Often, a workload team will want to manage the permissions their workload requires. This may
require them to create new IAM roles and permission policies. You can capture the maximum scope
of permissions the team is allowed to grant in an IAM permission boundary, and associate this
document to an IAM role the team can then use to manage their IAM roles and permissions. This
approach can provide them the ability to complete their work while mitigating risks of having IAM
administrative access.
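A minimal sketch of this pattern with boto3 is shown below, assuming a hypothetical boundary policy ARN and role name. The delegated team's own permissions would additionally be restricted (for example, with an iam:PermissionsBoundary condition) so that roles can only be created with this boundary attached.

```python
import json
import boto3

iam = boto3.client("iam")

# Customer managed policy capturing the maximum permissions the team may grant.
boundary_arn = "arn:aws:iam::111122223333:policy/workload-permissions-boundary"  # assumed

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Any role the team creates carries the boundary, capping its effective permissions.
iam.create_role(
    RoleName="team-app-role",  # hypothetical role
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    PermissionsBoundary=boundary_arn,
)
```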
A more granular step is to implement privileged access management (PAM) and temporary
elevated access management (TEAM) techniques. One example of PAM is to require principals to
perform multi-factor authentication before taking privileged actions. For more information, see
Related tools:
Monitor and adjust the permissions granted to your principals (users, roles, and groups) throughout
their lifecycle within your organization. Adjust group memberships as users change roles, and
remove access when a user leaves the organization.
Desired outcome: You monitor and adjust permissions throughout the lifecycle of principals within
the organization, reducing risk of unnecessary privileges. You grant appropriate access when you
create a user. You modify access as the user's responsibilities change, and you remove access when
the user is no longer active or has left the organization. You centrally manage changes to your
users, roles, and groups. You use automation to propagate changes to your AWS environments.
Common anti-patterns:
• You grant excessive or broad access privileges to identities upfront beyond what is initially
required.
• You don't review and adjust access privileges as the roles and responsibilities of identities change
over time.
• You leave inactive or terminated identities with active access privileges. This increases the risk of
unauthorized access.
• You don't automate the management of identity lifecycles.
Implementation guidance
Carefully manage and adjust access privileges that you grant to identities (such as users, roles,
groups) throughout their lifecycle. This lifecycle includes the initial onboarding phase, ongoing
changes in roles and responsibilities, and eventual offboarding or termination. Proactively manage
access based on the stage of the lifecycle to maintain the appropriate access level. Adhere to the
principle of least privilege to reduce the risk of excessive or unnecessary access privileges.
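One way to support these lifecycle reviews is to periodically check the IAM credential report for identities that appear inactive. The following sketch flags users whose console passwords show no recorded use; the check and any offboarding action are assumptions you would adapt to your own lifecycle policy.

```python
import csv
import io
import time
import boto3

iam = boto3.client("iam")

# Generate the account credential report, then flag users whose console
# passwords appear unused.
while iam.generate_credential_report()["State"] != "COMPLETE":
    time.sleep(2)

report = iam.get_credential_report()["Content"].decode("utf-8")
for row in csv.DictReader(io.StringIO(report)):
    if row["password_enabled"] == "true" and row["password_last_used"] in ("no_information", "N/A"):
        print(f"Review console access for: {row['user']}")
```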
Resources
Related documents:
Related videos:
• AWS re:Inforce 2023 - Manage temporary elevated access with AWS IAM Identity Center
• AWS re:Invent 2022 - Simplify your existing workforce access with IAM Identity Center
• AWS re:Invent 2022 - Harness power of IAM policies & rein in permissions w/Access Analyzer
Continually monitor findings that highlight public and cross-account access. Reduce public access
and cross-account access to only the specific resources that require this access.
Desired outcome: Know which of your AWS resources are shared and with whom. Continually
monitor and audit your shared resources to verify they are shared with only authorized principals.
Common anti-patterns:
Implementation guidance
If your account is in AWS Organizations, you can grant access to resources to the entire organization, specific organizational units, or individual accounts. If your account is not a member of an organization, you can share resources with individual accounts. You can configure alerts to notify you if Amazon S3 Block Public Access is turned off and if Amazon S3 buckets become public. Additionally, if you are using
AWS Organizations, you can create a service control policy that prevents changes to Amazon
S3 public access policies. AWS Trusted Advisor checks for Amazon S3 buckets that have open
access permissions. Bucket permissions that grant upload or delete access to everyone create
potential security issues by allowing anyone to add, modify, or remove items in a bucket. The
Trusted Advisor check examines explicit bucket permissions and associated bucket policies that
might override the bucket permissions. You also can use AWS Config to monitor your Amazon S3
buckets for public access. For more information, see How to Use AWS Config to Monitor for and
Respond to Amazon S3 Buckets Allowing Public Access. While reviewing access, it’s important to
consider what types of data are contained in Amazon S3 buckets. Amazon Macie helps discover
and protect sensitive data, such as PII, PHI, and credentials, such as private or AWS keys.
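As one example of the AWS Config monitoring described above, the following sketch deploys two AWS managed Config rules that flag Amazon S3 buckets allowing public read or write access. It assumes an AWS Config recorder is already running in the account.

```python
import boto3

config = boto3.client("config")

# Deploy AWS managed rules that flag S3 buckets allowing public read or write.
for rule_name, identifier in [
    ("s3-bucket-public-read-prohibited", "S3_BUCKET_PUBLIC_READ_PROHIBITED"),
    ("s3-bucket-public-write-prohibited", "S3_BUCKET_PUBLIC_WRITE_PROHIBITED"),
]:
    config.put_config_rule(
        ConfigRule={
            "ConfigRuleName": rule_name,
            "Source": {"Owner": "AWS", "SourceIdentifier": identifier},
        }
    )
```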
Resources
Related documents:
Related videos:
As the number of workloads grows, you might need to share access to resources in those workloads or provision the resources multiple times across multiple accounts. Rather than re-creating the same resource in each account, you can create it once and share it. You can use AWS Resource Access Manager (AWS RAM) to share common resources, such as VPC subnets, AWS Transit Gateway attachments, AWS Network Firewall, or Amazon SageMaker pipelines.
To restrict your account to only share resources within your organization, use service control
policies (SCPs) to prevent access to external principals. When sharing resources, combine identity-
based controls and network controls to create a data perimeter for your organization to help
protect against unintended access. A data perimeter is a set of preventive guardrails to help verify
that only your trusted identities are accessing trusted resources from expected networks. These
controls place appropriate limits on what resources can be shared and prevent sharing or exposing
resources that should not be allowed. For example, as a part of your data perimeter, you can use
VPC endpoint policies and the aws:PrincipalOrgID condition to verify that the identities accessing your Amazon S3 buckets belong to your organization. It is important to note that SCPs do not
apply to service-linked roles or AWS service principals.
When using Amazon S3, turn off ACLs for your Amazon S3 bucket and use IAM policies to define
access control. For restricting access to an Amazon S3 origin from Amazon CloudFront, migrate
from origin access identity (OAI) to origin access control (OAC) which supports additional features
including server-side encryption with AWS Key Management Service.
In some cases, you might want to allow sharing resources outside of your organization or grant a
third party access to your resources. For prescriptive guidance on managing permissions to share
resources externally, see Permissions management.
Implementation steps
1. Use AWS Organizations.
AWS Organizations is an account management service that allows you to consolidate multiple
AWS accounts into an organization that you create and centrally manage. You can group your
accounts into organizational units (OUs) and attach different policies to each OU to help you
meet your budgetary, security, and compliance needs. You can also control how AWS artificial
intelligence (AI) and machine learning (ML) services can collect and store data, and use the
multi-account management of the AWS services integrated with Organizations.
2. Integrate AWS Organizations with AWS services.
When you use an AWS service to perform tasks on your behalf in the member accounts of your
organization, AWS Organizations creates an IAM service-linked role (SLR) for that service in each
member account. You should manage trusted access using the AWS Management Console, the AWS APIs, or the AWS CLI.
3. Share resources securely with AWS RAM.
AWS RAM helps you securely share the resources that you have created with roles and users in
your account and with other AWS accounts. In a multi-account environment, AWS RAM allows
you to create a resource once and share it with other accounts. This approach helps reduce
your operational overhead while providing consistency, visibility, and auditability through
integrations with Amazon CloudWatch and AWS CloudTrail, which you do not receive when
using cross-account access.
If you have resources that you shared previously using a resource-based policy, you can use the
PromoteResourceShareCreatedFromPolicy API or an equivalent to promote the resource
share to a full AWS RAM resource share.
In some cases, you might need to take additional steps to share resources. For example, to share an encrypted snapshot, you need to share an AWS KMS key. A sketch of creating and promoting a resource share follows these steps.
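The sketch below illustrates both operations with boto3: creating a resource share that is restricted to your organization, and promoting a share that AWS RAM derived from an existing resource-based policy. The subnet, organization, and resource share ARNs are placeholders.

```python
import boto3

ram = boto3.client("ram")

# Create a resource share restricted to principals inside your organization.
share = ram.create_resource_share(
    name="shared-network",
    resourceArns=["arn:aws:ec2:us-east-1:111122223333:subnet/subnet-0example"],
    principals=["arn:aws:organizations::111122223333:organization/o-exampleorgid"],
    allowExternalPrincipals=False,
)

# Promote a share that AWS RAM created from an existing resource-based policy
# into a full resource share.
ram.promote_resource_share_created_from_policy(
    resourceShareArn="arn:aws:ram:us-east-1:111122223333:resource-share/EXAMPLE"
)
```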
Resources
Related documents:
Related videos:
condition. When using an external ID, you or the third party can generate a unique ID for each
customer, third party, or tenancy. The unique ID should not be controlled by anyone but you after
it’s created. The third party must implement a process to relate the external ID to the customer in a
secure, auditable, and reproducible manner.
You can also use IAM Roles Anywhere to manage IAM roles for applications outside of AWS that use
AWS APIs.
If the third party no longer requires access to your environment, remove the role. Avoid providing
long-term credentials to a third party. Maintain awareness of other AWS services that support
sharing. For example, the AWS Well-Architected Tool allows sharing a workload with other AWS
accounts, and AWS Resource Access Manager helps you securely share an AWS resource you own
with other accounts.
Implementation steps
1. Use cross-account roles to provide access to external accounts.
Cross-account roles reduce the amount of sensitive information that is stored by external
accounts and third parties for servicing their customers. Cross-account roles allow you to grant
access to AWS resources in your account securely to a third party, such as AWS Partners or other
accounts in your organization, while maintaining the ability to manage and audit that access.
The third party might be providing service to you from a hybrid infrastructure or alternatively
pulling data into an offsite location. IAM Roles Anywhere helps you allow third party workloads
to securely interact with your AWS workloads and further reduce the need for long-term
credentials.
You should not use long-term credentials, or access keys associated with users, to provide
external account access. Instead, use cross-account roles to provide the cross-account access.
2. Use an external ID with third parties.
Using an external ID allows you to designate who can assume a role in an IAM trust policy.
The trust policy can require that the user assuming the role assert the condition and target in
which they are operating. It also provides a way for the account owner to permit the role to be
assumed only under specific circumstances. The primary function of the external ID is to address
and prevent the confused deputy problem.
Use an external ID if you are an AWS account owner and you have configured a role for a third party that accesses other AWS accounts in addition to yours, or when you are in the position of assuming roles on behalf of different customers (a trust policy sketch appears after these steps).
The third party should provide an automated, auditable setup mechanism. However, you should also automate the setup of the role on your side by using the role policy document that outlines the access needed. Using an AWS CloudFormation template or equivalent, monitor for changes with drift detection as part of your audit practice.
6. Account for changes.
Your account structure, your need for the third party, or their service offering being provided
might change. You should anticipate changes and failures, and plan accordingly with the right
people, process, and technology. Audit the level of access you provide on a periodic basis, and
implement detection methods to alert you to unexpected changes. Monitor and audit the use
of the role and the datastore of the external IDs. You should be prepared to revoke third-party
access, either temporarily or permanently, as a result of unexpected changes or access patterns.
Also, measure the impact of your revocation operation, including the time it takes to perform, the people involved, the cost, and the impact on other resources.
For prescriptive guidance on detection methods, see the Detection best practices.
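To make step 2 concrete, the following sketch creates a cross-account role whose trust policy requires the agreed external ID. The third-party account ID, external ID value, and role name are placeholders; permissions policies granting the actual access would be attached separately.

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the third party's account assume the role only when it
# presents the agreed external ID.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::444455556666:root"},  # assumed third-party account
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": "unique-customer-id-12345"}},
        }
    ],
}

iam.create_role(
    RoleName="third-party-access-role",  # hypothetical role
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
```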
Resources
Related documents:
Retain security event logs from services and applications. This is a fundamental principle of
security for audit, investigations, and operational use cases, and a common security requirement
driven by governance, risk, and compliance (GRC) standards, policies, and procedures.
Desired outcome: An organization should be able to reliably and consistently retrieve security
event logs from AWS services and applications in a timely manner when required to fulfill an
internal process or obligation, such as a security incident response. Consider centralizing logs for
better operational results.
Common anti-patterns:
Benefits of establishing this best practice: You have a root cause analysis (RCA) mechanism for security incidents and a source of evidence for your governance, risk, and compliance obligations.
Implementation guidance
During a security investigation or other use cases based on your requirements, you need to be able
to review relevant logs to record and understand the full scope and timeline of the incident. Logs
are also required for alert generation, indicating that certain actions of interest have happened. It
is critical to select, turn on, store, and set up querying and retrieval mechanisms and alerting.
Implementation steps
• Select and use log sources. Ahead of a security investigation, you need to capture relevant
logs to retroactively reconstruct activity in an AWS account. Select log sources relevant to your
workloads.
The log source selection criteria should be based on the use cases required by your business.
Establish a trail for each AWS account using AWS CloudTrail or an AWS Organizations trail, and configure an Amazon S3 bucket for it (a configuration sketch follows this list).
An Amazon S3 bucket provides cost-effective, durable storage with an optional lifecycle policy.
Logs stored in Amazon S3 buckets can be queried using services such as Amazon Athena.
A CloudWatch log group provides durable storage and a built-in query facility through
CloudWatch Logs Insights.
• Identify appropriate log retention: When you use an Amazon S3 bucket or CloudWatch log
group to store logs, you must establish adequate lifecycles for each log source to optimize
storage and retrieval costs. Customers generally keep between three months and one year of logs
readily available for querying, with retention of up to seven years. The choice of availability and
retention should align with your security requirements and a composite of statutory, regulatory,
and business mandates.
• Use logging for each AWS service and application with proper retention and lifecycle
policies: For each AWS service or application in your organization, look for the specific logging
configuration guidance:
• Configure AWS CloudTrail Trail
• Configure VPC Flow Logs
• Configure Amazon GuardDuty Finding Export
• Configure AWS Config recording
• Configure AWS WAF web ACL traffic
• Configure AWS Network Firewall network traffic logs
• Configure Elastic Load Balancing access logs
• Configure Amazon Route 53 resolver query logs
• Configure Amazon RDS logs
• Configure Amazon EKS Control Plane logs
• Configure Amazon CloudWatch agent for Amazon EC2 instances and on-premises servers
• Select and implement querying mechanisms for logs: For log queries, you can use CloudWatch
Logs Insights for data stored in CloudWatch log groups, and Amazon Athena and Amazon
OpenSearch Service for data stored in Amazon S3. You can also use third-party querying tools
such as a security information and event management (SIEM) service.
The process for selecting a log querying tool should consider the people, process, and
technology aspects of your security operations. Select a tool that fulfills operational, business,
and security requirements, and is both accessible and maintainable in the long term. Keep in mind that log querying tools work optimally when the number of logs to be scanned is kept within the tool's limits.
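The sketch below, referenced in the steps above, shows one way to establish an organization trail and apply retention to the log stores with boto3. The trail name, bucket names, log group name, and retention periods are assumptions; align them with your own statutory and business mandates.

```python
import boto3

cloudtrail = boto3.client("cloudtrail")
logs = boto3.client("logs")
s3 = boto3.client("s3")

# Multi-Region organization trail delivering to a central bucket that already
# exists with an appropriate bucket policy.
cloudtrail.create_trail(
    Name="org-trail",
    S3BucketName="central-cloudtrail-logs",
    IsMultiRegionTrail=True,
    IsOrganizationTrail=True,
    EnableLogFileValidation=True,
)
cloudtrail.start_logging(Name="org-trail")

# Keep an application log group for one year.
logs.put_retention_policy(logGroupName="/workload/app", retentionInDays=365)

# Transition S3-stored logs to Glacier after 90 days and expire them after
# roughly seven years.
s3.put_bucket_lifecycle_configuration(
    Bucket="central-cloudtrail-logs",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "log-retention",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 2555},
            }
        ]
    },
)
```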
Related videos:
Related examples:
Related tools:
Security teams rely on logs and findings to analyze events that may indicate unauthorized
activity or unintentional changes. To streamline this analysis, capture security logs and findings
in standardized locations. This makes data points of interest available for correlation and can
simplify tool integrations.
Desired outcome: You have a standardized approach to collect, analyze, and visualize log data,
findings, and metrics. Security teams can efficiently correlate, analyze, and visualize security data
across disparate systems to discover potential security events and identify anomalies. Security
information and event management (SIEM) systems or other mechanisms are integrated to query
and analyze log data for timely responses, tracking, and escalation of security events.
Common anti-patterns:
• Teams independently own and manage logging and metrics collection in ways that are inconsistent with the organization's logging strategy.
• Teams don't have adequate access controls to restrict visibility and alteration of the data
collected.
Amazon Security Lake can centralize findings from AWS services, such as Amazon GuardDuty and Amazon Inspector, along with your log data. You can also use third-party data source integrations, or configure custom data sources. All integrations standardize your data into the Open Cybersecurity Schema Framework (OCSF) format, and the data is stored in Amazon S3 buckets as Parquet files, eliminating the need for ETL processing.
Storing security data in standardized locations provides advanced analytics capabilities. AWS
recommends you deploy tools for security analytics that operate in an AWS environment into a
Security Tooling account that is separate from your Log Archive account. This approach allows
you to implement controls at depth to protect the integrity and availability of the logs and log
management process, distinct from the tools that access them. Consider using services, such as
Amazon Athena, to run on-demand queries that correlate multiple data sources. You can also
integrate visualization tools, such as Amazon QuickSight. AI-powered solutions are becoming
increasingly available and can perform functions such as translating findings into human-readable
summaries and natural language interaction. These solutions are often more readily integrated by
having a standardized data storage location for querying.
Implementation steps
Unexpected activity can generate multiple security alerts by different sources, requiring further
correlation and enrichment to understand the full context. Implement automated correlation and
enrichment of security alerts to help achieve more accurate incident identification and response.
Desired outcome: As activity generates different alerts within your workloads and environments,
automated mechanisms correlate data and enrich that data with additional information. This pre-
processing presents a more detailed understanding of the event, which helps your investigators
determine the criticality of the event and if it constitutes an incident that requires formal response.
This process reduces the load on your monitoring and investigation teams.
Common anti-patterns:
• Different groups of people investigate findings and alerts generated by different systems, unless
otherwise mandated by separation of duty requirements.
• Your organization funnels all security finding and alert data to standard locations, but requires
investigators to perform manual correlation and enrichment.
• You rely solely on the intelligence of threat detection systems to report on findings and establish
criticality.
Benefits of establishing this best practice: Automated correlation and enrichment of alerts helps
to reduce the overall cognitive load and manual data preparation required of your investigators.
This practice can reduce the time it takes to determine if the event represents an incident and
initiate a formal response. Additional context also helps you accurately assess the true severity of
an event, as it may be higher or lower than what any one alert suggests.
Implementation guidance
Security alerts can come from many different sources within AWS, including:
3. Identify sources for data correlation and enrichment. Example sources include CloudTrail, VPC
Flow Logs, Amazon Security Lake, and infrastructure and application logs.
4. Integrate your alerts with your data correlation and enrichment sources to create more detailed
security event contexts and establish criticality.
a. Amazon Detective, SIEM tooling, or other third-party solutions can perform a certain level of
ingestion, correlation, and enrichment automatically.
b. You can also use AWS services to build your own. For example, you can invoke an AWS
Lambda function to run an Amazon Athena query against AWS CloudTrail or Amazon Security
Lake, and publish the results to EventBridge.
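A minimal sketch of step 4b follows. It assumes an Athena table over CloudTrail data, an Athena workgroup, a results bucket, and a custom event bus, all of which are hypothetical names; production code would also poll the query status rather than sleep.

```python
import json
import time
import boto3

athena = boto3.client("athena")
events = boto3.client("events")

def handler(event, context):
    # Pull the acting principal from the incoming alert; the event shape
    # depends on your alert source and is an assumption here.
    principal = event.get("detail", {}).get("userIdentity", {}).get("arn", "unknown")

    # Correlate: look up recent CloudTrail activity for the same principal.
    query = athena.start_query_execution(
        QueryString=(
            "SELECT eventname, sourceipaddress, eventtime "
            "FROM cloudtrail_logs "
            f"WHERE useridentity.arn = '{principal}' "
            "ORDER BY eventtime DESC LIMIT 50"
        ),
        WorkGroup="security-analytics",
        ResultConfiguration={"OutputLocation": "s3://athena-results-bucket/"},
    )

    # Simplified wait; production code should poll get_query_execution instead.
    time.sleep(10)
    rows = athena.get_query_results(QueryExecutionId=query["QueryExecutionId"])

    # Enrich: publish a summarized context event for downstream triage.
    events.put_events(
        Entries=[
            {
                "Source": "custom.security.enrichment",
                "DetailType": "EnrichedFinding",
                "Detail": json.dumps(
                    {"principal": principal, "recentEvents": len(rows["ResultSet"]["Rows"]) - 1}
                ),
                "EventBusName": "security-events",
            }
        ]
    )
```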
Resources
Related documents:
Related examples:
• How to use AWS Security Hub and Amazon OpenSearch Service for SIEM
Related tools:
• Amazon Detective
• Amazon EventBridge
• AWS Lambda
• Amazon Athena
Implementation guidance
As described in SEC01-BP03 Identify and validate control objectives, services such as AWS Config
can help you monitor the configuration of resources in your accounts for adherence to your
requirements. When non-compliant resources are detected, we recommend that you configure
sending alerts to a cloud security posture management (CSPM) solution, such as AWS Security Hub,
to help with remediation. These solutions provide a central place for your security investigators to
monitor for issues and take corrective action.
While some non-compliant resource situations are unique and require human judgment to
remediate, other situations have a standard response that you can define programmatically. For
example, a standard response to a misconfigured VPC security group could be to remove the
disallowed rules and notify the owner. Responses can be defined in AWS Lambda functions, AWS
Systems Manager Automation documents, or through other code environments you prefer. Make
sure the environment is able to authenticate to AWS using an IAM role with the least amount of
permission needed to take corrective action.
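As one sketch of such a programmatic response, the following Lambda handler removes a world-open SSH rule from a reported security group and notifies the owning team. The event shape, port, and SNS topic are assumptions; adapt the parsing to your actual finding source.

```python
import boto3

ec2 = boto3.client("ec2")
sns = boto3.client("sns")

DISALLOWED_PORT = 22  # assumed disallowed rule
TOPIC_ARN = "arn:aws:sns:us-east-1:111122223333:security-notifications"  # assumed topic

def handler(event, context):
    # The finding payload shape depends on your detection source; this parsing
    # is a placeholder.
    group_id = event["detail"]["resource"]["id"]

    # Remove the 0.0.0.0/0 ingress rule for the disallowed port.
    ec2.revoke_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[
            {
                "IpProtocol": "tcp",
                "FromPort": DISALLOWED_PORT,
                "ToPort": DISALLOWED_PORT,
                "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
            }
        ],
    )

    # Notify the owning team about the corrective action taken.
    sns.publish(
        TopicArn=TOPIC_ARN,
        Subject="Security group remediated",
        Message=f"Removed 0.0.0.0/0 port {DISALLOWED_PORT} ingress from {group_id}.",
    )
```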
Once you define the desired remediation, you can then determine your preferred means for
initiating it. AWS Config can initiate remediations for you. If you are using Security Hub, you can
do this through custom actions, which publishes the finding information to Amazon EventBridge.
An EventBridge rule can then initiate your remediation. You can configure the custom action in
Security Hub to run either automatically or manually.
For programmatic remediation, we recommend that you have comprehensive logs and audits
for the actions taken, as well as their outcomes. Review and analyze these logs to assess the
effectiveness of the automated processes, and identify areas of improvement. Capture logs in
Amazon CloudWatch Logs and remediation outcomes as finding notes in Security Hub.
As a starting point, consider Automated Security Response on AWS, which has pre-built
remediations for resolving common security misconfigurations.
Implementation steps
Any workload that has some form of network connectivity, whether it’s the internet or a private
network, requires multiple layers of defense to help protect from external and internal network-
based threats.
Best practices
• SEC05-BP01 Create network layers
• SEC05-BP02 Control traffic flow within your network layers
• SEC05-BP03 Implement inspection-based protection
• SEC05-BP04 Automate network protection
Segment your network topology into different layers based on logical groupings of your workload
components according to their data sensitivity and access requirements. Distinguish between
components that require inbound access from the internet, such as public web endpoints, and
those that only need internal access, such as databases.
Desired outcome: The layers of your network are part of an integral defense-in-depth approach
to security that complements the identity authentication and authorization strategy of your
workloads. Layers are in place according to data sensitivity and access requirements, with
appropriate traffic flow and control mechanisms.
Common anti-patterns:
Benefits of establishing this best practice: Establishing network layers is the first step in
restricting unnecessary pathways through the network, particularly those that lead to critical
systems and data. This makes it harder for unauthorized actors to gain access to your network and
navigate to additional resources within. Discrete network layers also reduce the scope of impact if unauthorized access does occur. Deploy your resources into private VPC subnets unless there are specific reasons not to. Determine where VPC endpoints and AWS PrivateLink can simplify adhering to security policies that limit access to internet gateways.
Implementation steps
1. Review your workload architecture. Logically group components and services based on the
functions they serve, the sensitivity of data being processed, and their behavior.
2. For components responding to requests from the internet, consider using load balancers or
other proxies to provide public endpoints. Explore shifting security controls by using managed
services, such as CloudFront, Amazon API Gateway, Elastic Load Balancing, and AWS Amplify to
host public endpoints.
3. For components running in compute environments, such as Amazon EC2 instances, AWS Fargate
containers, or Lambda functions, deploy these into private subnets based on your groups from
the first step.
4. For fully managed AWS services, such as Amazon DynamoDB, Amazon Kinesis, or Amazon SQS,
consider using VPC endpoints as the default for access over private IP addresses.
Resources
Related videos:
Related examples:
• VPC examples
• Access container applications privately on Amazon ECS by using AWS Fargate, AWS PrivateLink,
and a Network Load Balancer
• Serve static content in an Amazon S3 bucket through a VPC by using Amazon CloudFront
further control using additional services, such as AWS PrivateLink, Amazon Route 53 Resolver DNS
Firewall, AWS Network Firewall, and AWS WAF.
Understand and inventory the data flow and communication requirements of your workloads in
terms of connection-initiating parties, ports, protocols, and network layers. Evaluate the protocols
available for establishing connections and transmitting data to select ones that achieve your
protection requirements (for example, HTTPS rather than HTTP). Capture these requirements
at both the boundaries of your networks and within each layer. Once these requirements are
identified, explore options to only allow the required traffic to flow at each connection point. A
good starting point is to use security groups within your VPC, as they can be attached to resources that use an elastic network interface (ENI), such as Amazon EC2 instances, Amazon ECS tasks, Amazon EKS pods, or Amazon RDS databases. Unlike a Layer 4 firewall, a security group can have a rule that allows traffic from another security group by its identifier, minimizing updates as resources within the group change over time. You can also filter traffic with both inbound and outbound security group rules.
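The following sketch shows the security-group-referencing pattern with boto3, assuming hypothetical group IDs for an application tier and a database tier.

```python
import boto3

ec2 = boto3.client("ec2")

app_sg = "sg-0123exampleapp"  # assumed application tier security group
db_sg = "sg-0456exampledb"    # assumed database tier security group

# Allow the database tier to accept MySQL traffic only from members of the
# application security group, referenced by ID rather than by IP range.
ec2.authorize_security_group_ingress(
    GroupId=db_sg,
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 3306,
            "ToPort": 3306,
            "UserIdGroupPairs": [{"GroupId": app_sg}],
        }
    ],
)
```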
When traffic moves between VPCs, it's common to use VPC peering for simple routing or the
AWS Transit Gateway for complex routing. With these approaches, you facilitate traffic flows
between the range of IP addresses of both the source and destination networks. However, if your
workload only requires traffic flows between specific components in different VPCs, consider
using a point-to-point connection using AWS PrivateLink. To do this, identify which service should
act as the producer and which should act as the consumer. Deploy a compatible load balancer
for the producer, turn on PrivateLink accordingly, and then accept a connection request by the
consumer. The producer service is then assigned a private IP address from the consumer's VPC
that the consumer can use to make subsequent requests. This approach reduces the need to
peer the networks. Include the costs for data processing and load balancing as part of evaluating
PrivateLink.
While security groups and PrivateLink help control the flow between the components of your
workloads, another major consideration is how to control which DNS domains your resources are
allowed to access (if any). Depending on the DHCP configuration of your VPCs, you can consider
two different AWS services for this purpose. Most customers use the default Route 53 Resolver
DNS service (also called Amazon DNS server or AmazonProvidedDNS) available to VPCs at the +2
address of its CIDR range. With this approach, you can create DNS Firewall rules and associate them
to your VPC that determine what actions to take for the domain lists you supply.
If you are not using the Route 53 Resolver, or if you want to complement the Resolver with deeper inspection and flow control capabilities beyond domain filtering, consider deploying an AWS Network Firewall.
• Application Acceleration and Protection with Amazon CloudFront, AWS WAF, and AWS Shield
• AWS re:Inforce 2023: Firewalls and where to put them
Related examples:
Set up traffic inspection points between your network layers to make sure data in transit matches
the expected categories and patterns. Analyze traffic flows, metadata, and patterns to help
identify, detect, and respond to events more effectively.
Desired outcome: Traffic that traverses between your network layers is inspected and authorized.
Allow and deny decisions are based on explicit rules, threat intelligence, and deviations from
baseline behaviors. Protections become stricter as traffic moves closer to sensitive data.
Common anti-patterns:
• Relying solely on firewall rules based on ports and protocols. Not taking advantage of intelligent
systems.
• Authoring firewall rules based on specific current threat patterns that are subject to change.
• Only inspecting traffic where traffic transits from private to public subnets, or from public
subnets to the Internet.
• Not having a baseline view of your network traffic to compare for behavior anomalies.
Benefits of establishing this best practice: Inspection systems allow you to author intelligent
rules, such as allowing or denying traffic only when certain conditions within the traffic data exist.
Benefit from managed rule sets from AWS and partners, based on the latest threat intelligence,
as the threat landscape changes over time. This reduces the overhead of maintaining rules and
researching indicators of compromise, reducing the potential for false positives.
Implementation guidance
Have fine-grained control over both your stateful and stateless network traffic using AWS Network Firewall, or other firewalls and intrusion prevention systems (IPS) on AWS Marketplace that you can deploy behind a Gateway Load Balancer (GWLB).
a. To configure AWS WAF, start by configuring a web access control list (web ACL). The web ACL is a collection of serially-processed rules and a default action (ALLOW or DENY) that defines how your web ACL handles traffic. You can create your own rules and rule groups or use AWS managed rule groups in your web ACL.
b. Once your web ACL is configured, associate the web ACL with an AWS resource (like an
Application Load Balancer, API Gateway REST API, or CloudFront distribution) to begin
protecting web traffic.
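The following sketch performs both steps with boto3, assuming a regional web ACL protecting a hypothetical Application Load Balancer and using one AWS managed rule group.

```python
import boto3

wafv2 = boto3.client("wafv2")

# Create a regional web ACL that applies an AWS managed rule group.
acl = wafv2.create_web_acl(
    Name="app-web-acl",
    Scope="REGIONAL",
    DefaultAction={"Allow": {}},
    VisibilityConfig={
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "app-web-acl",
    },
    Rules=[
        {
            "Name": "common-rule-set",
            "Priority": 0,
            "Statement": {
                "ManagedRuleGroupStatement": {
                    "VendorName": "AWS",
                    "Name": "AWSManagedRulesCommonRuleSet",
                }
            },
            "OverrideAction": {"None": {}},
            "VisibilityConfig": {
                "SampledRequestsEnabled": True,
                "CloudWatchMetricsEnabled": True,
                "MetricName": "common-rule-set",
            },
        }
    ],
)

# Associate the web ACL with an Application Load Balancer (ARN is assumed).
wafv2.associate_web_acl(
    WebACLArn=acl["Summary"]["ARN"],
    ResourceArn="arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/app/example/1234567890abcdef",
)
```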
Resources
Related documents:
• Centralized inspection architecture with AWS Gateway Load Balancer and AWS Transit Gateway
Related examples:
• TLS inspection configuration for encrypted egress traffic and AWS Network Firewall
Related tools:
rules systems that can update automatically based on the latest threat intelligence. Examples
of protecting your web endpoints include AWS WAF managed rules and AWS Shield Advanced
automatic application layer DDoS mitigation. Use AWS Network Firewall managed rule groups to
stay up to date with low-reputation domain lists and threat signatures as well.
Beyond managed rules, we recommend you use DevOps practices to automate deploying your
network resources, protections, and the rules you specify. You can capture these definitions in
AWS CloudFormation or another infrastructure as code (IaC) tool of your choice, commit them
to a version control system, and deploy them using CI/CD pipelines. Use this approach to gain
the traditional benefits of DevOps for managing your network controls, such as more predictable
releases, automated testing using tools like AWS CloudFormation Guard, and detecting drift
between your deployed environment and your desired configuration.
Based on the decisions you made as part of SEC05-BP01 Create network layers, you may have
a central management approach to creating VPCs that are dedicated for ingress, egress, and
inspection flows. As described in the AWS Security Reference Architecture (AWS SRA), you can
define these VPCs in a dedicated Network infrastructure account. You can use similar techniques
to centrally define the VPCs used by your workloads in other accounts, their security groups, AWS
Network Firewall deployments, Route 53 Resolver rules and DNS Firewall configurations, and other
network resources. You can share these resources with your other accounts with the AWS Resource
Access Manager. With this approach, you can simplify the automated testing and deployment of
your network controls to the Network account, with only one destination to manage. You can do
this in a hybrid model, where you deploy and share certain controls centrally and delegate other
controls to the individual workload teams and their respective accounts.
Implementation steps
1. Establish ownership over which aspects of the network and protections are defined centrally,
and which your workload teams can maintain.
2. Create environments to test and deploy changes to your network and its protections. For
example, use a Network Testing account and a Network Production account.
3. Determine how you will store and maintain your templates in a version control system. Store
central templates in a repository that is distinct from workload repositories, while workload
templates can be stored in repositories specific to that workload.
4. Create CI/CD pipelines to test and deploy templates. Define tests to check for misconfigurations
and that templates adhere to your company standards.
Frequently scan and patch for vulnerabilities in your code, dependencies, and in your infrastructure
to help protect against new threats.
Desired outcome: Create and maintain a vulnerability management program. Regularly scan
and patch resources such as Amazon EC2 instances, Amazon Elastic Container Service (Amazon
ECS) containers, and Amazon Elastic Kubernetes Service (Amazon EKS) workloads. Configure
maintenance windows for AWS managed resources, such as Amazon Relational Database Service
(Amazon RDS) databases. Use static code scanning to inspect application source code for common
issues. Consider web application penetration testing if your organization has the requisite skills or
can hire outside assistance.
Common anti-patterns:
Implementation guidance
• Use AWS Systems Manager: You are responsible for patch management for your AWS resources,
including Amazon Elastic Compute Cloud (Amazon EC2) instances, Amazon Machine Images
(AMIs), and other compute resources. AWS Systems Manager Patch Manager automates the
process of patching managed instances with both security related and other types of updates.
Patch Manager can be used to apply patches on Amazon EC2 instances for both operating
systems and applications, including Microsoft applications, Windows service packs, and minor
version upgrades for Linux based instances. In addition to Amazon EC2, Patch Manager can also
be used to patch on-premises servers.
For a list of supported operating systems, see Supported operating systems in the Systems
Manager User Guide. You can scan instances to see only a report of missing patches, or you can scan and automatically install all missing patches (a scan command sketch follows this list).
• Use AWS Security Hub: Security Hub provides a comprehensive view of your security state in
AWS. It collects security data across multiple AWS services and provides those findings in a
standardized format, allowing you to prioritize security findings across AWS services.
• Use AWS CloudFormation: AWS CloudFormation is an infrastructure as code (IaC) service that
can help with vulnerability management by automating resource deployment and standardizing
resource architecture across multiple accounts and environments.
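The scan referenced in the Systems Manager item above can be run on demand, as in the sketch below. The patch group tag value is an assumption; change the Operation parameter to Install to apply the missing patches.

```python
import boto3

ssm = boto3.client("ssm")

# Run a patch compliance scan across instances tagged with a patch group.
ssm.send_command(
    Targets=[{"Key": "tag:PatchGroup", "Values": ["web-servers"]}],
    DocumentName="AWS-RunPatchBaseline",
    Parameters={"Operation": ["Scan"]},
    Comment="Scheduled patch compliance scan",
)
```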
Resources
Related documents:
• Improved, Automated Vulnerability Management for Cloud Workloads with a New Amazon
Inspector
• Automate vulnerability management and remediation in AWS using Amazon Inspector and AWS
Systems Manager – Part 1
Related videos:
• Security best practices for the Amazon EC2 instance metadata service
You can reduce the burden of hardening systems by using guidance that trusted sources provide,
such as the Center for Internet Security (CIS) and the Defense Information Systems Agency (DISA)
Security Technical Implementation Guides (STIGs). We recommend you start with an Amazon
Machine Image (AMI) published by AWS or an APN Partner, and use EC2 Image Builder to automate configuration according to an appropriate combination of CIS and STIG controls.
While there are available hardened images and EC2 Image Builder recipes that apply the CIS
or DISA STIG recommendations, you may find their configuration prevents your software from
running successfully. In this situation, you can start from a non-hardened base image, install your
software, and then incrementally apply CIS controls to test their impact. For any CIS control that
prevents your software from running, test whether you can implement the finer-grained hardening recommendations of a DISA STIG instead. Keep track of the different CIS controls and DISA STIG
configurations you are able to apply successfully. Use these to define your image hardening recipes
in EC2 Image Builder accordingly.
For containerized workloads, hardened images from Docker are available on the Amazon Elastic
Container Registry (ECR) public repository. You can use EC2 Image Builder to harden container
images alongside AMIs.
Similar to operating systems and container images, you can obtain code packages (or libraries)
from public repositories, through tooling such as pip, npm, Maven, and NuGet. We recommend you
manage code packages by integrating private repositories, such as within AWS CodeArtifact, with
trusted public repositories. This integration can handle retrieving, storing, and keeping packages
up-to-date for you. Your application build processes can then obtain and test the latest version of
these packages alongside your application, using techniques like Software Composition Analysis
(SCA), Static Application Security Testing (SAST), and Dynamic Application Security Testing (DAST).
For serverless workloads that use AWS Lambda, simplify managing package dependencies using
Lambda layers. Use Lambda layers to configure a set of standard dependencies that are shared
across different functions into a standalone archive. You can create and maintain layers through
their own build process, providing a central way for your functions to stay up-to-date.
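A minimal sketch of this layer workflow is shown below, assuming the layer archive has already been built and uploaded to a hypothetical S3 location, and that the target function name is illustrative.

```python
import boto3

lambda_client = boto3.client("lambda")

# Publish shared dependencies as a layer from a pre-built archive in S3.
layer = lambda_client.publish_layer_version(
    LayerName="shared-dependencies",
    Description="Common third-party packages, rebuilt by the layer pipeline",
    Content={"S3Bucket": "build-artifacts-bucket", "S3Key": "layers/shared-deps.zip"},
    CompatibleRuntimes=["python3.12"],
)

# Attach the layer so the function picks up the common packages.
lambda_client.update_function_configuration(
    FunctionName="orders-api",  # hypothetical function
    Layers=[layer["LayerVersionArn"]],
)
```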
Implementation steps
• Harden operating systems. Use base images from trusted sources as a foundation for building
your hardened AMIs. Use EC2 Image Builder to help customize the software installed on your
images.
Common anti-patterns:
• Interactive access to Amazon EC2 instances with protocols such as SSH or RDP.
Benefits of establishing this best practice: Performing actions with automation helps you to
reduce the operational risk of unintended changes and misconfigurations. Removing the use of
Secure Shell (SSH) and Remote Desktop Protocol (RDP) for interactive access reduces the scope
of access to your compute resources. This takes away a common path for unauthorized actions.
Capturing your compute resource management tasks in automation documents and programmatic
scripts provides a mechanism to define and audit the full scope of authorized activities at a fine-
grained level of detail.
Implementation guidance
Logging into an instance is a classic approach to system administration. After installing the server
operating system, users would typically log in manually to configure the system and install the
desired software. During the server's lifetime, users might log in to perform software updates,
apply patches, change configurations, and troubleshoot problems.
Manual access poses a number of risks, however. It requires a server that listens for requests, such
as an SSH or RDP service, that can provide a potential path to unauthorized access. It also increases
the risk of human error associated with performing manual steps. These can result in workload
incidents, data corruption or destruction, or other security issues. Human access also requires
protections against the sharing of credentials, creating additional management overhead.
To mitigate these risks, you can implement an agent-based remote access solution, such as AWS
Systems Manager. AWS Systems Manager Agent (SSM Agent) initiates an encrypted channel and
thus does not rely on listening for externally-initiated requests. Consider configuring SSM Agent to
establish this channel over a VPC endpoint.
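The following sketch creates the interface endpoints commonly used by SSM Agent (ssm, ssmmessages, and ec2messages), assuming hypothetical VPC, subnet, and security group IDs in us-east-1.

```python
import boto3

ec2 = boto3.client("ec2")

vpc_id = "vpc-0example"                          # assumed VPC
subnet_ids = ["subnet-0example1", "subnet-0example2"]  # assumed private subnets
endpoint_sg = "sg-0exampleendpoints"             # assumed endpoint security group

# Interface endpoints that let SSM Agent reach Systems Manager without
# traversing the internet.
for service in ("ssm", "ssmmessages", "ec2messages"):
    ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId=vpc_id,
        ServiceName=f"com.amazonaws.us-east-1.{service}",
        SubnetIds=subnet_ids,
        SecurityGroupIds=[endpoint_sg],
        PrivateDnsEnabled=True,
    )
```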
Related tools:
Related videos:
• Controlling User Session Access to Instances in AWS Systems Manager Session Manager
Use cryptographic verification to validate the integrity of software artifacts (including images) your
workload uses. Cryptographically sign your software as a safeguard against unauthorized changes being run within your compute environments.
Desired outcome: All artifacts are obtained from trusted sources. Vendor website certificates
are validated. Downloaded artifacts are cryptographically verified by their signatures. Your own
software is cryptographically signed and verified by your computing environments.
Common anti-patterns:
• Trusting reputable vendor websites to obtain software artifacts, but ignoring certificate
expiration notices. Proceeding with downloads without confirming certificates are valid.
• Validating vendor website certificates, but not cryptographically verifying downloaded artifacts
from these websites.
• Relying solely on digests or hashes to validate software integrity. Hashes establish that artifacts
have not been modified from the original version, but do not validate their source.
• Not signing your own software, code, or libraries, even when only used in your own
deployments.
Benefits of establishing this best practice: Validating the integrity of artifacts that your workload
depends on helps prevent malware from entering your compute environments. Signing your
software helps safeguard against unauthorized running in your compute environments. Secure
your software supply chain by signing and verifying code.
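As one sketch of enforcing signing for AWS Lambda, the following creates a code signing configuration tied to an AWS Signer profile and applies it to a function. The signing profile version ARN and function name are placeholders.

```python
import boto3

lambda_client = boto3.client("lambda")

# Enforce that only code signed by the approved AWS Signer profile can be deployed.
csc = lambda_client.create_code_signing_config(
    AllowedPublishers={
        "SigningProfileVersionArns": [
            "arn:aws:signer:us-east-1:111122223333:/signing-profiles/release_profile/abcd1234"  # assumed
        ]
    },
    CodeSigningPolicies={"UntrustedArtifactOnDeployment": "Enforce"},
)

# Apply the code signing configuration to a function.
lambda_client.put_function_code_signing_config(
    CodeSigningConfigArn=csc["CodeSigningConfig"]["CodeSigningConfigArn"],
    FunctionName="orders-api",  # hypothetical function
)
```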
Related examples:
• Automate Lambda code signing with Amazon CodeCatalyst and AWS Signer
• Signing and Validating OCI Artifacts with AWS Signer
Related tools:
• AWS Lambda
• AWS Signer
• AWS Certificate Manager
• AWS Key Management Service
• AWS CodeArtifact
Automate compute protection operations to reduce the need for human intervention. Use
automated scanning to detect potential issues within your compute resources, and remediate with
automated programmatic responses or fleet management operations. Incorporate automation in
your CI/CD processes to deploy trustworthy workloads with up-to-date dependencies.
Desired outcome: Automated systems perform all scanning and patching of compute resources.
You use automated verification to check that software images and dependencies come from
trusted sources, and have not been tampered with. Workloads are automatically checked for up-
to-date dependencies, and are signed to establish trustworthiness in AWS compute environments.
Automated remediations are initiated when non-compliant resources are detected.
Common anti-patterns:
• Following the practice of immutable infrastructure, but not having a solution in place for
emergency patching or replacement of production systems.
• Using automation to fix misconfigured resources, but not having a manual override mechanism
in place. Situations may arise where you need to adjust the requirements, and you may need to
suspend automations until you make these changes.
Benefits of establishing this best practice: Automation can reduce the risk of unauthorized access
and use of your compute resources. It helps to prevent misconfigurations from making their way into production environments, and to detect and fix misconfigurations should they occur.
1. Automate creating hardened images. Use EC2 Image Builder to produce images hardened to Center for Internet Security (CIS) Benchmarks or Security Technical Implementation Guide (STIG) standards from base AWS and APN Partner images.
2. Automate configuration management. Enforce and validate secure configurations in your
compute resources automatically by using a configuration management service or tool.
a. Automated configuration management using AWS Config
b. Automated security and compliance posture management using AWS Security Hub
3. Automate patching or replacing Amazon Elastic Compute Cloud (Amazon EC2) instances. AWS
Systems Manager Patch Manager automates the process of patching managed instances with
both security-related and other types of updates. You can use Patch Manager to apply patches
for both operating systems and applications.
a. AWS Systems Manager Patch Manager
4. Automate scanning of compute resources for common vulnerabilities and exposures (CVEs), and
embed security scanning solutions within your build pipeline.
a. Amazon Inspector
b. ECR Image Scanning
5. Consider Amazon GuardDuty for automatic malware and threat detection to protect compute
resources. GuardDuty can also identify potential issues when an AWS Lambda function gets
invoked in your AWS environment.
a. Amazon GuardDuty
6. Consider AWS Partner solutions. AWS Partners offer industry-leading products that are equivalent or identical to, or integrate with, existing controls in your on-premises environments.
These products complement the existing AWS services to allow you to deploy a comprehensive
security architecture and a more seamless experience across your cloud and on-premises
environments.
a. Infrastructure security
Resources
Related documents:
• Get the full benefits of IMDSv2 and disable IMDSv1 across your AWS infrastructure
Common anti-patterns:
• Not having a formal data classification policy in place to define data sensitivity levels and their
handling requirements
• Not having a good understanding of the sensitivity levels of data within your workload, and not
capturing this information in architecture and operations documentation
• Failing to apply the appropriate controls around your data based on its sensitivity and
requirements, as outlined in your data classification and handling policy
• Failing to provide feedback about data classification and handling requirements to owners of the
policies.
Benefits of establishing this best practice: This practice removes ambiguity around the
appropriate handling of data within your workload. Applying a formal policy that defines the
sensitivity levels of data in your organization and their required protections can help you comply
with legal regulations and other cybersecurity attestations and certifications. Workload owners
can have confidence in knowing where sensitive data is stored and what protection controls are in
place. Capturing these in documentation helps new team members better understand them and
maintain controls early in their tenure. These practices can also help reduce costs by right sizing the
controls for each type of data.
Implementation guidance
When designing a workload, you may be considering ways to protect sensitive data intuitively. For
example, in a multi-tenant application, it is intuitive to think of each tenant's data as sensitive and
put protections in place so that one tenant can't access the data of another tenant. Likewise, you
may intuitively design access controls so only administrators can modify data while other users
have only read-level access or no access at all.
By having these data sensitivity levels defined and captured in policy, along with their data
protection requirements, you can formally identify what data resides in your workload. You can
then determine if the right controls are in place, if the controls can be audited, and what responses
are appropriate if data is found to be mishandled.
To help with categorizing where sensitive data is present within your workload, consider
using resource tags where available. For example, you can apply a tag that has a tag key of
Classification and a tag value of PHI for protected health information (PHI), and another tag that has a tag key of Sensitivity and a tag value of High.
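As a minimal sketch of this tagging approach, the following hypothetical boto3 example applies Classification and Sensitivity tags to an S3 bucket; the bucket name and tag values are assumptions for illustration.

import boto3

s3 = boto3.client("s3")

# Tag a bucket so that automation and auditors can identify where PHI is stored.
# Note: put_bucket_tagging replaces the bucket's existing tag set.
s3.put_bucket_tagging(
    Bucket="example-patient-records",  # hypothetical bucket name
    Tagging={
        "TagSet": [
            {"Key": "Classification", "Value": "PHI"},
            {"Key": "Sensitivity", "Value": "High"},
        ]
    },
)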
Apply data protection controls that provide an appropriate level of control for each class of data
defined in your classification policy. This practice can allow you to protect sensitive data from
unauthorized access and use, while preserving the availability and use of data.
Desired outcome: You have a classification policy that defines the different levels of sensitivity for
data in your organization. For each of these sensitivity levels, you have clear guidelines published
for approved storage and handling services and locations, and their required configuration.
You implement the controls for each level according to the level of protection required and
their associated costs. You have monitoring and alerting in place to detect if data is present in
unauthorized locations, processed in unauthorized environments, accessed by unauthorized actors,
or the configuration of related services becomes non-compliant.
Common anti-patterns:
• Applying the same level of protection controls across all data. This may lead to over-provisioning
security controls for low-sensitivity data, or insufficient protection of highly sensitive data.
• Not involving relevant stakeholders from security, compliance, and business teams when
defining data protection controls.
• Overlooking the operational overhead and costs associated with implementing and maintaining
data protection controls.
• Not conducting periodic data protection control reviews to maintain alignment with
classification policies.
Benefits of establishing this best practice: By aligning your controls to the classification level of
your data, your organization can invest in higher levels of control where needed. This can include
increasing resources on securing, monitoring, measuring, remediating, and reporting. Where fewer
controls are appropriate, you can improve the accessibility and completeness of data for your
workforce, customers, or constituents. This approach gives your organization the most flexibility
with data usage, while still adhering to data protection requirements.
Implementation guidance
Implementing data protection controls based on data sensitivity levels involves several key steps.
First, identify the different data sensitivity levels within your workload architecture (such as public, internal, and confidential).
Related documents:
Related examples:
Related tools:
Automating the identification and classification of data can help you implement the correct
controls. Using automation to augment manual determination reduces the risk of human error and
exposure.
Desired outcome: You are able to verify whether the proper controls are in place based on
your classification and handling policy. Automated tools and services help you to identify and
classify the sensitivity level of your data. Automation also helps you continually monitor your
environments to detect and alert if data is being stored or handled in unauthorized ways so
corrective action can be taken quickly.
Common anti-patterns:
• Relying solely on manual processes for data identification and classification, which can be error-
prone and time-consuming. This can lead to inefficient and inconsistent data classification,
especially as data volumes grow.
• Not having mechanisms to track and manage data assets across the organization.
a. The automated sensitive data discovery capability of Macie can be used to perform ongoing
scans of your environments. Known S3 buckets that are authorized to store sensitive data can
be excluded using an allow list in Macie.
3. Incorporate identification and classification into your build and test processes.
a. Identify tools that developers can use to scan data for sensitivity while workloads are in
development. Use these tools as part of integration testing to alert when sensitive data is
unexpected and prevent further deployment.
4. Implement a system or runbook to take action when sensitive data is found in unauthorized
locations.
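As a sketch of the Macie-based discovery described in the steps above, the following hypothetical boto3 example enables Macie and starts a one-time classification job against a specific bucket; the account ID and bucket name are assumptions for illustration.

import uuid
import boto3

macie = boto3.client("macie2")

# Enable Macie in this account and Region (this call raises an error if it is already enabled).
macie.enable_macie()

# Start a one-time sensitive data discovery job against a hypothetical bucket.
macie.create_classification_job(
    clientToken=str(uuid.uuid4()),
    jobType="ONE_TIME",
    name="scan-uploads-bucket",
    s3JobDefinition={
        "bucketDefinitions": [
            {"accountId": "111122223333", "buckets": ["example-uploads-bucket"]}
        ]
    },
)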
Resources
Related documents:
• Amazon CloudWatch Logs: Help protect sensitive log data with masking
Related examples:
Related tools:
• Amazon Macie
• Amazon Comprehend
• AWS Glue
lifecycle mechanism, such as Amazon S3 lifecycle policies and the Amazon Data Lifecycle Manager,
to configure your data retention, archiving, and expiration processes.
Distinguish between data that is available for use and data that is stored as a backup. Consider
using AWS Backup to automate the backup of data across AWS services. Amazon EBS snapshots
provide a way to copy an EBS volume and store it using Amazon S3 features, including lifecycle, data
protection, and access protection mechanisms. Two of these mechanisms are S3 Object Lock
and AWS Backup Vault Lock, which can provide you with additional security and control over your
backups. Manage clear separation of duties and access for backups. Isolate backups at the account
level to maintain separation from the affected environment during an event.
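As a minimal sketch of the lifecycle mechanisms mentioned above, the following hypothetical boto3 example configures an S3 lifecycle rule that archives objects under a prefix to S3 Glacier Flexible Retrieval after 90 days and expires them after one year; the bucket name, prefix, and timings are assumptions for illustration.

import boto3

s3 = boto3.client("s3")

# Archive log objects after 90 days and delete them after 365 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-workload-logs",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)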
Another aspect of lifecycle management is recording the history of data as it progresses through
your workload, called data provenance tracking. This can give confidence that you know where
the data came from, any transformations performed, what owner or process made those changes,
and when. Having this history helps with troubleshooting issues and investigations during
potential security events. For example, you can log metadata about transformations in an Amazon
DynamoDB table. Within a data lake, you can keep copies of transformed data in different
S3 buckets for each data pipeline stage. Store schema and timestamp information in an AWS
Glue Data Catalog. Regardless of your solution, consider the requirements of your end users to
determine the appropriate tooling you need to report on your data provenance. This will help you
determine how to best track your provenance.
Implementation steps
1. Analyze the workload's data types, sensitivity levels, and access requirements to classify the data
and define appropriate lifecycle management strategies.
2. Design and implement data retention policies and automated destruction processes that align
with legal, regulatory, and organizational requirements.
3. Establish processes and automation for continuous monitoring, auditing, and adjustment of
data lifecycle management strategies, controls, and policies as workload requirements and
regulations evolve.
Resources
correct balance between key availability, confidentiality, and integrity. Access to keys should be
monitored, and key material rotated through an automated process. Key material should never be
accessible to human identities.
Common anti-patterns:
Benefits of establishing this best practice: By establishing a secure key management mechanism
for your workload, you can help provide protection for your content against unauthorized access.
Additionally, you may be subject to regulatory requirements to encrypt your data. An effective key
management solution can provide technical mechanisms aligned to those regulations to protect
key material.
Implementation guidance
Many regulatory requirements and best practices include encryption of data at rest as a
fundamental security control. In order to comply with this control, your workload needs a
mechanism to securely store and manage the key material used to encrypt your data at rest.
AWS offers AWS Key Management Service (AWS KMS) to provide durable, secure, and redundant
storage for AWS KMS keys. Many AWS services integrate with AWS KMS to support encryption of
your data. AWS KMS uses FIPS 140-2 Level 3 validated hardware security modules to protect your
keys. There is no mechanism to export AWS KMS keys in plain text.
When deploying workloads using a multi-account strategy, it is considered best practice to keep
AWS KMS keys in the same account as the workload that uses them. In this distributed model,
responsibility for managing the AWS KMS keys resides with the application team. In other use
cases, organizations may choose to store AWS KMS keys in a centralized account. This centralized
structure requires additional policies to enable the cross-account access required for the workload
account to access keys stored in the centralized account, but may be more applicable in use cases
where a single key is shared across multiple AWS accounts.
Regardless of where the key material is stored, access to the key should be tightly controlled
through the use of key policies and IAM policies. Key policies are the primary way to control access to an AWS KMS key.
• Review the best practices for access control to your AWS KMS keys.
4. Consider AWS Encryption SDK: Use the AWS Encryption SDK with AWS KMS integration when
your application needs to encrypt data client-side.
• AWS Encryption SDK
5. Enable IAM Access Analyzer to automatically review and notify if there are overly broad AWS
KMS key policies.
6. Enable Security Hub to receive notifications if there are misconfigured key policies, keys
scheduled for deletion, or keys without automated rotation enabled.
7. Determine the logging level appropriate for your AWS KMS keys. Since calls to AWS KMS,
including read-only events, are logged, the CloudTrail logs associated with AWS KMS can
become voluminous.
• Some organizations prefer to segregate the AWS KMS logging activity into a separate trail.
For more detail, see the Logging AWS KMS API calls with CloudTrail section of the AWS KMS
developers guide.
Resources
Related documents:
Related videos:
Additionally, Amazon Elastic Compute Cloud (Amazon EC2) and Amazon S3 support the
enforcement of encryption by setting default encryption. You can use AWS Config Rules to check
automatically that you are using encryption, for example, for Amazon Elastic Block Store (Amazon
EBS) volumes, Amazon Relational Database Service (Amazon RDS) instances, and Amazon S3
buckets.
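As a sketch of enforcing these defaults, the following hypothetical boto3 example turns on EBS encryption by default for the current Region and sets default SSE-KMS encryption on an S3 bucket; the bucket name and key ARN are assumptions for illustration.

import boto3

ec2 = boto3.client("ec2")
s3 = boto3.client("s3")

# Encrypt all newly created EBS volumes in this account and Region by default.
ec2.enable_ebs_encryption_by_default()

# Apply default SSE-KMS encryption to a bucket using a customer managed key.
s3.put_bucket_encryption(
    Bucket="example-data-bucket",  # hypothetical bucket name
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/example",
                },
                "BucketKeyEnabled": True,
            }
        ]
    },
)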
AWS also provides options for client-side encryption, allowing you to encrypt data prior to
uploading it to the cloud. The AWS Encryption SDK provides a way to encrypt your data using
envelope encryption. You provide the wrapping key, and the AWS Encryption SDK generates a
unique data key for each data object it encrypts. Consider AWS CloudHSM if you need a managed
single-tenant hardware security module (HSM). AWS CloudHSM allows you to generate, import,
and manage cryptographic keys on a FIPS 140-2 level 3 validated HSM. Some use cases for AWS
CloudHSM include protecting private keys for issuing a certificate authority (CA), and turning on
transparent data encryption (TDE) for Oracle databases. The AWS CloudHSM Client SDK provides
software that allows you to encrypt data client side using keys stored inside AWS CloudHSM prior
to uploading your data into AWS. The Amazon DynamoDB Encryption Client also allows you to
encrypt and sign items prior to upload into a DynamoDB table.
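As a minimal sketch of client-side envelope encryption with the AWS Encryption SDK for Python (installed separately as aws-encryption-sdk), the following hypothetical example wraps a unique data key under a KMS key; the key ARN is an assumption for illustration.

import aws_encryption_sdk
from aws_encryption_sdk import CommitmentPolicy

client = aws_encryption_sdk.EncryptionSDKClient(
    commitment_policy=CommitmentPolicy.REQUIRE_ENCRYPT_REQUIRE_DECRYPT
)

# Wrapping key: a KMS key you control (hypothetical ARN).
key_provider = aws_encryption_sdk.StrictAwsKmsMasterKeyProvider(
    key_ids=["arn:aws:kms:us-east-1:111122223333:key/example"]
)

# Encrypt locally; the SDK generates a unique data key per message and stores it,
# wrapped by the KMS key, alongside the ciphertext.
ciphertext, _header = client.encrypt(source=b"sensitive payload", key_provider=key_provider)

# Decrypt locally; KMS is called only to unwrap the data key.
plaintext, _header = client.decrypt(source=ciphertext, key_provider=key_provider)
assert plaintext == b"sensitive payload"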
Implementation steps
• Enforce encryption at rest for Amazon S3: Implement Amazon S3 bucket default encryption.
• Configure default encryption for new Amazon EBS volumes: Specify that you want all newly
created Amazon EBS volumes to be created in encrypted form, with the option of using the
default key provided by AWS or a key that you create.
• Configure encrypted Amazon Machine Images (AMIs): Copying an existing AMI with encryption
configured will automatically encrypt root volumes and snapshots.
• Configure Amazon RDS encryption: Configure encryption for your Amazon RDS database
clusters and snapshots at rest by using the encryption option.
• Create and configure AWS KMS keys with policies that limit access to the appropriate
principals for each classification of data: For example, create one AWS KMS key for encrypting
production data and a different key for encrypting development or test data. You can also
provide key access to other AWS accounts. Consider having different accounts for your
development and production environments. If your production environment needs to decrypt
artifacts in the development account, you can edit the CMK policy used to encrypt the development artifacts to give the production account the ability to decrypt them.
Common anti-patterns:
Benefits of establishing this best practice: Automation reduces the risk of misconfiguring
your data storage locations, helps prevent misconfigurations from entering your production
environments, and helps you detect and fix misconfigurations if they occur.
Implementation guidance
Automation is a theme throughout the practices for protecting your data at rest. SEC01-
BP06 Automate deployment of standard security controls describes how you can capture the
configuration of your resources using infrastructure as code (IaC) templates, such as with AWS
CloudFormation. These templates are committed to a version control system, and are used to
deploy resources on AWS through a CI/CD pipeline. These techniques equally apply to automating
the configuration of your data storage solutions, such as encryption settings on Amazon S3
buckets.
You can check the settings that you define in your IaC templates for misconfiguration in your CI/
CD pipelines using rules in AWS CloudFormation Guard. You can monitor settings that are not yet
available in CloudFormation or other IaC tooling for misconfiguration with AWS Config. Alerts that
Config generates for misconfigurations can be remediated automatically, as described in SEC04-
BP04 Initiate remediation for non-compliant resources.
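As a sketch of the Config-based monitoring described above, the following hypothetical boto3 example deploys the AWS managed rule that checks S3 buckets for default encryption; pairing it with an automated remediation configuration is left as an assumption.

import boto3

config = boto3.client("config")

# Deploy an AWS managed rule that flags S3 buckets without default encryption enabled.
config.put_config_rule(
    ConfigRule={
        "ConfigRuleName": "s3-default-encryption-enabled",
        "Description": "Checks that S3 buckets have default server-side encryption enabled.",
        "Source": {
            "Owner": "AWS",
            "SourceIdentifier": "S3_BUCKET_SERVER_SIDE_ENCRYPTION_ENABLED",
        },
    }
)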
Using automation as part of your permissions management strategy is also an integral component
of automated data protections. SEC03-BP02 Grant least privilege access and SEC03-BP04 Reduce
permissions continuously describe configuring least-privilege access policies that are continually
monitored by the AWS Identity and Access Management Access Analyzer to generate findings when
a. AWS Backup is a managed service that creates encrypted and secure backups of various
data sources on AWS. Elastic Disaster Recovery allows you to copy full server workloads
and maintain continuous data protection with a recovery point objective (RPO) measured
in seconds. You can configure both services to work together to automate creating data
backups and copying them to failover locations. This can help keep your data available when
impacted by either operational or security events.
Resources
Related documents:
• AWS Prescriptive Guidance: Automatically encrypt existing and new Amazon EBS volumes
• Ransomware Risk Management on AWS Using the NIST Cyber Security Framework (CSF)
Related examples:
• How to use AWS Config proactive rules and AWS CloudFormation Hooks to prevent creation of
noncompliant cloud resources
• Automate and centrally manage data protection for Amazon S3 with AWS Backup
• AWS re:Invent 2023 - Implement proactive data protection using Amazon EBS snapshots
• AWS re:Invent 2022 - Build and automate for resilience with modern data protection
Related tools:
Amazon S3 Glacier Vault Lock and Amazon S3 Object Lock provide mandatory access control for
objects in Amazon S3. Once a vault policy is locked with the compliance option, not even the root
user can change it until the lock expires.
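As a minimal sketch of S3 Object Lock, the following hypothetical boto3 example sets a default compliance-mode retention period on a bucket that was created with Object Lock enabled; the bucket name and retention period are assumptions for illustration.

import boto3

s3 = boto3.client("s3")

# The bucket must have been created with ObjectLockEnabledForBucket=True.
s3.put_object_lock_configuration(
    Bucket="example-audit-archive",  # hypothetical bucket name
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {
            # Compliance mode: no identity, including the root user, can shorten
            # or remove the retention period on protected object versions.
            "DefaultRetention": {"Mode": "COMPLIANCE", "Days": 365}
        },
    },
)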
Implementation steps
• Enforce access control: Enforce access control with least privileges, including access to
encryption keys.
• Separate data based on different classification levels: Use different AWS accounts for data
classification levels, and manage those accounts using AWS Organizations.
• Review AWS Key Management Service (AWS KMS) policies: Review the level of access granted
in AWS KMS policies.
• Review Amazon S3 bucket and object permissions: Regularly review the level of access granted
in S3 bucket policies. Best practice is to avoid using publicly readable or writeable buckets.
Consider using AWS Config to detect buckets that are publicly available, and Amazon CloudFront
to serve content from Amazon S3. Verify that buckets that should not allow public access are
properly configured to prevent public access. By default, all S3 buckets are private, and can only
be accessed by users that have been explicitly granted access.
• Use AWS IAM Access Analyzer: IAM Access Analyzer analyzes Amazon S3 buckets and generates
a finding when an S3 policy grants access to an external entity.
• Use Amazon S3 versioning and object lock when appropriate.
• Use Amazon S3 Inventory: Amazon S3 Inventory can be used to audit and report on the
replication and encryption status of your S3 objects.
• Review Amazon EBS and AMI sharing permissions: Sharing permissions can allow images and
volumes to be shared with AWS accounts that are external to your workload.
• Review AWS Resource Access Manager Shares periodically to determine whether resources
should continue to be shared. Resource Access Manager allows you to share resources, such
as AWS Network Firewall policies, Amazon Route 53 resolver rules, and subnets, within your
Amazon VPCs. Audit shared resources regularly and stop sharing resources which no longer need
to be shared.
Resources
Desired outcome: A secure certificate management system that can provision, deploy, store, and
renew certificates in a public key infrastructure (PKI). A secure key and certificate management
mechanism prevents certificate private key material from disclosure and automatically renews
the certificate on a periodic basis. It also integrates with other services to provide secure network
communications and identity for machine resources inside of your workload. Key material should
never be accessible to human identities.
Common anti-patterns:
Implementation guidance
Modern workloads make extensive use of encrypted network communications using PKI protocols
such as TLS. PKI certificate management can be complex, but automated certificate provisioning,
deployment, and renewal can reduce the friction associated with certificate management.
AWS provides two services to manage general-purpose PKI certificates: AWS Certificate Manager
and AWS Private Certificate Authority (AWS Private CA). ACM is the primary service that customers
use to provision, manage, and deploy certificates for use in both public-facing as well as private
AWS workloads. ACM issues certificates using AWS Private CA and integrates with many other AWS
managed services to provide secure TLS certificates for workloads.
AWS Private CA allows you to establish your own root or subordinate certificate authority and
issue TLS certificates through an API. You can use these kinds of certificates in scenarios where you
control and manage the trust chain on the client side of the TLS connection. In addition to TLS use cases, AWS Private CA can issue certificates for other purposes, such as code signing and device identity.
• Use ACM managed renewal for certificates issued by ACM along with integrated AWS
managed services.
3. Establish logging and audit trails:
• Enable CloudTrail logs to track access to the accounts holding certificate authorities. Consider
configuring log file integrity validation in CloudTrail to verify the authenticity of the log data.
• Periodically generate and review audit reports that list the certificates that your private CA has
issued or revoked. These reports can be exported to an S3 bucket.
• When deploying a private CA, you will also need to establish an S3 bucket to store the
Certificate Revocation List (CRL). For guidance on configuring this S3 bucket based on your
workload's requirements, see Planning a certificate revocation list (CRL).
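As a sketch of automated certificate provisioning with ACM, the following hypothetical boto3 example requests a public certificate with DNS validation, which allows ACM managed renewal; the domain names are assumptions for illustration.

import boto3

acm = boto3.client("acm")

# Request a public certificate; DNS validation enables automatic managed renewal
# for as long as the validation CNAME record stays in place.
response = acm.request_certificate(
    DomainName="app.example.com",  # hypothetical domain
    SubjectAlternativeNames=["www.app.example.com"],
    ValidationMethod="DNS",
)
print(response["CertificateArn"])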
Resources
Related documents:
• How to secure an enterprise scale ACM Private CA hierarchy for automotive and manufacturing
• Private CA best practices
Related videos:
Related examples:
• Private CA workshop
Additionally, you can use VPN connectivity into your VPC from an external network or AWS Direct
Connect to facilitate encryption of traffic. Verify that your clients are making calls to AWS APIs
using at least TLS 1.2, as AWS has deprecated the use of earlier TLS versions on its API endpoints. AWS
recommends using TLS 1.3. Third-party solutions are available in the AWS Marketplace if you have
special requirements.
Implementation steps
• Enforce encryption in transit: Your defined encryption requirements should be based on the
latest standards and best practices and only allow secure protocols. For example, configure a
security group to only allow the HTTPS protocol to an application load balancer or Amazon EC2
instance.
• Configure secure protocols in edge services: Configure HTTPS with Amazon CloudFront and use
a security profile appropriate for your security posture and use case.
• Use a VPN for external connectivity: Consider using an IPsec VPN for securing point-to-point or
network-to-network connections to help provide both data privacy and integrity.
• Configure secure protocols in load balancers: Select a security policy that provides the
strongest cipher suites supported by the clients that will be connecting to the listener. Create an
HTTPS listener for your Application Load Balancer.
• Configure secure protocols in Amazon Redshift: Configure your cluster to require a secure
socket layer (SSL) or transport layer security (TLS) connection.
• Configure secure protocols: Review AWS service documentation to determine encryption-in-
transit capabilities.
• Configure secure access when uploading to Amazon S3 buckets: Use Amazon S3 bucket policy
controls to enforce secure access to data.
• Consider using AWS Certificate Manager: ACM allows you to provision, manage, and deploy
public TLS certificates for use with AWS services.
• Consider using AWS Private Certificate Authority for private PKI needs: AWS Private CA allows
you to create private certificate authority (CA) hierarchies to issue end-entity X.509 certificates
that can be used to create encrypted TLS channels.
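As a sketch of the Amazon S3 bucket policy control listed above, the following hypothetical boto3 example denies any request to a bucket that is not made over TLS; the bucket name is an assumption for illustration, and the policy replaces any existing bucket policy, so merge statements if one already exists.

import json
import boto3

s3 = boto3.client("s3")
bucket = "example-data-bucket"  # hypothetical bucket name

# Deny all S3 actions on the bucket and its objects when the request is not sent over TLS.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{bucket}",
                f"arn:aws:s3:::{bucket}/*",
            ],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}

s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))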
Resources
Related documents:
• Enhances monitoring, logging, and incident response through request attribution and well-
defined communication interfaces.
• Provides defense-in-depth for your workloads by combining network controls with
authentication and authorization controls.
Implementation guidance
Your workload’s network traffic patterns can be characterized into two categories:
• East-west traffic represents traffic flows between services that make up a workload.
• North-south traffic represents traffic flows between your workload and consumers.
While it is common practice to encrypt north-south traffic, securing east-west traffic using
authenticated protocols is less common. Modern security practices recommend that network
design alone does not grant a trusted relationship between two entities. When two services may
reside within a common network boundary, it is still best practice to encrypt, authenticate, and
authorize communications between those services.
As an example, AWS service APIs use the AWS Signature Version 4 (SigV4) signature protocol to
authenticate the caller, no matter what network the request originates from. This authentication
ensures that AWS APIs can verify the identity that requested the action, and that identity can then
be combined with policies to make an authorization decision to determine whether the action
should be allowed or not.
Services such as Amazon VPC Lattice and Amazon API Gateway allow you to use the same SigV4
signature protocol to add authentication and authorization to east-west traffic in your own
workloads. If resources outside of your AWS environment need to communicate with services
that require SigV4-based authentication and authorization, you can use AWS Identity and
Access Management (IAM) Roles Anywhere on the non-AWS resource to acquire temporary AWS
credentials. These credentials can be used to sign requests to services using SigV4 to authorize
access.
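As a minimal sketch of SigV4-authenticated east-west calls, the following hypothetical example uses botocore to sign a request to an IAM-authorized API Gateway endpoint; the URL, service name, and Region are assumptions for illustration.

import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

session = boto3.Session()
region = "us-east-1"  # hypothetical Region
url = "https://abc123.execute-api.us-east-1.amazonaws.com/prod/orders"  # hypothetical endpoint

# Build and sign the request; the resulting headers carry the SigV4 signature that
# API Gateway (with IAM authorization) validates against the caller's policies.
request = AWSRequest(method="GET", url=url)
SigV4Auth(session.get_credentials(), "execute-api", region).add_auth(request)

signed_headers = dict(request.headers)  # includes Authorization and X-Amz-Date
# Send signed_headers with the HTTP client of your choice.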
Another common mechanism for authenticating east-west traffic is TLS mutual authentication
(mTLS). Many Internet of Things (IoT), business-to-business applications, and microservices use
mTLS to validate the identity of both sides of a TLS communication through the use of both client
and server-side X.509 certificates. These certificates can be issued by AWS Private Certificate Authority (AWS Private CA).
• For service-to-service communication using mTLS, consider API Gateway or App Mesh. AWS
Private CA can be used to establish a private CA hierarchy capable of issuing certificates for use
with mTLS.
• When integrating with services using OAuth 2.0 or OIDC, consider API Gateway using the JWT
authorizer.
• For communication between your workload and IoT devices, consider AWS IoT Core, which
provides several options for network traffic encryption and authentication.
• Monitor for unauthorized access: Continually monitor for unintended communication channels,
unauthorized principals attempting to access protected resources, and other improper access
patterns.
• If using VPC Lattice to manage access to your services, consider enabling and monitoring VPC
Lattice access logs. These access logs include information on the requesting entity, network
information including source and destination VPC, and request metadata.
• Consider enabling VPC flow logs to capture metadata on network flows and periodically
review for anomalies.
• Refer to the AWS Security Incident Response Guide and the Incident Response section of the
AWS Well-Architected Framework security pillar for more guidance on planning, simulating,
and responding to security incidents.
Resources
Related documents:
Identify internal and external personnel, resources, and legal obligations to help your organization
respond to an incident.
Desired outcome: You have a list of key personnel, their contact information, and the roles they
play when responding to a security event. You review this information regularly and update it
to reflect personnel changes, both internally and at external parties. You consider all
third-party service providers and vendors while documenting this information, including security
partners, cloud providers, and software-as-a-service (SaaS) applications. During a security event,
personnel are available with the appropriate level of responsibility, context, and access to be able
to respond and recover.
Common anti-patterns:
• Not maintaining an updated list of key personnel with contact information, their roles, and their
responsibilities when responding to security events.
• Assuming that everyone understands the people, dependencies, infrastructure, and solutions
when responding to and recovering from an event.
• Not having a document or knowledge repository that represents key infrastructure or application
design.
• Not having proper onboarding processes for new employees to effectively contribute to a
security event response, such as conducting event simulations.
• Not having an escalation path in place when key personnel are temporarily unavailable or fail to
respond during security events.
Benefits of establishing this best practice: This practice reduces the triage and response time
spent on identifying the right personnel and their roles during an event. Minimize wasted time
during an event by maintaining an updated list of key personnel and their roles so you can bring
the right individuals to triage and recover from an event.
Implementation guidance
Identify key personnel in your organization: Maintain a contact list of personnel within your
organization that you need to involve. Regularly review and update this information when personnel or organizational changes occur.
a. Identify the most appropriate contacts to engage during an incident. Define escalation plans
aligned to the roles of personnel to be engaged, rather than individual contacts. Consider
including contacts that may be responsible for informing external entities, even if they are
not directly engaged to resolve the incident.
Resources
• OPS02-BP03 Operations activities have identified owners responsible for their performance
Related documents:
Related examples:
Related tools:
Related videos:
The first document to develop for incident response is the incident response plan. The incident
response plan is designed to be the foundation for your incident response program and strategy.
Benefits of establishing this best practice: Developing thorough and clearly defined incident
response processes is key to a successful and scalable incident response program. When a security
event occurs, clear steps and workflows can help you to respond in a timely manner. You might
also consider creating a responsible, accountable, consulted, and informed (RACI) chart for your security response plans;
doing so facilitates quick and direct communication and clearly outlines the leadership across
different stages of the event.
During an incident, including the owners and developers of impacted applications and resources
is key because they are subject matter experts (SMEs) that can provide information and context
to aid in measuring impact. Make sure to practice and build relationships with the developers and
application owners before you rely on their expertise for incident response. Application owners or
SMEs, such as your cloud administrators or engineers, might need to act in situations where the
environment is unfamiliar or has complexity, or where the responders don’t have access.
Lastly, trusted partners might be involved in the investigation or response because they can
provide additional expertise and valuable scrutiny. When you don’t have these skills on your own
team, you might want to hire an external party for assistance.
• AWS Support
• AWS Support offers a range of plans that provide access to tools and expertise that support
the success and operational health of your AWS solutions. If you need technical support and
more resources to help plan, deploy, and optimize your AWS environment, you can select a
support plan that best aligns with your AWS use case.
• Consider the Support Center in AWS Management Console (sign-in required) as the central
point of contact to get support for issues that affect your AWS resources. Access to AWS
Support is controlled by AWS Identity and Access Management. For more information about
getting access to AWS Support features, see Getting started with AWS Support.
• AWS Customer Incident Response Team (CIRT)
• The AWS Customer Incident Response Team (CIRT) is a specialized 24/7 global AWS team that
provides support to customers during active security events on the customer side of the AWS
Shared Responsibility Model.
• When the AWS CIRT supports you, they provide assistance with triage and recovery for an
active security event on AWS. They can assist in root cause analysis through the use of AWS
service logs and provide you with recommendations for recovery. They can also provide
security recommendations and best practices to help you avoid security events in the future.
• AWS customers can engage the AWS CIRT through an AWS Support case.
• Phases of incident response and actions to take: Enumerates the phases of incident response
(for example, detect, analyze, contain, eradicate, and recover), including high-level actions to
take within those phases.
• Incident severity and prioritization definitions: Details how to classify the severity of an
incident, how to prioritize the incident, and then how the severity definitions affect escalation
procedures.
While these sections are common throughout companies of different sizes and industries, each
organization’s incident response plan is unique. You need to build an incident response plan that
works best for your organization.
Resources
Related documents:
Ahead of a security incident, consider developing forensics capabilities to support security event
investigations.
Concepts from traditional on-premises forensics apply to AWS. For key information to start
building forensics capabilities in the AWS Cloud, see Forensic investigation environment strategies
in the AWS Cloud.
Once you have your environment and AWS account structure set up for forensics, define the
technologies required to effectively perform forensically sound methodologies across the four
phases: collection, examination, analysis, and reporting. It is a best practice to
instrument the forensics accounts well ahead of an incident so that responders can be prepared to
effectively use them for response.
A sample account structure for this approach includes a forensics OU with per-Region forensics accounts.
Setting up backups of key systems and databases are critical for recovering from a security incident
and for forensics purposes. With backups in place, you can restore your systems to their previous
safe state. On AWS, you can take snapshots of various resources. Snapshots provide you with point-
in-time backups of those resources. There are many AWS services that can support you in backup
and recovery. For detail on these services and approaches for backup and recovery, see Backup and
Recovery Prescriptive Guidance and Use backups to recover from security incidents.
Especially when it comes to situations such as ransomware, it’s critical for your backups to be well
protected. For guidance on securing your backups, see Top 10 security best practices for securing
backups in AWS. In addition to securing your backups, you should regularly test your backup and
restore processes to verify that the technology and processes you have in place work as expected.
Automate forensics
During a security event, your incident response team must be able to collect and analyze evidence
quickly while maintaining accuracy for the time period surrounding the event (such as capturing
logs related to a specific event or resource, or collecting a memory dump of an Amazon EC2 instance).
Implementation guidance
• Expected incidents: Playbooks should be created for incidents you anticipate. This includes
threats like denial of service (DoS), ransomware, and credential compromise.
• Known security findings or alerts: Playbooks should be created for your known security findings
and alerts, such as GuardDuty findings. You might receive a GuardDuty finding and think, "Now
what?" To prevent the mishandling or ignoring of a GuardDuty finding, create a playbook for
each potential GuardDuty finding. Some remediation details and guidance can be found in
the GuardDuty documentation. It’s worth noting that GuardDuty is not enabled by default and
does incur a cost. For more detail on GuardDuty, see Appendix A: Cloud capability definitions -
Visibility and alerting.
Playbooks should contain technical steps for a security analyst to complete in order to adequately
investigate and respond to a potential security incident.
Implementation steps
• Playbook overview: What risk or incident scenario does this playbook address? What is the goal
of the playbook?
• Prerequisites: What logs, detection mechanisms, and automated tools are required for this
incident scenario? What is the expected notification?
• Communication and escalation information: Who is involved and what is their contact
information? What are each of the stakeholders’ responsibilities?
• Response steps: Across phases of incident response, what tactical steps should be taken? What
queries should an analyst run? What code should be run to achieve the desired outcome?
• Recover: How will the affected system or resource be brought back into production?
• Expected outcomes: After queries and code are run, what is the expected result of the playbook?
We recommend the use of temporary privilege escalation in the majority of incident response
scenarios. The correct way to do this is to use the AWS Security Token Service and session policies
to scope access.
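As a sketch of scoping temporary access with AWS STS, the following hypothetical boto3 example assumes an incident response role while attaching a session policy that further restricts the session to read-only investigation actions; the role ARN and allowed actions are assumptions for illustration.

import json
import boto3

sts = boto3.client("sts")

# Session policy: the effective permissions are the intersection of the role's
# attached policies and this inline policy, so the session cannot exceed either.
session_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:GetLogEvents",
                "logs:FilterLogEvents",
                "guardduty:Get*",
                "guardduty:List*",
            ],
            "Resource": "*",
        }
    ],
}

creds = sts.assume_role(
    RoleArn="arn:aws:iam::111122223333:role/IncidentResponderReadOnly",  # hypothetical role
    RoleSessionName="ir-investigation-12345",
    Policy=json.dumps(session_policy),
    DurationSeconds=3600,
)["Credentials"]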
There are scenarios where federated identities are unavailable, such as:
• Malicious activity, such as a distributed denial of service (DDoS) event or other activity that renders
the system unavailable.
In the preceding cases, there should be emergency break glass access configured to allow
investigation and timely remediation of incidents. We recommend that you use a user, group,
or role with appropriate permissions to perform tasks and access AWS resources. Use the root
user only for tasks that require root user credentials. To verify that incident responders have the
correct level of access to AWS and other relevant systems, we recommend the pre-provisioning
of dedicated accounts. The accounts require privileged access, and must be tightly controlled
and monitored. The accounts must be built with the fewest privileges required to perform the
necessary tasks, and the level of access should be based on the playbooks created as part of the
incident management plan.
Use purpose-built and dedicated users and roles as a best practice. Temporarily escalating user or
role access through the addition of IAM policies both makes it unclear what access users had during
the incident, and risks the escalated privileges not being revoked.
It is important to remove as many dependencies as possible to verify that access can be gained
under the widest possible number of failure scenarios. To support this, create a playbook to verify
that incident response users are created as users in a dedicated security account, and not managed
through any existing Federation or single sign-on (SSO) solution. Each individual responder must
have their own named account. The account configuration must enforce strong password policy
and multi-factor authentication (MFA). If the incident response playbooks only require access to
the AWS Management Console, the user should not have access keys configured and should be
explicitly disallowed from creating access keys. This can be configured with IAM policies or service
control policies (SCPs) as mentioned in the AWS Security Best Practices for AWS Organizations
SCPs. The users should have no privileges other than the ability to assume incident response roles
in other accounts.
As the incident response roles are likely to have a high level of access, it is important that these
alerts go to a wide group and are acted upon promptly.
During an incident, it is possible that a responder might require access to systems which are not
directly secured by IAM. These could include Amazon Elastic Compute Cloud instances, Amazon
Relational Database Service databases, or software-as-a-service (SaaS) platforms. It is strongly
recommended that rather than using native protocols such as SSH or RDP, AWS Systems Manager
Session Manager is used for all administrative access to Amazon EC2 instances. This access can be
controlled using IAM, which is secure and audited. It might also be possible to automate parts of
your playbooks using AWS Systems Manager Run Command documents, which can reduce user
error and improve time to recovery. For access to databases and third-party tools, we recommend
storing access credentials in AWS Secrets Manager and granting access to the incident responder
roles.
Finally, the management of the incident response IAM accounts should be added to your Joiners,
Movers, and Leavers processes and reviewed and tested periodically to verify that only the
intended access is allowed.
Resources
Related documents:
Related videos:
During a security investigation, you need to be able to review relevant logs to record and
understand the full scope and timeline of the incident. Logs are also required for alert generation,
indicating certain actions of interest have happened. It is critical to select, enable, store, and set up
querying and retrieval mechanisms, and set up alerting. Additionally, Amazon Detective provides
an effective way to search and analyze log data.
AWS offers over 200 cloud services and thousands of features. We recommend that you review the
services that can support and simplify your incident response strategy.
In addition to logging, you should develop and implement a tagging strategy. Tagging can help
provide context around the purpose of an AWS resource. Tagging can also be used for automation.
Implementation steps
AWS provides native detective, preventative, and responsive capabilities, and other services can
be used to architect custom security solutions. For a list of the most relevant services for security
incident response, see Cloud capability definitions.
Obtaining contextual information on the business use case and relevant internal stakeholders
surrounding an AWS resource can be difficult. One way to do this is in the form of tags, which
assign metadata to your AWS resources and consist of a user-defined key and value. You can create
tags to categorize resources by purpose, owner, environment, type of data processed, and other
criteria of your choice.
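As a minimal sketch of tagging-driven context, the following hypothetical boto3 example applies organizational tags to a resource and then finds resources by tag during an investigation; the ARNs and tag values are assumptions for illustration.

import boto3

tagging = boto3.client("resourcegroupstaggingapi")

# Apply ownership and data-classification tags to a resource (hypothetical ARN).
tagging.tag_resources(
    ResourceARNList=["arn:aws:s3:::example-uploads-bucket"],
    Tags={
        "Owner": "payments-team",
        "Environment": "production",
        "Classification": "Confidential",
    },
)

# During a response, quickly find every resource that processes confidential data.
paginator = tagging.get_paginator("get_resources")
for page in paginator.paginate(
    TagFilters=[{"Key": "Classification", "Values": ["Confidential"]}]
):
    for resource in page["ResourceTagMappingList"]:
        print(resource["ResourceARN"])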
Having a consistent tagging strategy can speed up response times and minimize time spent on
organizational context by allowing you to quickly identify and discern contextual information
about an AWS resource. Tags can also serve as a mechanism to initiate response automations.
For more detail on what to tag, see Tagging your AWS resources. You’ll want to first define the tags you want to use across your organization.
Implementation guidance
• Purple team exercises: Purple team exercises increase the level of collaboration between
the incident responders (blue team) and simulated threat actors (red team). The blue team
is comprised of members of the security operations center (SOC), but can also include other
stakeholders that would be involved during an actual cyber event. The red team is comprised
of a penetration testing team or key stakeholders that are trained in offensive security. The
red team works collaboratively with the exercise facilitators when designing a scenario so that
the scenario is accurate and feasible. During purple team exercises, the primary focus is on the
detection mechanisms, the tools, and the standard operating procedures (SOPs) supporting the
incident response efforts.
• Red team exercises: During a red team exercise, the offense (red team) conducts a simulation
to achieve a certain objective or set of objectives from a predetermined scope. The defenders
(blue team) will not necessarily have knowledge of the scope and duration of the exercise, which
provides a more realistic assessment of how they would respond to an actual incident. Because
red team exercises can be invasive tests, be cautious and implement controls to verify that the
exercise does not cause actual harm to your environment.
Consider facilitating cyber simulations at a regular interval. Each exercise type can provide unique
benefits to the participants and the organization as a whole, so you might choose to start with less
complex simulation types (such as tabletop exercises) and progress to more complex simulation
types (red team exercises). You should select a simulation type based on your security maturity,
resources, and your desired outcomes. Some customers might not choose to perform red team
exercises due to complexity and cost.
misconfigurations, not only improving your security posture, but also minimizing time lost to
preventable situations.
Implementation guidance
It's important to implement a lessons learned framework that establishes and achieves, at a high
level, the following points:
The framework should not focus on or blame individuals, but instead should focus on improving
tools and processes.
Implementation steps
Aside from the preceding high-level outcomes listed, it’s important to make sure that you ask the
right questions to derive the most value (information that leads to actionable improvements) from
the process. Consider these questions to help get you started in fostering your lessons learned
discussions:
Resources
Related documents:
• AWS Security Incident Response Guide - Establish a framework for learning from incidents
• NCSC CAF guidance - Lessons learned
Application security
Question
• SEC 11. How do you incorporate and validate the security properties of applications throughout
the design, development, and deployment lifecycle?
SEC 11. How do you incorporate and validate the security properties of
applications throughout the design, development, and deployment lifecycle?
Training people, testing using automation, understanding dependencies, and validating the
security properties of tools and applications help to reduce the likelihood of security issues in
production workloads.
Best practices
• SEC11-BP01 Train for application security
• SEC11-BP02 Automate testing throughout the development and release lifecycle
• SEC11-BP03 Perform regular penetration testing
• SEC11-BP04 Manual code reviews
• SEC11-BP05 Centralize services for packages and dependencies
• SEC11-BP06 Deploy software programmatically
• SEC11-BP07 Regularly assess security properties of the pipelines
• SEC11-BP08 Build a program that embeds security ownership in workload teams
Provide training to the builders in your organization on common practices for the secure
development and operation of applications. Adopting security focused development practices
helps reduce the likelihood of issues that are only detected at the security review stage.
Implementation steps
• Start builders with a course on threat modeling to build a good foundation, and help train them
on how to think about security.
• Provide access to AWS Training and Certification, industry, or AWS Partner training.
• Provide training on your organization's security review process, which clarifies the division of
responsibilities between the security team, workload teams, and other stakeholders.
• Publish self-service guidance on how to meet your security requirements, including code
examples and templates, if available.
• Regularly obtain feedback from builder teams on their experience with the security review
process and training, and use that feedback to improve.
• Use game days or bug bash campaigns to help reduce the number of issues, and increase the
skills of your builders.
Resources
Related documents:
Related videos:
Related examples:
Implementation guidance
As you build your software, adopt various mechanisms for software testing to ensure that you are
testing your application for both functional requirements, based on your application’s business
logic, and non-functional requirements, which are focused on application reliability, performance,
and security.
Static application security testing (SAST) analyzes your source code for anomalous security
patterns, and provides indications for defect prone code. SAST relies on static inputs, such as
documentation (requirements specification, design documentation, and design specifications)
and application source code to test for a range of known security issues. Static code analyzers
can help expedite the analysis of large volumes of code. The NIST Quality Group provides a
comparison of Source Code Security Analyzers, which includes open source tools for Byte Code
Scanners and Binary Code Scanners.
Complement your static testing with dynamic analysis security testing (DAST) methodologies,
which perform tests against the running application to identify potentially unexpected behavior.
Dynamic testing can be used to detect potential issues that are not detectable via static analysis.
Testing at the code repository, build, and pipeline stages allows you to catch different
types of potential issues before they enter your code. Amazon CodeWhisperer provides code
recommendations, including security scanning, in the builder’s IDE. Amazon CodeGuru Reviewer
can identify critical issues, security issues, and hard-to-find bugs during application development,
and provides recommendations to improve code quality.
The Security for Developers workshop uses AWS developer tools, such as AWS CodeBuild, AWS
CodeCommit, and AWS CodePipeline, for release pipeline automation that includes SAST and DAST
testing methodologies.
Related videos:
Related examples:
Perform regular penetration testing of your software. This mechanism helps identify potential
software issues that cannot be detected by automated testing or a manual code review. It can
also help you understand the efficacy of your detective controls. Penetration testing should try to
determine if the software can be made to perform in unexpected ways, such as exposing data that
should be protected, or granting broader permissions than expected.
Desired outcome: Penetration testing is used to detect, remediate, and validate your application’s
security properties. Regular and scheduled penetration testing should be performed as part of the
software development lifecycle (SDLC). The findings from penetration tests should be addressed
prior to the software being released. You should analyze the findings from penetration tests to
identify if there are issues that could be found using automation. Having a regular and repeatable
penetration testing process that includes an active feedback mechanism helps inform the guidance
to builders and improves software quality.
Common anti-patterns:
• Use tools to speed up the penetration testing process by automating common or repeatable
tests.
• Analyze penetration testing findings to identify systemic security issues, and use this data to
inform additional automated testing and ongoing builder education.
Resources
Related documents:
• AWS Penetration Testing provides detailed guidance for penetration testing on AWS
• Accelerate deployments on AWS with effective governance
• AWS Security Competency Partners
• Modernize your penetration testing architecture on AWS Fargate
• AWS Fault Injection Simulator
Related examples:
Perform a manual code review of the software that you produce. This process helps verify that the
person who wrote the code is not the only one checking the code quality.
Desired outcome: Including a manual code review step during development increases the quality
of the software being written, helps upskill less experienced members of the team, and provides
an opportunity to identify places where automation can be used. Manual code reviews can be
supported by automated tools and testing.
Common anti-patterns:
Resources
Related documents:
Related videos:
Related examples:
Provide centralized services for builder teams to obtain software packages and other
dependencies. This allows the validation of packages before they are included in the software
that you write, and provides a source of data for the analysis of the software being used in your
organization.
Desired outcome: Software is composed of a set of other software packages in addition to the
code that is being written. This makes it simple to consume implementations of functionality that
are repeatedly used, such as a JSON parser or an encryption library. Logically centralizing the
sources for these packages and dependencies provides a mechanism for security teams to validate
the properties of the packages before they are used. This approach also reduces the risk of an
unexpected issue being caused by a change in an existing package, or by builder teams including
arbitrary packages directly from the internet. Use this approach in conjunction with the manual code review practice described in SEC11-BP04.
• Regularly scan packages in your repository to identify the potential impact of newly discovered
issues.
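As a sketch of a centralized package source, the following hypothetical boto3 example fetches an AWS CodeArtifact authorization token and repository endpoint so that build tooling can be pointed at the curated repository instead of the public internet; the domain, account, and repository names are assumptions for illustration.

import boto3

codeartifact = boto3.client("codeartifact")

domain = "example-org"          # hypothetical CodeArtifact domain
owner = "111122223333"          # hypothetical domain owner account
repository = "approved-python"  # hypothetical curated repository

# Short-lived token used by pip/twine (or other package managers) to authenticate.
token = codeartifact.get_authorization_token(
    domain=domain, domainOwner=owner, durationSeconds=1800
)["authorizationToken"]

# Endpoint to configure as the package index for builds.
endpoint = codeartifact.get_repository_endpoint(
    domain=domain, domainOwner=owner, repository=repository, format="pypi"
)["repositoryEndpoint"]

index_url = endpoint.rstrip("/") + "/simple/"
print(f"pip index: {index_url} (authenticate with the token as the password, user 'aws')")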
Resources
Related documents:
Related videos:
Related examples:
Perform software deployments programmatically where possible. This approach reduces the
likelihood that a deployment fails or an unexpected issue is introduced due to human error.
Desired outcome: Keeping people away from data is a key principle of building securely in the AWS
Cloud. This principle includes how you deploy your software.
• Using AWS CodeBuild and AWS Code Pipeline to provide CI/CD capability makes it simple to
integrate security testing into your pipelines.
• Follow the guidance on separation of environments in the Organizing Your AWS Environment
Using Multiple Accounts whitepaper.
• Verify no persistent human access to environments where production workloads are running.
• Use cryptographic tools such as AWS Signer or AWS Key Management Service (AWS KMS) to sign
and verify the software packages that you are deploying.
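As a minimal sketch of the signing step above, the following hypothetical boto3 example signs a deployment artifact's digest with an asymmetric AWS KMS key and verifies it before deployment; the key ARN and artifact path are assumptions for illustration, and AWS Signer is the more integrated option for container images and Lambda code.

import hashlib
import boto3

kms = boto3.client("kms")
# Hypothetical asymmetric KMS key created with key usage SIGN_VERIFY.
key_arn = "arn:aws:kms:us-east-1:111122223333:key/example-signing-key"

# Hash the artifact locally and sign the digest with KMS.
digest = hashlib.sha256(open("build/artifact.zip", "rb").read()).digest()
signature = kms.sign(
    KeyId=key_arn,
    Message=digest,
    MessageType="DIGEST",
    SigningAlgorithm="RSASSA_PSS_SHA_256",
)["Signature"]

# Later, in the deployment step, verify the signature before installing the artifact.
result = kms.verify(
    KeyId=key_arn,
    Message=digest,
    MessageType="DIGEST",
    Signature=signature,
    SigningAlgorithm="RSASSA_PSS_SHA_256",
)
assert result["SignatureValid"]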
Resources
Related documents:
• Code signing using AWS Certificate Manager Private CA and AWS Key Management Service
asymmetric keys
Related videos:
Related examples:
pipeline implementation and analyzing logs for unexpected behavior can help you understand the
usage patterns of the pipelines being used to deploy software.
Implementation steps
Resources
Related documents:
Related examples:
Build a program or mechanism that empowers builder teams to make security decisions about the
software that they create. Your security team still needs to validate these decisions during a review,
but embedding security ownership in builder teams allows for faster, more secure workloads to be
built. This mechanism also promotes a culture of ownership that positively impacts the operation
of the systems you build.
Desired outcome: To embed security ownership and decision making in builder teams, you can
either train builders on how to think about security or you can augment their training with security
people embedded or associated with the builder teams. Either approach is valid and allows the
team to make higher quality security decisions earlier in the development cycle. This ownership
Implementation steps
Resources
Related documents:
Related videos:
Reliability
The Reliability pillar encompasses the ability of a workload to perform its intended function
correctly and consistently when it’s expected to. You can find prescriptive guidance on
implementation in the Reliability Pillar whitepaper.
automation remediation steps to verify that service quotas and constraints are not reached in ways
that could cause service degradation or disruption.
Common anti-patterns:
• Deploying a workload without understanding the hard or soft quotas and their limits for the
services used.
• Deploying a replacement workload without analyzing and reconfiguring the necessary quotas or
contacting Support in advance.
• Assuming that cloud services have no limits and the services can be used without consideration
to rates, limits, counts, quantities.
• Assuming that quotas will automatically be increased.
• Not knowing the process and timeline of quota requests.
• Assuming that the default cloud service quota is identical for every service and is the same across
Regions.
• Assuming that service constraints can be breached and the systems will auto-scale or
automatically increase limits beyond the resource’s constraints.
• Not testing the application at peak traffic in order to stress the utilization of its resources.
• Provisioning the resource without analysis of the required resource size.
• Overprovisioning capacity by choosing resource types that go well beyond actual need or
expected peaks.
• Not assessing capacity requirements for new levels of traffic in advance of a new customer event
or deploying a new technology.
Benefits of establishing this best practice: Monitoring and automated management of service
quotas and resource constraints can proactively reduce failures. Changes in traffic patterns for
a customer’s service can cause a disruption or degradation if best practices are not followed. By
monitoring and managing these values across all regions and all accounts, applications can have
improved resiliency under adverse or unplanned events.
Implementation guidance
Service Quotas is an AWS service that helps you manage your quotas for over 250 AWS services
from one location. Along with looking up the quota values, you can also request and track quota increases.
• The AWS Management Console provides methods to display service quota values, manage quotas,
request new quota increases, monitor the status of quota requests, and display quota history.
• The AWS CLI and CDKs offer programmatic methods to automatically manage and monitor service
quota levels and usage.
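As a sketch of the programmatic option, the following hypothetical boto3 example looks up a quota's applied value and requests an increase; the service and quota codes shown are examples, and you can discover the codes relevant to your workload with list_service_quotas.

import boto3

quotas = boto3.client("service-quotas")

# Example codes (look up the ones relevant to your workload with list_service_quotas).
service_code = "ec2"
quota_code = "L-1216C47A"  # Running On-Demand Standard instances (example quota code)

current = quotas.get_service_quota(ServiceCode=service_code, QuotaCode=quota_code)
print(current["Quota"]["QuotaName"], current["Quota"]["Value"])

# Request an increase; track it with list_requested_service_quota_change_history.
quotas.request_service_quota_increase(
    ServiceCode=service_code, QuotaCode=quota_code, DesiredValue=512.0
)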
Implementation steps
Related videos:
Related tools:
Implementation guidance
Service quotas are tracked per account. Unless otherwise noted, each quota is AWS Region-specific.
In addition to the production environments, also manage quotas in all applicable non-production
environments so that testing and development are not hindered. Maintaining a high degree of
resiliency requires that service quotas are assessed continually (whether automated or manual).
With more workloads spanning Regions due to the implementation of designs using Active/Active, Active/Passive-Hot, Active/Passive-Cold, and Active/Passive-Pilot Light approaches, it is essential to understand all Region and account quota levels. Past traffic patterns are not always a good indicator of whether the service quota is set correctly.
Equally important, the value of a given named service quota is not always the same in every Region. In one Region, the value could be five, and in another Region it could be ten. Management of these quotas must span all the same services, accounts, and Regions to provide consistent resilience under load.
Reconcile all service quota differences across Regions (active or passive) and create processes to continually reconcile these differences. Testing plans for passive Region failovers are rarely scaled to peak active capacity, which means that game day or tabletop exercises can fail to find differences in service quotas between Regions and to maintain the correct limits.
Service quota drift, the condition where the limit for a specific named quota is changed in one Region but not in all Regions, is very important to track and assess. Consider changing the quota in any Region that carries traffic or could potentially carry traffic.
• Select relevant accounts and Regions based on your service requirements, latency, regulatory,
and disaster recovery (DR) requirements.
• Identify service quotas across all relevant accounts, Regions, and Availability Zones. The limits
are scoped to account and Region. These values should be compared for differences.
Implementation steps
• Review service quota values that might have reached a risky level of usage. AWS Trusted Advisor provides alerts when usage breaches 80% and 90% of a quota.
• Review values for service quotas in any Passive Regions (in an Active/Passive design). Verify that
load will successfully run in secondary Regions in the event of a failure in the primary Region.
Related videos:
Related services:
Hard limits are shown in the Service Quotas console. If the ADJUSTABLE column shows No, the service has a hard limit. Hard limits are also shown on some resources' configuration pages. For example, Lambda has specific hard limits that cannot be adjusted.
As an example, when designing a Python application to run in a Lambda function, the application should be evaluated to determine whether there is any chance of Lambda running longer than 15 minutes. If the code might run longer than this service quota limit, alternate technologies or designs must be considered. If this limit is reached after production deployment, the application will suffer degradation and disruption until it can be remediated. Unlike soft quotas, there is no method to change these limits, even during emergency Severity 1 events.
Once the application has been deployed to a testing environment, strategies should be used to determine whether any hard limits can be reached. Stress testing, load testing, and chaos testing should be part of the introduction test plan.
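One way to surface hard limits during the design phase is to enumerate quotas for the services you plan to use and flag those that are not adjustable. The following is a minimal Python (boto3) sketch, assuming credentials are configured; the service codes listed are illustrative, and not every limit appears in Service Quotas, so treat the output as a starting point rather than a complete inventory.

import boto3

sq = boto3.client("service-quotas")
services_in_design = ["lambda", "dynamodb", "ec2"]  # illustrative service codes

for service in services_in_design:
    for page in sq.get_paginator("list_service_quotas").paginate(ServiceCode=service):
        for quota in page["Quotas"]:
            if not quota["Adjustable"]:
                # Hard limit: the design must stay within this value.
                print(f"{service}: {quota['QuotaName']} = {quota['Value']} (hard limit)")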
Implementation steps
• Review the complete list of AWS services that could be used in the application design phase.
• Review the soft quota limits and hard quota limits for all these services. Not all limits are shown
in the Service Quotas console. Some services describe these limits in alternate locations.
• As you design your application, review your workload’s business and technology drivers, such as business outcomes, use case, dependent systems, availability targets, and disaster recovery objectives. Let your business and technology drivers guide the process to identify the distributed system that is right for your workload.
• Analyze service load across Regions and accounts. Many hard limits are regionally based for
services. However, some limits are account based.
• Analyze resilience architectures for resource usage during a zonal failure and a Regional failure. In the progression of multi-Region designs using active/active, active/passive-hot, active/passive-cold, and active/passive-pilot light approaches, these failure cases will cause higher usage. This creates a potential scenario for hitting hard limits.
Resources
• Automating Service Limit Increases and Enterprise Support with AWS Control Tower
• Actions, resources, and condition keys for Service Quotas
Related videos:
Related tools:
• AWS CodeDeploy
• AWS CloudTrail
• Amazon CloudWatch
• Amazon EventBridge
• Amazon DevOps Guru
• AWS Config
• AWS Trusted Advisor
• AWS CDK
• AWS Systems Manager
• AWS Marketplace
Evaluate your potential usage and increase your quotas appropriately, allowing for planned growth
in usage.
Desired outcome: Active and automated systems that manage and monitor quotas have been deployed. These operational solutions detect when quota usage thresholds are close to being reached and proactively remediate by requesting quota changes.
Common anti-patterns:
• Capture your current quotas that are essential and applicable to the services using:
• AWS Service Quotas
• AWS Trusted Advisor
• AWS documentation
• AWS service-specific pages
• AWS Command Line Interface (AWS CLI)
• AWS Cloud Development Kit (AWS CDK)
• Use AWS Service Quotas, an AWS service that helps you manage your quotas for over 250 AWS
services from one location.
• Use Trusted Advisor service limits to monitor your current service limits at various thresholds.
• Use the service quota history (console or AWS CLI) to check on regional increases.
• Compare service quota changes in each Region and each account to create equivalency, if
required.
For management:
• Automated: Set up an AWS Config custom rule to scan service quotas across Regions and
compare for differences.
• Automated: Set up a scheduled Lambda function to scan service quotas across Regions and compare for differences (a minimal sketch follows this list).
• Manual: Use the AWS CLI, API, or AWS Management Console to scan service quotas across Regions and compare for differences. Report the differences.
• If differences in quotas are identified between Regions, request a quota change, if required.
• Review the result of all requests.
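The automated comparison described above can be as simple as a scheduled function that reads one named quota in each Region you operate in and reports any differences. A minimal Python (boto3) sketch follows; the Regions, service code, and quota code are examples to replace with your own, and a real solution would publish findings to Amazon SNS or a ticketing system rather than print them.

import boto3

REGIONS = ["us-east-1", "us-west-2"]            # example Regions
SERVICE_CODE = "ec2"                            # example service
QUOTA_CODE = "L-1216C47A"                       # example quota code; verify in the console

def scan_quota():
    values = {}
    for region in REGIONS:
        sq = boto3.client("service-quotas", region_name=region)
        quota = sq.get_service_quota(ServiceCode=SERVICE_CODE, QuotaCode=QUOTA_CODE)
        values[region] = quota["Quota"]["Value"]
    if len(set(values.values())) > 1:
        # Quota drift detected between Regions; report it for reconciliation.
        print("Quota drift detected:", values)
    return values

scan_quota()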
Resources
Related videos:
Related tools:
• AWS CodeDeploy
• AWS CloudTrail
• Amazon CloudWatch
• Amazon EventBridge
• Amazon DevOps Guru
• AWS Config
• AWS Trusted Advisor
• AWS CDK
• AWS Systems Manager
• AWS Marketplace
Implement tools to alert you when thresholds are being approached. You can automate quota
increase requests by using AWS Service Quotas APIs.
If you integrate your Configuration Management Database (CMDB) or ticketing system with Service
Quotas, you can automate the tracking of quota increase requests and current quotas. In addition
to the AWS SDK, Service Quotas offers automation using the AWS Command Line Interface (AWS
CLI).
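For example, a quota increase request and its subsequent tracking can be automated through the Service Quotas API, which is one way to feed a CMDB or ticketing integration. A minimal Python (boto3) sketch, assuming credentials are configured; the quota code and desired value are placeholders.

import boto3

sq = boto3.client("service-quotas")

# Request an increase (placeholder quota code and desired value).
request = sq.request_service_quota_increase(
    ServiceCode="lambda", QuotaCode="L-B99A9384", DesiredValue=2000
)["RequestedQuota"]
print("Request status:", request["Status"])

# Track open and recent requests, for example to sync with a CMDB or ticketing system.
history = sq.list_requested_service_quota_change_history(ServiceCode="lambda")
for item in history["RequestedQuotas"]:
    print(item["QuotaName"], item["DesiredValue"], item["Status"])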
Common anti-patterns:
• AWS Trusted Advisor Best Practice Checks (see the Service Limits section)
• Quota Monitor on AWS - AWS Solution
• Amazon EC2 Service Limits
• What is Service Quotas?
Related videos:
REL01-BP06 Ensure that a sufficient gap exists between the current quotas and the maximum
usage to accommodate failover
This article explains how to maintain space between the resource quota and your usage, and
how it can benefit your organization. After you finish using a resource, the usage quota may
continue to account for that resource. This can result in a failing or inaccessible resource. Prevent
resource failure by verifying that your quotas cover the overlap of inaccessible resources and their
replacements. Consider cases like network failure, Availability Zone failure, or Region failures when
calculating this gap.
Desired outcome: Small or large failures in resources or resource accessibility can be covered
within the current service thresholds. Zone failures, network failures, or even Regional failures have
been considered in the resource planning.
Common anti-patterns:
• Setting service quotas based on current needs without accounting for failover scenarios.
• Not considering the principles of static stability when calculating the peak quota for a service.
• Not considering the potential of inaccessible resources in calculating total quota needed for each
Region.
• Not considering AWS service fault isolation boundaries for some services and their potential
abnormal usage patterns.
Benefits of establishing this best practice: When service disruption events impact application
availability, use the cloud to implement strategies to recover from these events. An example
strategy is creating additional resources to replace inaccessible resources to accommodate failover
conditions without exhausting your service limit.
Resources
Related documents:
• AWS Marketplace
Workloads often exist in multiple environments. These include multiple cloud environments (both
publicly accessible and private) and possibly your existing data center infrastructure. Plans must
include network considerations such as intra- and intersystem connectivity, public IP address
management, private IP address management, and domain name resolution.
Best practices
• REL02-BP01 Use highly available network connectivity for your workload public endpoints
• REL02-BP02 Provision redundant connectivity between private networks in the cloud and on-
premises environments
• REL02-BP03 Ensure IP subnet allocation accounts for expansion and availability
• REL02-BP04 Prefer hub-and-spoke topologies over many-to-many mesh
• REL02-BP05 Enforce non-overlapping private IP address ranges in all private address spaces
where they are connected
REL02-BP01 Use highly available network connectivity for your workload public endpoints
Building highly available network connectivity to public endpoints of your workloads can help
you reduce downtime due to loss of connectivity and improve the availability and SLA of your
workload. To achieve this, use highly available DNS, content delivery networks (CDNs), API
gateways, load balancing, or reverse proxies.
Desired outcome: It is critical to plan, build, and operationalize highly available network
connectivity for your public endpoints. If your workload becomes unreachable due to a loss in
connectivity, even if your workload is running and available, your customers will see your system
as down. By combining the highly available and resilient network connectivity for your workload’s
public endpoints, along with a resilient architecture for your workload itself, you can provide the
best possible availability and service level for your customers.
AWS Global Accelerator, Amazon CloudFront, Amazon API Gateway, AWS Lambda Function URLs,
AWS AppSync APIs, and Elastic Load Balancing (ELB) all provide highly available public endpoints.
Amazon Route 53 provides a highly available DNS service for domain name resolution to verify
that your public endpoint addresses can be resolved.
instances. You can also use Amazon API Gateway along with AWS Lambda for a serverless solution.
Customers can also run workloads in multiple AWS Regions. With a multi-site active/active pattern, the workload can serve traffic from multiple Regions. With a multi-site active/passive pattern, the workload serves traffic from the active Region while data is replicated to the secondary Region, which becomes active in the event of a failure in the primary Region. Route 53 health checks can then be
used to control DNS failover from any endpoint in a primary Region to an endpoint in a secondary
Region, verifying that your workload is reachable and available to your users.
Amazon CloudFront provides a simple API for distributing content with low latency and high data
transfer rates by serving requests using a network of edge locations around the world. Content
delivery networks (CDNs) serve customers by delivering content that is located or cached at a location near the user. This also improves the availability of your application, as the load for content is shifted
away from your servers over to CloudFront’s edge locations. The edge locations and regional edge
caches hold cached copies of your content close to your viewers resulting in quick retrieval and
increasing reachability and availability of your workload.
For workloads with users spread out geographically, AWS Global Accelerator helps you improve
the availability and performance of the applications. AWS Global Accelerator provides Anycast
static IP addresses that serve as a fixed entry point to your application hosted in one or more
AWS Regions. This allows traffic to ingress onto the AWS global network as close to your users as
possible, improving reachability and availability of your workload. AWS Global Accelerator also
monitors the health of your application endpoints by using TCP, HTTP, and HTTPS health checks.
Any changes in the health or configuration of your endpoints permit redirection of user traffic to
healthy endpoints that deliver the best performance and availability to your users. In addition, AWS
Global Accelerator has a fault-isolating design that uses two static IPv4 addresses that are serviced
by independent network zones increasing the availability of your applications.
To help protect customers from DDoS attacks, AWS provides AWS Shield Standard. Shield Standard
comes automatically turned on and protects from common infrastructure (layer 3 and 4) attacks
like SYN/UDP floods and reflection attacks to support high availability of your applications
on AWS. For additional protections against more sophisticated and larger attacks (like UDP
floods), state exhaustion attacks (like TCP SYN floods), and to help protect your applications
running on Amazon Elastic Compute Cloud (Amazon EC2), Elastic Load Balancing (ELB), Amazon
CloudFront, AWS Global Accelerator, and Route 53, you can consider using AWS Shield Advanced.
For protection against Application layer attacks like HTTP POST or GET floods, use AWS WAF. AWS
WAF can use IP addresses, HTTP headers, HTTP body, URI strings, SQL injection, and cross-site
scripting conditions to determine if a request should be blocked or allowed.
protections from application layer HTTP POST and GET floods, review Getting started with AWS WAF. You can also use AWS WAF with CloudFront; see the documentation on how AWS WAF works with Amazon CloudFront features.
6. Set up additional DDoS protection: By default, all AWS customers receive protection from
common, most frequently occurring network and transport layer DDoS attacks that target
your web site or application with AWS Shield Standard at no additional charge. For additional
protection of internet-facing applications running on Amazon EC2, Elastic Load Balancing,
Amazon CloudFront, AWS Global Accelerator, and Amazon Route 53 you can consider AWS
Shield Advanced and review examples of DDoS resilient architectures. To protect your workload
and your public endpoints from DDoS attacks review Getting started with AWS Shield Advanced.
Resources
Related documents:
Implementation guidance
When using AWS Direct Connect to connect your on-premises network to AWS, you can achieve
maximum network resiliency (SLA of 99.99%) by using separate connections that end on distinct
devices in more than one on-premises location and more than one AWS Direct Connect location.
This topology offers resilience against device failures, connectivity issues, and complete location
outages. Alternatively, you can achieve high resiliency (SLA of 99.9%) by using two individual
connections to multiple locations (each on-premises location connected to a single Direct Connect
location). This approach protects against connectivity disruptions caused by fiber cuts or device
failures and helps mitigate complete location failures. The AWS Direct Connect Resiliency Toolkit
can assist in designing your AWS Direct Connect topology.
You can also consider AWS Site-to-Site VPN terminating on an AWS Transit Gateway as a cost-effective backup to your primary AWS Direct Connect connection. This setup enables equal-cost multipath (ECMP) routing across multiple VPN tunnels, allowing for aggregate throughput of up to 50 Gbps, even
though each VPN tunnel is capped at 1.25 Gbps. It's important to note, however, that AWS Direct
Connect is still the most effective choice for minimizing network disruptions and providing stable
connectivity.
When using VPNs over the internet to connect your cloud environment to your on-premises data
center, configure two VPN tunnels as part of a single Site-to-Site VPN connection. Each tunnel should terminate in a different Availability Zone for high availability and use redundant hardware
to prevent on-premises device failure. Additionally, consider multiple internet connections
from various internet service providers (ISPs) at your on-premises location to avoid complete
VPN connectivity disruption due to a single ISP outage. Selecting ISPs with diverse routing and
infrastructure, especially those with separate physical paths to AWS endpoints, provides high
connectivity availability.
In addition to physical redundancy with multiple AWS Direct Connect connections and multiple
VPN tunnels (or a combination of both), implementing Border Gateway Protocol (BGP) dynamic
routing is also crucial. Dynamic BGP provides automatic rerouting of traffic from one path to
another based on real-time network conditions and configured policies. This dynamic behavior
is especially beneficial in maintaining network availability and service continuity in the event of
link or network failures. It quickly selects alternative paths, enhancing the network's resilience and
reliability.
Implementation steps
Related videos:
• AWS re:Invent 2018: Advanced VPC Design and New Capabilities for Amazon VPC
• AWS re:Invent 2019: AWS Transit Gateway reference architectures for many VPCs
Amazon VPC IP address ranges must be large enough to accommodate workload requirements,
including factoring in future expansion and allocation of IP addresses to subnets across Availability
Zones. This includes load balancers, EC2 instances, and container-based applications.
When you plan your network topology, the first step is to define the IP address space itself. Private
IP address ranges (following RFC 1918 guidelines) should be allocated for each VPC. Accommodate
the following requirements as part of this process:
• Allow IP address space for more than one VPC per Region.
• Within a VPC, allow space for multiple subnets so that you can cover multiple Availability Zones.
• Consider leaving unused CIDR block space within a VPC for future expansion.
• Ensure that there is IP address space to meet the needs of any transient fleets of Amazon EC2
instances that you might use, such as Spot Fleets for machine learning, Amazon EMR clusters, or
Amazon Redshift clusters. Similar consideration should be given to Kubernetes clusters, such as
Amazon Elastic Kubernetes Service (Amazon EKS), as each Kubernetes pod is assigned a routable
address from the VPC CIDR block by default.
• Note that the first four IP addresses and the last IP address in each subnet CIDR block are
reserved and not available for your use.
• Note that the initial VPC CIDR block allocated to your VPC cannot be changed or deleted, but
you can add additional non-overlapping CIDR blocks to the VPC. Subnet IPv4 CIDRs cannot be
changed, however IPv6 CIDRs can.
• The largest possible VPC CIDR block is a /16, and the smallest is a /28.
• Consider other connected networks (VPC, on-premises, or other cloud providers) and ensure non-
overlapping IP address space. For more information, see REL02-BP05 Enforce non-overlapping
private IP address ranges in all private address spaces where they are connected.
Desired outcome: A scalable IP subnet plan helps you accommodate future growth and avoid unnecessary waste.
Resources
• REL02-BP05 Enforce non-overlapping private IP address ranges in all private address spaces
where they are connected
Related documents:
Related videos:
• AWS re:Invent 2018: Advanced VPC Design and New Capabilities for Amazon VPC (NET303)
• AWS re:Invent 2019: AWS Transit Gateway reference architectures for many VPCs (NET406-R1)
• AWS re:Invent 2023: AWS Ready for what's next? Designing networks for growth and flexibility
(NET310)
When connecting multiple private networks, such as Virtual Private Clouds (VPCs) and on-premises
networks, opt for a hub-and-spoke topology over a meshed one. Unlike meshed topologies, where
each network connects directly to the others and increases the complexity and management
hub-and-spoke topologies. When you use AWS Transit Gateway, you can establish connections and
centralize traffic routing across multiple networks.
Implementation guidance
• If needed, create VPN connections or Direct Connect gateways and associate them with the
Transit Gateway.
• Define how traffic is routed between the connected VPCs and other connections through
configuration of your Transit Gateway route tables.
• Use Amazon CloudWatch to monitor and adjust configurations as necessary for performance and
cost optimization.
Resources
Related documents:
Related videos:
Implementation steps
Resources
• Protecting networks
Related documents:
Related videos:
workload built to scale from the start needs. When refactoring an existing monolith, you will
need to consider how well the application will support a decomposition towards statelessness.
Breaking services into smaller pieces allows small, well-defined teams to develop and manage
them. However, smaller services can introduce complexities which include possible increased
latency, more complex debugging, and increased operational burden.
Common anti-patterns:
• The microservice Death Star is a situation in which the atomic components become so highly
interdependent that a failure of one results in a much larger failure, making the components as
rigid and fragile as a monolith.
Benefits of establishing this best practice:
• More specific segments lead to greater agility, organizational flexibility, and scalability.
• Reduced impact of service interruptions.
• Application components may have different availability requirements, which can be supported by
a more atomic segmentation.
• Well-defined responsibilities for teams supporting the workload.
Implementation guidance
Choose your architecture type based on how you will segment your workload. Choose an SOA or
microservices architecture (or in some rare cases, a monolithic architecture). Even if you choose
to start with a monolith architecture, you must ensure that it’s modular and can ultimately
evolve to SOA or microservices as your product scales with user adoption. SOA and microservices
offer, respectively, smaller degrees of segmentation, which is preferred for a modern, scalable, and reliable architecture, but there are trade-offs to consider, especially when deploying a microservice
architecture.
One primary trade-off is that you now have a distributed compute architecture that can make it
harder to achieve user latency requirements and there is additional complexity in the debugging
and tracing of user interactions. You can use AWS X-Ray to assist you in solving this problem.
Another effect to consider is increased operational complexity as you increase the number of
applications that you are managing, which requires the deployment of multiple independent components.
• For existing monoliths with a single, shared database, choose how to reorganize the data into
smaller segments. This could be by business unit, access pattern, or data structure. At this point
in the refactoring process, you should choose to move forward with a relational or non-relational
(NoSQL) type of database. For more details, see From SQL to NoSQL.
Resources
Related documents:
Related examples:
Related videos:
focused on business domains. When working with existing monolithic applications, you can
take advantage of decomposition patterns that provide established techniques to modernize
applications into services.
Domain-driven design
Implementation steps
• Teams can hold event storming workshops to quickly identify events, commands, aggregates and
domains in a lightweight sticky note format.
• Once domain entities and functions have been formed in a domain context, you can divide your
domain into services using bounded context, where entities that share similar features and
attributes are grouped together. With the model divided into contexts, a template for how to
draw microservice boundaries emerges.
• For example, the Amazon.com website entities might include package, delivery, schedule,
price, discount, and currency.
• Package, delivery, and schedule are grouped into the shipping context, while price, discount,
and currency are grouped into the pricing context.
• Decomposing monoliths into microservices outlines patterns for refactoring microservices. Using
patterns for decomposition by business capability, subdomain, or transaction aligns well with
domain-driven approaches.
• Tactical techniques such as the bubble context allow you to introduce DDD in existing or legacy
applications without up-front rewrites and full commitments to DDD. In a bubble context
approach, a small bounded context is established using a service mapping and coordination, or
anti-corruption layer, which protects the newly defined domain model from external influences.
After teams have performed domain analysis and defined entities and service contracts, they can
take advantage of AWS services to implement their domain-driven design as cloud-based services.
• Microservices
• Test-driven development
• Behavior-driven development
Related examples:
Related tools:
Service contracts are documented agreements between API producers and consumers defined in
a machine-readable API definition. A contract versioning strategy allows consumers to continue
using the existing API and migrate their applications to a newer API when they are ready. Producer
deployment can happen any time as long as the contract is followed. Service teams can use the
technology stack of their choice to satisfy the API contract.
Desired outcome: Applications built with service-oriented or microservice architectures are able to
operate independently while having integrated runtime dependency. Changes deployed to an API
consumer or producer do not interrupt the stability of the overall system when both sides follow a
common API contract. Components that communicate over service APIs can perform independent
functional releases, upgrades to runtime dependencies, or fail over to a disaster recovery (DR) site
with little or no impact to each other. In addition, discrete services are able to independently scale
absorbing resource demand without requiring other services to scale in unison.
Common anti-patterns:
• Creating service APIs without strongly typed schemas. This results in APIs that cannot be used to
generate API bindings and payloads that can’t be programmatically validated.
• Not adopting a versioning strategy, which forces API consumers to update and release or fail
when service contracts evolve.
• Importing an OpenAPI definition simplifies the creation of your API and can be integrated with
AWS infrastructure as code tools like the AWS Serverless Application Model and AWS Cloud
Development Kit (AWS CDK).
• Exporting an API definition simplifies integrating with API testing tools and provides service consumers with an integration specification.
• You can define and manage GraphQL APIs with AWS AppSync by defining a GraphQL schema file
to generate your contract interface and simplify interaction with complex REST models, multiple
database tables or legacy services.
• AWS Amplify projects that are integrated with AWS AppSync generate strongly typed JavaScript
query files for use in your application as well as an AWS AppSync GraphQL client library for
Amazon DynamoDB tables.
• When you consume service events from Amazon EventBridge, events adhere to schemas that
already exist in the schema registry or that you define with the OpenAPI Spec. With a schema
defined in the registry, you can also generate client bindings from the schema contract to
integrate your code with events.
• Extend or version your API. Extending an API is a simpler option when the added fields can be optional or can be given default values for required fields.
• JSON based contracts for protocols like REST and GraphQL can be a good fit for contract
extension.
• XML based contracts for protocols like SOAP should be tested with service consumers to
determine the feasibility of contract extension.
• When versioning an API, consider implementing proxy versioning where a facade is used to
support versions so that logic can be maintained in a single codebase.
• With API Gateway you can use request and response mappings to simplify absorbing contract
changes by establishing a facade to provide default values for new fields or to strip removed
fields from a request or response. With this approach the underlying service can maintain a
single codebase.
Resources
Best practices
• REL04-BP01 Identify the kind of distributed systems you depend on
• REL04-BP02 Implement loosely coupled dependencies
• REL04-BP03 Do constant work
• REL04-BP04 Make all responses idempotent
Desired outcome: Design a workload that effectively interacts with synchronous, asynchronous,
and batch dependencies.
Common anti-patterns:
• Workload waits indefinitely for a response from its dependencies, which could lead to workload
clients timing out, not knowing if their request has been received.
• Workload uses a chain of dependent systems that call each other synchronously. This requires
each system to be available and to successfully process a request before the whole chain can
succeed, leading to potentially brittle behavior and reduced overall availability.
• Workload communicates with its dependencies asynchronously and relies on the concept of exactly-once guaranteed delivery of messages, when often it is still possible to receive duplicate
messages.
• Your workload should not rely on multiple synchronous dependencies to perform a single
function. This chain of dependencies increases overall brittleness because all dependencies in the
pathway need to be available in order for the request to complete successfully.
• When a dependency is unhealthy or unavailable, determine your error handling and retry
strategies. Avoid using bimodal behavior. Bimodal behavior is when your workload exhibits
different behavior under normal and failure modes. For more details on bimodal behavior, see
REL11-BP05 Use static stability to prevent bimodal behavior.
• Keep in mind that failing fast is better than making your workload wait. For instance, the AWS
Lambda Developer Guide describes how to handle retries and failures when you invoke Lambda
functions.
• Set timeouts when your workload calls its dependency. This technique avoids waiting too long or
waiting indefinitely for a response. For helpful discussion of this issue, see Tuning AWS Java SDK
HTTP request settings for latency-aware Amazon DynamoDB applications.
• Minimize the number of calls made from your workload to its dependency to fulfill a single
request. Having chatty calls between them increases coupling and latency.
Asynchronous dependency
To temporally decouple your workload from its dependency, they should communicate
asynchronously. Using an asynchronous approach, your workload can continue with any other
processing without having to wait for its dependency, or chain of dependencies, to send a
response.
When your workload needs to communicate asynchronously with its dependency, consider the
following guidance:
• Determine whether to use messaging or event streaming based on your use case and
requirements. Messaging allows your workload to communicate with its dependency by sending
and receiving messages through a message broker. Event streaming allows your workload and
its dependency to use a streaming service to publish and subscribe to events, delivered as
continuous streams of data, that need to be processed as soon as possible.
• Messaging and event streaming handle messages differently so you need to make trade-off
decisions based on:
• Message priority: message brokers can process high-priority messages ahead of normal
messages. In event streaming, all messages have the same priority.
• Define the time window when your workload should run the batch job. Your workload can set
up a recurrence pattern to invoke a batch system, for example, every hour or at the end of every
month (a minimal scheduling sketch follows this list).
• Determine the location of the data input and the processed data output. Choose a storage
service, such as Amazon Simple Storage Services (Amazon S3), Amazon Elastic File System
(Amazon EFS), and Amazon FSx for Lustre, that allows your workload to read and write files at
scale.
• If your workload needs to invoke multiple batch jobs, you could leverage AWS Step Functions
to simplify the orchestration of batch jobs that run in AWS or on-premises. This sample project
demonstrates orchestration of batch jobs using Step Functions, AWS Batch, and Lambda.
• Monitor batch jobs to look for abnormalities, such as a job taking longer than it should to
complete. You could use tools like CloudWatch Container Insights to monitor AWS Batch
environments and jobs. In this instance, your workload would stop the next job from beginning
and inform the relevant staff of the exception.
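As a sketch of the scheduling step above, an Amazon EventBridge rule can invoke an orchestration target, such as an AWS Step Functions state machine, on a recurrence pattern. The following Python (boto3) example is illustrative only; the ARNs and schedule are placeholders, and the rule assumes the state machine and an IAM role that EventBridge can assume already exist.

import boto3

events = boto3.client("events")

# Placeholder ARNs for an existing state machine and the role EventBridge assumes to start it.
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:111122223333:stateMachine:nightly-batch"
EVENTS_ROLE_ARN = "arn:aws:iam::111122223333:role/eventbridge-invoke-stepfunctions"

# Run the batch workflow every day at 02:00 UTC.
events.put_rule(Name="nightly-batch-window", ScheduleExpression="cron(0 2 * * ? *)")
events.put_targets(
    Rule="nightly-batch-window",
    Targets=[{"Id": "batch-orchestrator", "Arn": STATE_MACHINE_ARN, "RoleArn": EVENTS_ROLE_ARN}],
)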
Resources
Related documents:
To further improve resiliency through loose coupling, make component interactions asynchronous
where possible. This model is suitable for any interaction that does not need an immediate
response and where an acknowledgment that a request has been registered will suffice. It involves
one component that generates events and another that consumes them. The two components
do not integrate through direct point-to-point interaction but usually through an intermediate
durable storage layer, such as an Amazon SQS queue, a streaming data platform such as Amazon
Kinesis, or AWS Step Functions.
Figure 4: Dependencies such as queuing systems and load balancers are loosely coupled
Amazon SQS queues and AWS Step Functions are just two ways to add an intermediate layer
for loose coupling. Event-driven architectures can also be built in the AWS Cloud using Amazon
EventBridge, which can abstract clients (event producers) from the services they rely on (event
consumers). Amazon Simple Notification Service (Amazon SNS) is an effective solution when you
need high-throughput, push-based, many-to-many messaging. Using Amazon SNS topics, your
publisher systems can fan out messages to a large number of subscriber endpoints for parallel
processing.
While queues offer several advantages, in most hard real-time systems, requests older than a
threshold time (often seconds) should be considered stale (the client has given up and is no longer waiting for a response) and should not be processed.
Implementation steps
• Components in an event-driven architecture are initiated by events. Events are actions that
happen in a system, such as a user adding an item to a cart. When an action is successful, an
event is generated that actuates the next component of the system.
• Building Event-driven Applications with Amazon EventBridge
• AWS re:Invent 2022 - Designing Event-Driven Integrations using Amazon EventBridge
• Distributed messaging systems have three main parts that need to be implemented for a queue
based architecture. They include components of the distributed system, the queue that is used
for decoupling (distributed on Amazon SQS servers), and the messages in the queue. A typical
system has producers which initiate the message into the queue, and the consumer which
receives the message from the queue. The queue stores messages across multiple Amazon SQS
servers for redundancy (a minimal producer and consumer sketch follows this list).
• Basic Amazon SQS architecture
• Send Messages Between Distributed Applications with Amazon Simple Queue Service
• Microservices, when well-utilized, enhance maintainability and boost scalability, as loosely
coupled components are managed by independent teams. It also allows for the isolation of
behaviors to a single component in case of changes.
• Implementing Microservices on AWS
• Let's Architect! Architecting microservices with containers
• With AWS Step Functions you can build distributed applications, automate processes, orchestrate
microservices, among other things. The orchestration of multiple components into an automated
workflow allows you to decouple dependencies in your application.
• Create a Serverless Workflow with AWS Step Functions and AWS Lambda
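A minimal producer and consumer sketch for the queue-based decoupling described in the list above, written in Python (boto3). It assumes a queue already exists; the queue name and message body are placeholders, and the processing step is a stand-in for your own logic.

import boto3

sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="orders-queue")["QueueUrl"]  # placeholder queue name

# Producer: the upstream component only needs the queue to be available, not the consumer.
sqs.send_message(QueueUrl=queue_url, MessageBody='{"order_id": "1234"}')

# Consumer: polls independently and deletes messages only after successful processing.
response = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20)
for message in response.get("Messages", []):
    print("processing", message["Body"])  # replace with your processing logic
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])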
Workload architecture 507
AWS Well-Architected Framework Framework
For example, if the health check system is monitoring 100,000 servers, the load on it is nominal
under the normally light server failure rate. However, if a major event makes half of those servers
unhealthy, then the health check system would be overwhelmed trying to update notification
systems and communicate state to its clients. So instead the health check system should send
the full snapshot of the current state each time. 100,000 server health states, each represented
by a bit, would only be a 12.5-KB payload. Whether no servers are failing, or all of them are, the
health check system is doing constant work, and large, rapid changes are not a threat to the system
stability. This is actually how Amazon Route 53 handles health checks for endpoints (such as IP
addresses) to determine how end users are routed to them.
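The arithmetic behind the 12.5-KB figure, and the reason the payload never changes size, can be sketched as follows: each server contributes one bit regardless of its health, so 100,000 servers require 100000 / 8 = 12,500 bytes. The Python illustration below is a simplified teaching example, not Route 53's implementation.

def health_snapshot(health_states):
    """Pack a list of booleans (True = healthy) into a fixed-size byte payload."""
    payload = bytearray((len(health_states) + 7) // 8)
    for i, healthy in enumerate(health_states):
        if healthy:
            payload[i // 8] |= 1 << (i % 8)
    return bytes(payload)

# The payload is 12,500 bytes whether all servers are healthy or all are failing.
snapshot = health_snapshot([True] * 100000)
print(len(snapshot))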
Implementation guidance
• Do constant work so that systems do not fail when there are large, rapid changes in load.
• Implement loosely coupled dependencies. Dependencies such as queuing systems, streaming
systems, workflows, and load balancers are loosely coupled. Loose coupling helps isolate
behavior of a component from other components that depend on it, increasing resiliency and
agility.
• The Amazon Builders' Library: Reliability, constant work, and a good cup of coffee
• AWS re:Invent 2018: Close Loops and Opening Minds: How to Take Control of Systems, Big and
Small ARC337 (includes constant work)
• For the example of a health check system monitoring 100,000 servers, engineer workloads
so that payload sizes remain constant regardless of number of successes or failures.
Resources
Related documents:
• The Amazon Builders' Library: Reliability, constant work, and a good cup of coffee
Related videos:
• AWS New York Summit 2019: Intro to Event-driven Architectures and Amazon EventBridge
(MAD205)
• AWS re:Invent 2018: Close Loops and Opening Minds: How to Take Control of Systems, Big and
Small ARC337 (includes loose coupling, constant work, static stability)
Best practices
critical functionality of the component to callers or customers. These considerations can become
additional requirements that can be tested and verified. Ideally, a component is able to perform its
core function in an acceptable manner even when one or multiple dependencies fail.
This is as much a business discussion as a technical one. All business requirements are important
and should be fulfilled if possible. However, it still makes sense to ask what should happen when
not all of them can be fulfilled. A system can be designed to be available and consistent, but
under circumstances where one requirement must be dropped, which one is more important? For
payment processing, it might be consistency. For a real-time application, it might be availability.
For a customer facing website, the answer may depend on customer expectations.
What this means depends on the requirements of the component and what should be considered
its core function. For example:
• An ecommerce website might display data from multiple different systems like personalized
recommendations, highest ranked products, and status of customer orders on the landing
page. When one upstream system fails, it still makes sense to display everything else instead of
showing an error page to a customer.
• A component performing batch writes can still continue processing a batch if one of the
individual operations fails. It should be simple to implement a retry mechanism. This can be
done by returning information on which operations succeeded, which failed, and why they failed
to the caller, or putting failed requests into a dead letter queue to implement asynchronous
retries. Information about failed operations should be logged as well.
• A system that processes transactions must verify that either all or no individual updates are
executed. For distributed transactions, the saga pattern can be used to roll back previous
operations in case a later operation of the same transaction fails. Here, the core function is
maintaining consistency.
• Time critical systems should be able to deal with dependencies not responding in a timely
manner. In these cases, the circuit breaker pattern can be used. When responses from a
dependency start timing out, the system can switch to an open state in which no additional calls are made.
• An application may read parameters from a parameter store. It can be useful to create container
images with a default set of parameters and use these in case the parameter store is unavailable.
Note that the pathways taken in case of component failure need to be tested and should be
significantly simpler than the primary pathway. Generally, fallback strategies should be avoided.
Resources
Related documents:
Related videos:
• Retry, backoff, and jitter: AWS re:Invent 2019: Introducing The Amazon Builders’ Library
(DOP328)
Related examples:
• Well-Architected lab: Level 300: Implementing Health Checks and Managing Dependencies to
Improve Reliability
Desired outcome: Large volume spikes either from sudden customer traffic increases, flooding
attacks, or retry storms are mitigated by request throttling, allowing workloads to continue normal
processing of supported request volume.
Common anti-patterns:
• API endpoint throttles are not implemented or are left at default values without considering
expected volumes.
Amazon API Gateway implements the token bucket algorithm according to account and region
limits and can be configured per-client with usage plans. Additionally, Amazon Simple Queue
Service (Amazon SQS) and Amazon Kinesis can buffer requests to smooth out the request rate, and
allow higher throttling rates for requests that can be addressed. Finally, you can implement rate
limiting with AWS WAF to throttle specific API consumers that generate unusually high load.
Implementation steps
You can configure API Gateway with throttling limits for your APIs and return 429 Too Many
Requests errors when limits are exceeded. You can use AWS WAF with your AWS AppSync and
API Gateway endpoints to enable rate limiting on a per IP address basis. Additionally, where your
system can tolerate asynchronous processing, you can put messages into a queue or stream to
speed up responses to service clients, which allows you to burst to higher throttle rates.
With asynchronous processing, when you’ve configured Amazon SQS as an event source for AWS
Lambda, you can configure maximum concurrency to avoid high event rates from consuming
available account concurrent execution quota needed for other services in your workload or
account.
While API Gateway provides a managed implementation of the token bucket, in cases where
you cannot use API Gateway, you can take advantage of language specific open-source
implementations (see related examples in Resources) of the token bucket for your services.
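Where API Gateway cannot be used, the token bucket can also be written directly. The following is a minimal, single-process Python sketch for illustration; production services typically rely on tested open-source implementations, as noted above, and a distributed workload would need shared state.

import time

class TokenBucket:
    """Allow up to `rate` requests per second with bursts up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should respond with 429 Too Many Requests

bucket = TokenBucket(rate=100, capacity=200)
if not bucket.allow():
    pass  # throttle the request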
• Understand and configure API Gateway throttling limits at the account level per region, API per
stage, and API key per usage plan levels.
Related videos:
Related tools:
Use exponential backoff to retry requests at progressively longer intervals between each retry.
Introduce jitter between retries to randomize retry intervals. Limit the maximum number of retries.
Desired outcome: Typical components in a distributed software system include servers, load
balancers, databases, and DNS servers. During normal operation, these components can respond
to requests with errors that are temporary or limited, and also errors that would be persistent
regardless of retries. When clients make requests to services, the requests consume resources
including memory, threads, connections, ports, or any other limited resources. Controlling and
limiting retries is a strategy to release and minimize consumption of resources so that system
components under strain are not overwhelmed.
When client requests time out or receive error responses, they should determine whether or not
to retry. If they do retry, they do so with exponential backoff with jitter and a maximum retry
value. As a result, backend services and processes are given relief from load and time to self-heal,
resulting in faster recovery and successful request servicing.
when calling services that are idempotent and where retries improve your client availability. Decide
what the timeouts are and when to stop retrying based on your use case. Build and exercise testing
scenarios for those retry use cases.
Implementation steps
• Determine the optimal layer in your application stack to implement retries for the services your
application relies on.
• Be aware of existing SDKs that implement proven retry strategies with exponential backoff and
jitter for your language of choice, and favor these over writing your own retry implementations.
• Verify that services are idempotent before implementing retries. Once retries are implemented,
be sure they are both tested and regularly exercised in production.
• When calling AWS service APIs, use the AWS SDKs and AWS CLI and understand the retry
configuration options. Determine if the defaults work for your use case, test, and adjust as
needed.
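For AWS SDK calls, the retry behavior described above can be configured rather than hand-written. A minimal Python (boto3) sketch follows; the retry mode, attempt count, and table name are examples to adjust and test for your use case.

import boto3
from botocore.config import Config

# The standard and adaptive retry modes apply exponential backoff with jitter and cap attempts.
retry_config = Config(retries={"max_attempts": 5, "mode": "adaptive"})

dynamodb = boto3.client("dynamodb", config=retry_config)
# Throttled or transient errors on this call are retried with backoff and jitter by the SDK.
dynamodb.describe_table(TableName="example-table")  # placeholder table name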
Resources
Related documents:
Related examples:
• Spring Retry
• Not clearing backlogged messages from a queue, when there is no value in processing these
messages if the business need no longer exists.
• Configuring first in first out (FIFO) queues when last in first out (LIFO) queues would better serve
client needs, for example when strict ordering is not required and backlog processing is delaying
all new and time sensitive requests resulting in all clients experiencing breached service levels.
• Exposing internal queues to clients instead of exposing APIs that manage work intake and place
requests into internal queues.
• Combining too many work request types into a single queue which can exacerbate backlog
conditions by spreading resource demand across request types.
• Processing complex and simple requests in the same queue, despite needing different
monitoring, timeouts and resource allocations.
• Not validating inputs or using assertions to implement fail fast mechanisms in software that
bubble up exceptions to higher level components that can handle errors gracefully.
• Not removing faulty resources from request routing, especially when failures are gray, emitting both successes and failures due to crashing and restarting, intermittent dependency failure,
reduced capacity, or network packet loss.
Benefits of establishing this best practice: Systems that fail fast are easier to debug and fix, and
often expose issues in coding and configuration before releases are published into production.
Systems that incorporate effective queueing strategies provide greater resilience and reliability to
traffic spikes and intermittent system fault conditions.
Implementation guidance
Fail fast strategies can be coded into software solutions as well as configured into infrastructure.
In addition to failing fast, queues are a straightforward yet powerful architectural technique to
decouple system components and smooth load. Amazon CloudWatch provides capabilities to monitor
for and alarm on failures. Once a system is known to be failing, mitigation strategies can be
invoked, including failing away from impaired resources. When systems implement queues with
Amazon SQS and other queue technologies to smooth load, they must consider how to manage
queue backlogs, as well as message consumption failures.
Related examples:
Related videos:
Related tools:
• Amazon SQS
• Amazon MQ
• AWS IoT Core
• Amazon CloudWatch
Set timeouts appropriately on connections and requests, verify them systematically, and do not
rely on default values as they are not aware of workload specifics.
Desired outcome: Client timeouts should consider the cost to the client, server, and workload
associated with waiting for requests that take abnormal amounts of time to complete. Since it is
not possible to know the exact cause of any timeout, clients must use knowledge of services to
develop expectations of probable causes and appropriate timeouts.
Client connections time out based on configured values. After encountering a timeout, clients make
decisions to back off and retry or open a circuit breaker. These patterns avoid issuing requests that
may exacerbate an underlying error condition.
Common anti-patterns:
Services should also protect themselves from abnormally expensive content with throttles and
server-side timeouts.
• Requests that take abnormally long due to a service impairment can be timed out and retried.
Consideration should be given to service costs for the request and retry, but if the cause is
a localized impairment, a retry is not likely to be expensive and will reduce client resource
consumption. The timeout may also release server resources depending on the nature of the
impairment.
• Requests that take a long time to complete because the request or response has failed to be
delivered by the network can be timed out and retried. Because the request or response was
not delivered, failure would have been the outcome regardless of the length of timeout. Timing
out in this case will not release server resources, but it will release client resources and improve
workload performance.
Take advantage of well-established design patterns like retries and circuit breakers to handle
timeouts gracefully and support fail-fast approaches. AWS SDKs and AWS CLI allow for
configuration of both connection and request timeouts and for retries with exponential backoff
and jitter. AWS Lambda functions support configuration of timeouts, and with AWS Step Functions,
you can build low code circuit breakers that take advantage of pre-built integrations with AWS
services and SDKs. AWS App Mesh Envoy provides timeout and circuit breaker capabilities.
Implementation steps
• Configure timeouts on remote service calls and take advantage of built-in language timeout
features or open source timeout libraries.
• When your workload makes calls with an AWS SDK, review the documentation for language
specific timeout configuration.
• Python
• PHP
• .NET
• Ruby
• Java
• Go
• Node.js
• C++
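As one concrete example of the SDK guidance above, boto3 exposes both connection and request (read) timeouts alongside retry settings. The values in this sketch are illustrative and should be derived from your workload's latency expectations rather than copied as-is.

import boto3
from botocore.config import Config

# Fail fast: do not rely on defaults, which are not aware of workload specifics.
timeout_config = Config(
    connect_timeout=2,   # seconds to establish the connection
    read_timeout=5,      # seconds to wait for a response
    retries={"max_attempts": 3, "mode": "standard"},
)

s3 = boto3.client("s3", config=timeout_config)
s3.list_buckets()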
Related examples:
• Using the circuit breaker pattern with AWS Step Functions and Amazon DynamoDB
Related tools:
• AWS SDKs
• AWS Lambda
• Amazon SQS
Systems should either not require state, or should offload state such that between different client
requests, there is no dependence on locally stored data on disk and in memory. This allows servers
to be replaced at will without causing an availability impact.
When users or services interact with an application, they often perform a series of interactions that
form a session. A session is unique data for users that persists between requests while they use
the application. A stateless application is an application that does not need knowledge of previous
interactions and does not store session information.
Once designed to be stateless, you can then use serverless compute services, such as AWS Lambda
or AWS Fargate.
• Design a stateless architecture after you identify which state and user data need to be persisted
with your storage solution of choice.
Resources
Related documents:
Emergency levers are rapid processes that can mitigate availability impact on your workload.
Desired outcome: By implementing emergency levers, you can establish known-good processes
to maintain the availability of critical components in your workload. The workload should degrade
gracefully and continue to perform its business-critical functions during the activation of an
emergency lever. For more detail on graceful degradation, see REL05-BP01 Implement graceful
degradation to transform applicable hard dependencies into soft dependencies.
Common anti-patterns:
• Not testing or verifying critical component behavior during non-critical component impairment.
• No clear and deterministic criteria defined for activation or deactivation of an emergency lever.
• Finding the right metrics to monitor depends on your workload. Some example metrics are
latency or the number of failed requests to a dependency.
• Define the procedures, manual or automated, that comprise the emergency lever.
• This may include mechanisms such as load shedding, throttling requests, or implementing
graceful degradation.
Resources
Related documents:
• Any Day Can Be Prime Day: How Amazon.com Search Uses Chaos Engineering to Handle Over
84K Requests Per Second
Related videos:
Change management
Questions
AWS makes an abundance of monitoring and log information available for consumption that can
be used to define workload-specific metrics and change-in-demand processes, and to adopt machine learning techniques regardless of ML expertise.
In addition, monitor all of your external endpoints to ensure that they are independent of your
base implementation. This active monitoring can be done with synthetic transactions (sometimes
referred to as user canaries, but not to be confused with canary deployments) which periodically
run a number of common tasks matching actions performed by clients of the workload. Keep
these tasks short in duration and be sure not to overload your workload during testing. Amazon
CloudWatch Synthetics allows you to create synthetic canaries to monitor your endpoints and APIs.
You can also combine the synthetic canary client nodes with AWS X-Ray console to pinpoint which
synthetic canaries are experiencing issues with errors, faults, or throttling rates for the selected
time frame.
Desired outcome:
Collect and use critical metrics from all components of the workload to ensure workload reliability
and optimal user experience. Detecting that a workload is not achieving business outcomes allows
you to quickly declare a disaster and recover from an incident.
Common anti-patterns:
Benefits of establishing this best practice: Monitoring at all tiers in your workload allows you to
more rapidly anticipate and resolve problems in the components that comprise the workload.
Implementation guidance
1. Turn on logging where available. Monitoring data should be obtained from all components of
the workloads. Turn on additional logging, such as S3 Access Logs, and permit your workload
User guides:
• Creating a trail
• Monitoring memory and disk metrics for Amazon EC2 Linux instances
• Using CloudWatch Logs with container instances
• VPC Flow Logs
• What is Amazon DevOps Guru?
• What is AWS X-Ray?
Related blogs:
Store log data and apply filters where necessary to calculate metrics, such as counts of a specific
log event, or latency calculated from log event timestamps.
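For example, a metric filter can turn the count of a specific log event into a CloudWatch metric that you can graph and alarm on. A minimal Python (boto3) sketch follows; the log group name, filter pattern, and namespace are placeholders.

import boto3

logs = boto3.client("logs")

logs.put_metric_filter(
    logGroupName="/my-app/application",   # placeholder log group
    filterName="application-errors",
    filterPattern="ERROR",                # count log events containing the term ERROR
    metricTransformations=[{
        "metricName": "ApplicationErrorCount",
        "metricNamespace": "MyApp",       # placeholder namespace
        "metricValue": "1",
    }],
)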
• The Amazon Builders' Library: Instrumenting distributed systems for operational visibility
When organizations detect potential issues, they send real-time notifications and alerts to the
appropriate personnel and systems in order to respond quickly and effectively to these issues.
Desired outcome: Rapid responses to operational events are possible through configuration of
relevant alarms based on service and application metrics. When alarm thresholds are breached, the
appropriate personnel and systems are notified so they can address underlying issues.
Common anti-patterns:
• Configuring alarms with an excessively high threshold, resulting in the failure to send vital
notifications.
• Configuring alarms with a threshold that is too low, resulting in inaction on important alerts due
to the noise of excessive notifications.
• Not updating alarms and their threshold when usage changes.
• For alarms best addressed through automated actions, sending the notification to personnel
instead of generating the automated action results in excessive notifications being sent.
Benefits of establishing this best practice: Sending real-time notifications and alerts to the
appropriate personnel and systems allows for early detection of issues and rapid responses to
operational incidents.
Implementation guidance
Workloads should be equipped with real-time processing and alarming to improve the detectability
of issues that could impact the availability of the application and serve as triggers for automated
response. Organizations can perform real-time processing and alarming by creating alerts with
defined metrics in order to receive notifications whenever significant events occur or a metric
exceeds a threshold.
Amazon CloudWatch allows you to create metric and composite alarms using CloudWatch
alarms based on static threshold, anomaly detection, and other criteria. For more detail on the
types of alarms you can configure using CloudWatch, see the alarms section of the CloudWatch
documentation.
Many AWS services (more than 30, including Amazon EC2, Amazon S3, and Amazon RDS) can send
Amazon SNS messages when you configure them to do so.
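For example, the following AWS CLI sketch creates an alarm on a hypothetical application latency
metric and notifies an Amazon SNS topic when the p99 value stays above a threshold (the metric
name, namespace, and topic ARN are placeholders; the threshold of 2000 assumes the metric is
published in milliseconds):

  aws cloudwatch put-metric-alarm \
    --alarm-name my-app-p99-latency-high \
    --namespace MyApp \
    --metric-name Latency \
    --extended-statistic p99 \
    --period 60 \
    --evaluation-periods 5 \
    --threshold 2000 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts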
Implementation steps
Resources
Related documents:
agreements (SLAs). Automation can range from self-healing activities of single components to full-
site failover.
Common anti-patterns:
Benefits of establishing this best practice: Automating alarm processing can improve system
resiliency. The system takes corrective actions automatically, reducing the manual, error-prone
interventions that would otherwise be required. The workload continues to meet its availability
goals, and service disruption is reduced.
Implementation guidance
To effectively manage alerts and automate their response, categorize alerts based on their
criticality and impact, document response procedures, and plan responses before ranking tasks.
Identify tasks requiring specific actions (often detailed in runbooks), and examine all runbooks and
playbooks to determine which tasks can be automated. If actions can be defined, often they can be
automated. If actions cannot be automated, document manual steps in an SOP and train operators
on them. Continually challenge manual processes for automation opportunities where you can
establish and maintain a plan to automate alert responses.
Implementation steps
1. Create an inventory of alarms: To obtain a list of all alarms, you can use the AWS CLI with the
Amazon CloudWatch command describe-alarms. Depending upon how many alarms you
have set up, you might have to use pagination to retrieve a subset of alarms for each call, or
alternatively you can use the AWS SDK to obtain the alarms using an API call.
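A minimal AWS CLI sketch of building that inventory with pagination:

  # First page of up to 100 alarms; the response includes a NextToken if more remain
  aws cloudwatch describe-alarms --max-items 100

  # Subsequent pages: pass the NextToken value back in
  aws cloudwatch describe-alarms --max-items 100 --starting-token <NextToken-value>

If you omit --max-items, the AWS CLI paginates through all results automatically.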
Related documents:
Related videos:
Related examples:
• Reliability Workshops
• Amazon CloudWatch and Systems Manager Workshop
Collect log files and metrics histories and analyze these for broader trends and workload insights.
Amazon CloudWatch Logs Insights supports a simple yet powerful query language that you can use
to analyze log data. Amazon CloudWatch Logs also supports subscriptions that allow data to flow
seamlessly to Amazon S3, where you can use Amazon Athena to query the data. It also supports
Frequently review how workload monitoring is implemented and update it based on significant
events and changes.
Effective monitoring is driven by key business metrics. Ensure these metrics are accommodated in
your workload as business priorities change.
Auditing your monitoring helps ensure that you know when an application is meeting its
availability goals. Root cause analysis requires the ability to discover what happened when failures
occur. AWS provides services that allow you to track the state of your services during an incident:
• Amazon CloudWatch Logs: You can store your logs in this service and inspect their contents.
• Amazon CloudWatch Logs Insights: A fully managed service that allows you to analyze massive
volumes of log data in seconds, with fast, interactive queries and visualizations (see the example
after this list).
• AWS Config: You can see what AWS infrastructure was in use at different points in time.
• AWS CloudTrail: You can see which AWS APIs were invoked at what time and by what principal.
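As an example of querying logs during an investigation, a CloudWatch Logs Insights query can be
started from the AWS CLI (the log group name is hypothetical, and the date commands assume GNU
date):

  QUERY_ID=$(aws logs start-query \
    --log-group-name /my-app/application \
    --start-time $(date -d '1 hour ago' +%s) \
    --end-time $(date +%s) \
    --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20' \
    --query queryId --output text)

  # Results are available shortly after the query completes
  aws logs get-query-results --query-id "$QUERY_ID"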
At AWS, we conduct a weekly meeting to review operational performance and to share learnings
between teams. Because there are so many teams in AWS, we created The Wheel to randomly pick
a workload to review. Establishing a regular cadence for operational performance reviews and
knowledge sharing enhances your ability to achieve higher performance from your operational
teams.
Common anti-patterns:
Trace requests as they process through service components so product teams can more easily
analyze and debug issues and improve performance.
Desired outcome: Workloads with comprehensive tracing across all components are easy to
debug, improving mean time to resolution (MTTR) of errors and latency by simplifying root cause
discovery. End-to-end tracing reduces the time it takes to discover impacted components and drill
into the detailed root causes of errors or latency.
Common anti-patterns:
• Tracing is used for some components but not for all. For example, without tracing for AWS
Lambda, teams might not clearly understand latency caused by cold starts in a spiky workload.
• Synthetic canaries or real-user monitoring (RUM) are not configured with tracing. Without
canaries or RUM, client interaction telemetry is omitted from the trace analysis yielding an
incomplete performance profile.
• Hybrid workloads include both cloud-native and third-party tracing tools, but steps have not
been taken to select and fully integrate a single tracing solution. Based on the selected tracing
solution, either cloud-native tracing SDKs should be used to instrument components that are not
cloud native, or third-party tools should be configured to ingest cloud-native trace telemetry.
Benefits of establishing this best practice: When development teams are alerted to issues, they
can see a full picture of system component interactions, including component by component
correlation to logging, performance, and failures. Because tracing makes it easy to visually identify
root causes, less time is spent investigating root causes. Teams that understand component
interactions in detail make better and faster decisions when resolving issues. Decisions like when
to invoke disaster recovery (DR) failover or where to best implement self-healing strategies can
be improved by analyzing systems traces, ultimately improving customer satisfaction with your
services.
Implementation guidance
Teams that operate distributed applications can use tracing tools to establish a correlation
identifier, collect traces of requests, and build service maps of connected components. All
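For example, once tracing is in place, the service map and error traces can be retrieved from AWS
X-Ray with the AWS CLI (a sketch; the time window below assumes GNU date):

  # Service map of connected components for the last hour
  aws xray get-service-graph \
    --start-time $(date -d '1 hour ago' +%s) \
    --end-time $(date +%s)

  # Summaries of traces that recorded an error in the same window
  aws xray get-trace-summaries \
    --start-time $(date -d '1 hour ago' +%s) \
    --end-time $(date +%s) \
    --filter-expression "error"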
Related documents:
• The Amazon Builders' Library: Instrumenting distributed systems for operational visibility
• Availability and Beyond: Understanding and Improving the Resilience of Distributed Systems on
AWS
Related examples:
Related videos:
Related tools:
• AWS X-Ray
• Amazon CloudWatch
• Amazon Route 53
S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per
second.
Configure and use Amazon CloudFront or a trusted content delivery network (CDN). A CDN can
provide faster end-user response times and can serve requests for content from cache, therefore
reducing the need to scale your workload.
Common anti-patterns:
• Implementing Auto Scaling groups for automated healing, but not implementing elasticity.
• Using automatic scaling to respond to large increases in traffic.
• Deploying highly stateful applications, eliminating the option of elasticity.
Benefits of establishing this best practice: Automation removes the potential for manual error
in deploying and decommissioning resources. Automation removes the risk of cost overruns and
denial of service due to slow response on needs for deployment or decommissioning.
Implementation guidance
• Configure and use AWS Auto Scaling. This monitors your applications and automatically adjusts
capacity to maintain steady, predictable performance at the lowest possible cost. Using AWS
Auto Scaling, you can set up application scaling for multiple resources across multiple services
(see the example after this list).
• What is AWS Auto Scaling?
• Configure Auto Scaling on your Amazon EC2 instances and Spot Fleets, Amazon ECS tasks,
Amazon DynamoDB tables and indexes, Amazon Aurora Replicas, and AWS Marketplace
appliances as applicable.
• Managing throughput capacity automatically with DynamoDB Auto Scaling
• Use service API operations to specify the alarms, scaling policies, warm up times, and
cool down times.
• Use Elastic Load Balancing. Load balancers can distribute load by path or by network
connectivity.
• What is Elastic Load Balancing?
• Application Load Balancers can distribute load by path.
• What is an Application Load Balancer?
• Configure Amazon CloudFront distributions for your workloads, or use a third-party CDN.
• You can limit access to your workloads so that they are only accessible from CloudFront by
using the IP ranges for CloudFront in your endpoint security groups or access policies.
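As an example of the scaling-policy APIs mentioned above, the following sketch attaches a target
tracking policy to a hypothetical Auto Scaling group so that it scales to keep average CPU
utilization near 70%:

  aws autoscaling put-scaling-policy \
    --auto-scaling-group-name my-asg \
    --policy-name cpu-target-70 \
    --policy-type TargetTrackingScaling \
    --target-tracking-configuration '{"PredefinedMetricSpecification":{"PredefinedMetricType":"ASGAverageCPUUtilization"},"TargetValue":70.0}'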
Resources
Related documents:
• APN Partner: partners that can help you create automated compute solutions
• AWS Auto Scaling: How Scaling Plans Work
• AWS Marketplace: products that can be used with auto scaling
• Managing Throughput Capacity Automatically with DynamoDB Auto Scaling
• Using a load balancer with an Auto Scaling group
• What Is AWS Global Accelerator?
• What Is Amazon EC2 Auto Scaling?
• What is AWS Auto Scaling?
• What is Amazon CloudFront?
• What is Amazon Route 53?
• What is Elastic Load Balancing?
• What is a Network Load Balancer?
• What is an Application Load Balancer?
• Working with records
You first must configure health checks and the criteria on these checks to indicate when availability
is impacted by lack of resources. Then, either notify the appropriate personnel to manually scale
the resource, or start automation to automatically scale it.
Scale can be manually adjusted for your workload (for example, changing the number of EC2
instances in an Auto Scaling group, or modifying throughput of a DynamoDB table through the
capacity to handle sudden increases in traffic, without throttling. For more detail, see
Managing throughput capacity automatically with DynamoDB auto scaling.
Resources
Related documents:
REL07-BP03 Obtain resources upon detection that more resources are needed for a workload
Many AWS services automatically scale to meet demand. If using Amazon EC2 instances or Amazon
ECS clusters, you can configure automatic scaling of these to occur based on usage metrics that
correspond to demand for your workload. For Amazon EC2, average CPU utilization, load balancer
request count, or network bandwidth can be used to scale out (or scale in) EC2 instances. For
Amazon ECS, average CPU utilization, load balancer request count, and memory utilization can be
used to scale out (or scale in) ECS tasks. Using Target Auto Scaling on AWS, the autoscaler acts like
a household thermostat, adding or removing resources to maintain the target value (for example,
70% CPU utilization) that you specify.
Amazon EC2 Auto Scaling can also do Predictive Auto Scaling, which uses machine learning to
analyze each resource's historical workload and regularly forecasts the future load.
Little’s Law helps calculate how many instances of compute (EC2 instances, concurrent Lambda
functions, and so on) you need.
L = λW
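Here, L is the number of requests in the system (the mean concurrency), λ is the mean rate at which
requests arrive (for example, requests per second), and W is the mean time each request spends in
the system (for example, seconds). As a worked example, if a workload receives λ = 100 requests per
second and each request spends W = 0.5 seconds in the system on average, then L = 100 × 0.5 = 50
requests are in flight at any given time, so you need capacity for at least 50 concurrent executions,
plus headroom.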
It’s important to perform sustained load testing. Load tests should discover the breaking point
and test the performance of your workload. AWS makes it easy to set up temporary testing
environments that model the scale of your production workload. In the cloud, you can create a
production-scale test environment on demand, complete your testing, and then decommission the
resources. Because you only pay for the test environment when it's running, you can simulate your
live environment for a fraction of the cost of testing on premises.
Load testing in production should also be considered as part of game days where the production
system is stressed, during hours of lower customer usage, with all personnel on hand to interpret
results and address any problems that arise.
Common anti-patterns:
• Performing load testing on deployments that are not the same configuration as your production.
• Performing load testing only on individual pieces of your workload, and not on the entire
workload.
• Performing load testing with a subset of requests and not a representative set of actual requests.
Benefits of establishing this best practice: You know which components in your architecture fail
under load, and you can identify which metrics to watch so you can detect that you are approaching
that load in time to address the problem and prevent the impact of that failure.
Implementation guidance
• Perform load testing to identify which aspect of your workload indicates that you must add or
remove capacity. Load testing should have representative traffic similar to what you receive in
production. Increase the load while watching the metrics you have instrumented to determine
which metric indicates when you must add or remove resources.
• Identify the mix of requests. You may have varied mixes of requests, so you should look at
various time frames when identifying the mix of traffic.
• Implement a load driver. You can use custom code, open source, or commercial software to
implement a load driver.
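As an illustration only, a crude shell-based load driver can be sketched as below (the endpoint URL
is hypothetical; for sustained, representative load tests prefer a purpose-built load testing tool, and
never point such a loop at a production system without planning):

  URL="https://example.com/api/health"   # hypothetical endpoint
  CONCURRENCY=50
  REQUESTS_PER_WORKER=200

  for i in $(seq 1 "$CONCURRENCY"); do
    (
      for j in $(seq 1 "$REQUESTS_PER_WORKER"); do
        # Record HTTP status and total request time for later analysis
        curl -s -o /dev/null -w "%{http_code} %{time_total}\n" "$URL"
      done
    ) &
  done
  wait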
For example, put processes in place to ensure rollback safety during deployments. Ensuring that
you can roll back a deployment without any disruption for your customers is critical in making a
service reliable.
For runbook procedures, start with a valid effective manual process, implement it in code, and
invoke it to automatically run where appropriate.
Even for sophisticated workloads that are highly automated, runbooks are still useful for running
game days or meeting rigorous reporting and auditing requirements.
Note that playbooks are used in response to specific incidents, and runbooks are used to achieve
specific outcomes. Often, runbooks are for routine activities, while playbooks are used for
responding to non-routine events.
Common anti-patterns:
Benefits of establishing this best practice: Effective change planning increases your ability to
successfully run the change because you are aware of all the systems impacted. Validating your
change in test environments increases your confidence.
Implementation guidance
components such as user interfaces, APIs, databases, and source code. When you examine these
components of the system, functional tests verify that each feature behaves as expected, which
protects both user expectations and the software's integrity. Integrate functional tests as part of
your regular deployment, and use automation to deploy all changes, which reduces the potential
for introduction of human errors.
Implementation guidance
Integrate functional testing as part of your deployment. Functional tests are run as part of
automated deployment. If success criteria are not met, the pipeline is halted or rolled back. AWS
CodePipeline provides a continuous delivery pipeline for automated testing, which allows testers
to automate the entire testing and deployment process. It integrates with AWS services such as
AWS CodeBuild and AWS CodeDeploy to automate the build, test, and deployment phases of the
software development lifecycle.
Implementation steps
• Configure your pipeline: Set up your source, build, test, and deploy stages using the AWS
CodePipeline console or AWS Command Line Interface (CLI).
• Define your source: With AWS CodePipeline, you can automatically retrieve source code from
version control systems like GitHub, AWS CodeCommit, or Bitbucket, which verifies that the
latest code is always used for testing.
• Automate builds and tests: AWS CodeBuild can automatically build and test your code and
generate test reports. It supports popular testing frameworks like JUnit, NUnit, and TestNG.
• Deploy your code: Once the code has been built and tested, AWS CodeDeploy can deploy it
to your testing environment, including Amazon EC2 instances, AWS Lambda functions, or on-
premises servers.
• Monitor pipelines: AWS CodePipeline can track the progress of your pipeline and the status of
each stage. You can use quality checks to block the pipeline based on test execution status, and
you can receive notifications for any pipeline stage failure or for pipeline completion.
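Beyond the stages above, a pipeline run can also be started on demand from the AWS CLI, for
example from other automation (the pipeline name is hypothetical):

  aws codepipeline start-pipeline-execution --name my-release-pipeline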
Resources
Related documents:
• Use AWS CodePipeline with AWS CodeBuild to test code and run builds
• Include updates to your disaster recovery plans and standard operating procedures (SOPs) with
any significant deployment.
• Integrate reliability testing into your automated deployment pipelines. Services such as AWS
Resilience Hub can be integrated into your CI/CD pipeline to establish continuous resilience
assessments that are automatically evaluated as part of every deployment.
• Define your applications in AWS Resilience Hub. Resilience assessments generate code snippets
that help you create recovery procedures as AWS Systems Manager documents for your
applications and provide a list of recommended Amazon CloudWatch monitors and alarms.
• Once your DR plans and SOPs are updated, complete disaster recovery testing to verify that they
are effective. Disaster recovery testing helps you determine if you can restore your system after
an event and return to normal operations. You can simulate various disaster recovery strategies
and identify whether your planning is sufficient to meet your uptime requirements. Common
disaster recovery strategies include backup and restore, pilot light, cold standby, warm standby,
hot standby, and active-active, and they all differ in cost and complexity. Before disaster recovery
testing, we recommend that you define your recovery time objective (RTO) and recovery point
objective (RPO) to simplify the choice of strategy to simulate. AWS offers disaster recovery tools
like AWS Elastic Disaster Recovery to help you get started with your planning and testing.
• Chaos engineering experiments introduce disruptions to the system, such as network outages
and service failures. By simulating with controlled failures, you can discover your system's
vulnerabilities while containing the impacts of the injected failures. Just like the other strategies,
run controlled failure simulations in non-production environments using services like AWS Fault
Injection Service to gain confidence before deploying in production.
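For example, once an AWS FIS experiment template has been defined, an experiment run can be
started from the AWS CLI (the template ID below is a placeholder):

  aws fis start-experiment \
    --experiment-template-id EXTabc123EXAMPLE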
Resources
Related documents:
Related videos:
• Safer deployments with fast rollback and recovery processes: Deployments are safer because
the previous working version is not changed. You can roll back to it if errors are detected.
• Enhanced security posture: By not allowing changes to infrastructure, remote access
mechanisms (such as SSH) can be disabled. This reduces the attack vector, improving your
organization's security posture.
Implementation guidance
Automation
With Infrastructure as code (IaC), infrastructure provisioning, orchestration, and deployment steps
are defined in a programmatic, descriptive, and declarative way and stored in a source control
system. Leveraging infrastructure as code makes it simpler to automate infrastructure deployment
and helps achieve infrastructure immutability.
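For example, a template kept in source control can be deployed (and redeployed idempotently) with a
single AWS CLI command; the template file and stack name here are placeholders:

  aws cloudformation deploy \
    --template-file template.yaml \
    --stack-name my-workload \
    --capabilities CAPABILITY_IAM \
    --no-fail-on-empty-changeset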
Deployment patterns
When a change in the workload is required, the immutable infrastructure deployment strategy
mandates that a new set of infrastructure resources is deployed, including all necessary changes.
It is important for this new set of resources to follow a rollout pattern that minimizes user impact.
There are two main strategies for this deployment:
Canary deployment: The practice of directing a small number of your customers to the new
version, usually running on a single service instance (the canary). You then deeply scrutinize any
behavior changes or errors that are generated. You can remove traffic from the canary if you
encounter critical problems and send the users back to the previous version. If the deployment
is successful, you can continue to deploy at your desired velocity, while monitoring the changes
for errors, until you are fully deployed. AWS CodeDeploy can be configured with a deployment
configuration that allows a canary deployment.
Blue/green deployment: Similar to the canary deployment, except that a full fleet of the
application is deployed in parallel. You alternate your deployments across the two stacks (blue
and green). Once again, you can send traffic to the new version, and fall back to the old version
maintenance, validation, sharing, and deployment of customized, secure, and up-to-date Linux
or Windows custom AMI.
• Some of the services that support automation are:
• AWS Elastic Beanstalk is a service to rapidly deploy and scale web applications developed
with Java, .NET, PHP, Node.js, Python, Ruby, Go, and Docker on familiar servers such as
Apache, NGINX, Passenger, and IIS.
• AWS Proton helps platform teams connect and coordinate all the different tools your
development teams need for infrastructure provisioning, code deployments, monitoring,
and updates. AWS Proton enables automated infrastructure as code provisioning and
deployment of serverless and container-based applications.
• AWS CloudFormation helps developers create AWS resources in an orderly and predictable
fashion. Resources are written in text files using JSON or YAML format. The templates
require a specific syntax and structure that depends on the types of resources being created
and managed. You author your resources in JSON or YAML with any code editor such as AWS
Cloud9, check them into a version control system, and then CloudFormation builds the specified
services in a safe, repeatable manner.
• AWS Serverless Application Model (AWS SAM) is an open-source framework that you can use
to build serverless applications on AWS. AWS SAM integrates with other AWS services, and is
an extension of AWS CloudFormation.
• AWS Cloud Development Kit (AWS CDK) is an open-source software development framework
to model and provision your cloud application resources using familiar programming
languages. You can use AWS CDK to model application infrastructure using TypeScript,
Python, Java, and .NET. AWS CDK uses AWS CloudFormation in the background to provision
resources in a safe, repeatable manner.
• AWS Cloud Control API introduces a common set of Create, Read, Update, Delete, and
List (CRUDL) APIs to help developers manage their cloud infrastructure in an easy and
consistent way. The Cloud Control API common APIs allow developers to uniformly manage
the lifecycle of AWS and third-party services.
• Canary deployments:
Common anti-patterns:
Benefits of establishing this best practice: When you use automation to deploy all changes, you
remove the potential for introduction of human error and provide the ability to test before you
change production. Performing this process prior to the production push verifies that your plans are
complete. Additionally, building automatic rollback into your release process can identify production
issues and return your workload to its previous working operational state.
Implementation guidance
Automate your deployment pipeline. Deployment pipelines allow you to invoke automated testing
and detection of anomalies, and either halt the pipeline at a certain step before production
deployment, or automatically roll back a change. An integral part of this is the adoption of the
culture of continuous integration and continuous delivery/deployment (CI/CD), where a commit
or code change passes through various automated stage gates from build and test stages to
deployment on production environments.
Although conventional wisdom suggests that you keep people in the loop for the most difficult
operational procedures, we suggest that you automate the most difficult procedures for that very
reason.
Implementation steps
You can automate deployments to remove manual operations by following these steps:
• Set up a code repository to store your code securely: Use AWS CodeCommit to create a secure
Git-based repository.
• Configure a continuous integration service to compile your source code, run tests, and create
deployment artifacts: To set up a build project for this purpose, see Getting started with AWS
CodeBuild using the console.
• Set up a deployment service that automates application deployments and handles the
complexity of application updates without reliance on error-prone manual deployments:
Failure management
Questions
• REL 9. How do you back up data?
• REL 10. How do you use fault isolation to protect your workload?
• REL 11. How do you design your workload to withstand component failures?
• REL 12. How do you test reliability?
• REL 13. How do you plan for disaster recovery (DR)?
Back up data, applications, and configuration to meet your requirements for recovery time
objectives (RTO) and recovery point objectives (RPO).
Best practices
• REL09-BP01 Identify and back up all data that needs to be backed up, or reproduce the data
from sources
• REL09-BP02 Secure and encrypt backups
• REL09-BP03 Perform data backup automatically
• REL09-BP04 Perform periodic recovery of the data to verify backup integrity and processes
REL09-BP01 Identify and back up all data that needs to be backed up, or reproduce the data
from sources
Understand and use the backup capabilities of the data services and resources used by the
workload. Most services provide capabilities to back up workload data.
Desired outcome: Data sources have been identified and classified based on criticality, and a
strategy for data recovery has been established based on the RPO. This strategy involves either
backing up these data sources or having the ability to reproduce data from other sources. In the
case of data loss, the strategy implemented allows recovery or reproduction of data within the
defined RPO and RTO.
backup. As another example, if you work with Amazon EMR, it might not be necessary to back up
your HDFS data store, as long as you can reproduce the data into Amazon EMR from Amazon S3.
When selecting a backup strategy, consider the time it takes to recover data. The time needed to
recover data depends on the type of backup (in the case of a backup strategy), or the complexity of
the data reproduction mechanism. This time should fall within the RTO for the workload.
Implementation steps
1. Identify all data sources for the workload. Data can be stored on a number of resources such
as databases, volumes, filesystems, logging systems, and object storage. Refer to the Resources
section to find Related documents on different AWS services where data is stored, and the
backup capability these services provide.
2. Classify data sources based on criticality. Different data sets will have different levels of
criticality for a workload, and therefore different requirements for resiliency. For example, some
data might be critical and require an RPO near zero, while other data might be less critical and
can tolerate a higher RPO and some data loss. Similarly, different data sets might have different
RTO requirements as well.
3. Use AWS or third-party services to create backups of the data. AWS Backup is a managed
service that allows creating backups of various data sources on AWS. AWS Elastic Disaster
Recovery handles automated sub-second data replication to an AWS Region. Most AWS services
also have native capabilities to create backups. The AWS Marketplace has many solutions that
provide these capabilities as well. Refer to the Resources listed below for information on how to
create backups of data from various AWS services.
4. For data that is not backed up, establish a data reproduction mechanism. You might choose
not to back up data that can be reproduced from other sources for various reasons. There might
be a situation where it is cheaper to reproduce data from sources when needed rather than
creating a backup as there may be a cost associated with storing backups. Another example is
where restoring from a backup takes longer than reproducing the data from sources, resulting
in a breach in RTO. In such situations, consider tradeoffs and establish a well-defined process
for how data can be reproduced from these sources when data recovery is necessary. For
example, if you have loaded data from Amazon S3 to a data warehouse (like Amazon Redshift),
or MapReduce cluster (like Amazon EMR) to do analysis on that data, this may be an example
of data that can be reproduced from other sources. As long as the results of these analyses are
either stored somewhere or reproducible, you would not suffer a data loss from a failure in the
data warehouse or MapReduce cluster. Other examples that can be reproduced from sources
include caches (like Amazon ElastiCache) or RDS read replicas.
Related videos:
• AWS re:Invent 2021 - Backup, disaster recovery, and ransomware protection with AWS
• AWS Backup Demo: Cross-Account and Cross-Region Backup
• AWS re:Invent 2019: Deep dive on AWS Backup, ft. Rackspace (STG341)
Related examples:
Control and detect access to backups using authentication and authorization. Prevent and detect if
data integrity of backups is compromised using encryption.
Common anti-patterns:
• Having the same access to the backups and restoration automation as you do to the data.
• Not encrypting your backups.
Benefits of establishing this best practice: Securing your backups prevents tampering with the
data, and encryption of the data prevents access to that data if it is accidentally exposed.
Implementation guidance
Control and detect access to backups using authentication and authorization, such as AWS Identity
and Access Management (IAM). Prevent and detect if data integrity of backups is compromised
using encryption.
Amazon S3 supports several methods of encryption of your data at rest. Using server-side
encryption, Amazon S3 accepts your objects as unencrypted data, and then encrypts them as they
are stored. Using client-side encryption, your workload application is responsible for encrypting the
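For example, default encryption at rest can be enforced on a bucket that stores backups with the
AWS CLI (the bucket name and key alias are hypothetical):

  aws s3api put-bucket-encryption \
    --bucket my-backup-bucket \
    --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"aws:kms","KMSMasterKeyID":"alias/backup-key"}}]}'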
Related examples:
Desired outcome: An automated process that creates backups of data sources at an established
cadence.
Common anti-patterns:
Benefits of establishing this best practice: Automating backups verifies that they are taken
regularly based on your RPO, and alerts you if they are not taken.
4. For data sources not supported by an automated backup solution or managed service such as
on-premises data sources or message queues, consider using a trusted third-party solution to
create automated backups. Alternatively, you can create automation to do this using the AWS
CLI or SDKs. You can use AWS Lambda Functions or AWS Step Functions to define the logic
involved in creating a data backup, and use Amazon EventBridge to invoke it at a frequency
based on your RPO.
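A minimal sketch of the scheduling piece with the AWS CLI (the rule name and Lambda function ARN
are hypothetical; you would also grant EventBridge permission to invoke the function, for example
with aws lambda add-permission):

  # Create a schedule aligned to the RPO (here, once per hour)
  aws events put-rule \
    --name hourly-backup \
    --schedule-expression "rate(1 hour)"

  # Point the rule at the function that performs the backup
  aws events put-targets \
    --rule hourly-backup \
    --targets 'Id=backup-function,Arn=arn:aws:lambda:us-east-1:123456789012:function:create-backup'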
Resources
Related documents:
Related videos:
• AWS re:Invent 2019: Deep dive on AWS Backup, ft. Rackspace (STG341)
Related examples:
Using AWS, you can stand up a testing environment and restore your backups to assess RTO and
RPO capabilities, and run tests on data content and integrity.
Additionally, Amazon RDS and Amazon DynamoDB allow point-in-time recovery (PITR). Using
continuous backup, you can restore your dataset to the state it was in at a specified date and time.
Such tests can verify whether all the data is available, not corrupted, and accessible, and whether
any data loss falls within the RPO for the workload. Such tests can also help ascertain if recovery
mechanisms are fast enough to accommodate the workload's RTO.
AWS Elastic Disaster Recovery offers continual point-in-time recovery snapshots of Amazon EBS
volumes. As source servers are replicated, point-in-time states are chronicled over time based on
the configured policy. Elastic Disaster Recovery helps you verify the integrity of these snapshots by
launching instances for test and drill purposes without redirecting the traffic.
Implementation steps
1. Identify data sources that are currently being backed up and where these backups are being
stored. For implementation guidance, see REL09-BP01 Identify and back up all data that needs
to be backed up, or reproduce the data from sources.
2. Establish criteria for data validation for each data source. Different types of data will have
different properties which might require different validation mechanisms. Consider how this
data might be validated before you are confident to use it in production. Some common ways to
validate data are using data and backup properties such as data type, format, checksum, size, or
a combination of these with custom validation logic. For example, this might be a comparison of
the checksum values between the restored resource and the data source at the time the backup
was created.
3. Establish RTO and RPO for restoring the data based on data criticality. For implementation
guidance, see REL13-BP01 Define recovery objectives for downtime and data loss.
4. Assess your recovery capability. Review your backup and restore strategy to understand if
it can meet your RTO and RPO, and adjust the strategy as necessary. Using AWS Resilience
Hub, you can run an assessment of your workload. The assessment evaluates your application
configuration against the resiliency policy and reports if your RTO and RPO targets can be met.
5. Do a test restore using currently established processes used in production for data restoration.
These processes depend on how the original data source was backed up, the format and storage
location of the backup itself, or if the data is reproduced from other sources. For example, if
you are using a managed service such as AWS Backup, this might be as simple as restoring the
Level of effort for the Implementation Plan: Moderate to high depending on the complexity of
the validation criteria.
Resources
Related documents:
Related examples:
independent data centers) can be treated as a single logical deployment target for your workload,
including the ability to synchronously replicate data (for example, between databases). This allows
you to use Availability Zones in an active/active or active/standby configuration.
Availability Zones are independent, and therefore workload availability is increased when the
workload is architected to use multiple zones. Some AWS services (including the Amazon EC2
instance data plane) are deployed as strictly zonal services where they have shared fate with the
Availability Zone they are in. Amazon EC2 instances in the other AZs will however be unaffected
and continue to function. Similarly, if a failure in an Availability Zone causes an Amazon Aurora
database to fail, a read-replica Aurora instance in an unaffected AZ can be automatically promoted
to primary. Regional AWS services, such as Amazon DynamoDB on the other hand internally use
multiple Availability Zones in an active/active configuration to achieve the availability design goals
for that service, without you needing to configure AZ placement.
Figure 9: Multi-tier architecture deployed across three Availability Zones. Note that Amazon S3 and
Amazon DynamoDB are always Multi-AZ automatically. The ELB also is deployed to all three zones.
While AWS control planes typically provide the ability to manage resources within the entire
Region (multiple Availability Zones), certain control planes (including Amazon EC2 and Amazon
EBS) have the ability to filter results to a single Availability Zone. When this is done, the request
is processed only in the specified Availability Zone, reducing exposure to disruption in other
Availability Zones. This AWS CLI example illustrates getting Amazon EC2 instance information from
only the us-east-2c Availability Zone:
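(The example below is a representative form of that command.)

  aws ec2 describe-instances \
    --filters Name=availability-zone,Values=us-east-2c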
(Amazon S3) Replication, Amazon RDS Read Replicas (including Aurora Read Replicas), and Amazon
DynamoDB Global Tables. With continuous replication, versions of your data are available for near
immediate use in each of your active Regions.
Using AWS CloudFormation, you can define your infrastructure and deploy it consistently
across AWS accounts and across AWS Regions. And AWS CloudFormation StackSets extends this
functionality by allowing you to create, update, or delete AWS CloudFormation stacks across
multiple accounts and regions with a single operation. For Amazon EC2 instance deployments, an
AMI (Amazon Machine Image) is used to supply information such as hardware configuration and
installed software. You can implement an Amazon EC2 Image Builder pipeline that creates the
AMIs you need and copies them to your active Regions. This ensures that these golden AMIs have
everything you need to deploy and scale out your workload in each new Region.
To route traffic, both Amazon Route 53 and AWS Global Accelerator permit the definition of
policies that determine which users go to which active regional endpoint. With Global Accelerator
you set a traffic dial to control the percentage of traffic that is directed to each application
endpoint. Route 53 supports this percentage approach, and also multiple other available policies
including geoproximity and latency based ones. Global Accelerator automatically leverages the
extensive network of AWS edge servers, to onboard traffic to the AWS network backbone as soon
as possible, resulting in lower request latencies.
All of these capabilities operate so as to preserve each Region’s autonomy. There are very few
exceptions to this approach, including our services that provide global edge delivery (such as
Amazon CloudFront and Amazon Route 53), along with the control plane for the AWS Identity and
Access Management (IAM) service. Most services operate entirely within a single Region.
For workloads that run in an on-premises data center, architect a hybrid experience when possible.
AWS Direct Connect provides a dedicated network connection from your premises to AWS allowing
you to run in both.
Another option is to run AWS infrastructure and services on premises using AWS Outposts. AWS
Outposts is a fully managed service that extends AWS infrastructure, AWS services, APIs, and tools
to your data center. The same hardware infrastructure used in the AWS Cloud is installed in your
data center. AWS Outposts are then connected to the nearest AWS Region. You can then use AWS
Outposts to support your workloads that have low latency or local data processing requirements.
• Determine if AWS Local Zones helps you provide service to your users. If you have low-latency
requirements, see if AWS Local Zones is located near your users. If yes, then use it to deploy
workloads closer to those users.
Resources
Related documents:
Related videos:
• AWS re:Invent 2018: Architecture Patterns for Multi-Region Active-Active Applications (ARC209-
R2)
• AWS re:Invent 2019: Innovation and operation of the AWS global network infrastructure
(NET339)
Desired outcome: For high availability, always (when possible) deploy your workload components
to multiple Availability Zones (AZs). For workloads with extreme resilience requirements, carefully
evaluate the options for a multi-Region architecture.
Implementation guidance
For a disaster event based on disruption or partial loss of one Availability Zone, implementing
a highly available workload in multiple Availability Zones within a single AWS Region helps
mitigate against natural and technical disasters. Each AWS Region is comprised of multiple
Availability Zones, each isolated from faults in the other zones and separated by a meaningful
distance. However, for a disaster event that includes the risk of losing multiple Availability Zone
components, which are a significant distance away from each other, you should implement
disaster recovery options to mitigate against failures of a Region-wide scope. For workloads that
require extreme resilience (critical infrastructure, health-related applications, financial system
infrastructure, etc.), a multi-Region strategy may be required.
Implementation Steps
1. Evaluate your workload and determine whether the resilience needs can be met by a multi-
AZ approach (single AWS Region), or if they require a multi-Region approach. Implementing a
multi-Region architecture to satisfy these requirements will introduce additional complexity,
therefore carefully consider your use case and its requirements. Resilience requirements can
almost always be met using a single AWS Region. Consider the following possible requirements
when determining whether you need to use multiple Regions:
a. Disaster recovery (DR): For a disaster event based on disruption or partial loss of one
Availability Zone, implementing a highly available workload in multiple Availability Zones
within a single AWS Region helps mitigate against natural and technical disasters. For a
disaster event that includes the risk of losing multiple Availability Zone components, which
are a significant distance away from each other, you should implement disaster recovery
across multiple Regions to mitigate against natural disasters or technical failures of a Region-
wide scope.
b. High availability (HA): A multi-Region architecture (using multiple AZs in each Region) can be
used to achieve greater than four 9’s (> 99.99%) availability.
c. Stack localization: When deploying a workload to a global audience, you can deploy localized
stacks in different AWS Regions to serve audiences in those Regions. Localization can include
language, currency, and types of data stored.
d. Proximity to users: When deploying a workload to a global audience, you can reduce latency
by deploying stacks in AWS Regions close to where the end users are.
e. Data residency: Some workloads are subject to data residency requirements, where data
from certain users must remain within a specific country’s borders. Based on the regulation in
i. Endpoints for standard accelerators in AWS Global Accelerator - AWS Global Accelerator
(amazon.com)
d. For applications that leverage Amazon EventBridge, consider cross-Region buses to forward
events to other Regions you select.
i. Sending and receiving Amazon EventBridge events between AWS Regions
e. For Amazon Aurora databases, consider Aurora global databases, which span multiple AWS
regions. Existing clusters can be modified to add new Regions as well.
i. Getting started with Amazon Aurora global databases
f. If your workload includes AWS Key Management Service (AWS KMS) encryption keys, consider
whether multi-Region keys are appropriate for your application.
i. Multi-Region keys in AWS KMS
g. For other AWS service features, see this blog series on Creating a Multi-Region Application
with AWS Services series
Resources
Related documents:
Related videos:
• AWS re:Invent 2018: Architecture Patterns for Multi-Region Active-Active Applications (ARC209-
R2)
• Auth0: Multi-Region High-Availability Architecture that Scales to 1.5B+ Logins a Month with
automated failover
For stateful server-based workloads deployed to an on-premises data center, you can use AWS
Elastic Disaster Recovery to protect your workloads in AWS. If you are already hosted in AWS, you
can use Elastic Disaster Recovery to protect your workload to an alternative Availability Zone or
Region. Elastic Disaster Recovery uses continual block-level replication to a lightweight staging
area to provide fast, reliable recovery of on-premises and cloud-based applications.
Implementation steps
1. Implement self-healing. Deploy your instances or containers using automatic scaling when
possible. If you cannot use automatic scaling, use automatic recovery for EC2 instances or
implement self-healing automation based on Amazon EC2 or ECS container lifecycle events.
• Use Amazon EC2 Auto Scaling groups for instances and container workloads that have no
requirements for a single instance IP address, private IP address, Elastic IP address, and
instance metadata.
• The launch template user data can be used to implement automation that can self-heal
most workloads.
• Use automatic recovery of Amazon EC2 instances for workloads that require a single instance
IP address, private IP address, Elastic IP address, and instance metadata.
• Automatic recovery will send recovery status alerts to an SNS topic when an instance failure is
detected.
• Use Amazon EC2 instance lifecycle events or Amazon ECS events to automate self-healing
where automatic scaling or EC2 recovery cannot be used.
• Use the events to invoke automation that will heal your component according to the
process logic you require.
• Protect stateful workloads that are limited to a single location using AWS Elastic Disaster
Recovery.
Resources
Related documents:
and Regions to provide fault isolation, but the concept of fault isolation can be extended to your
workload’s architecture as well.
The overall workload is partitioned into cells by a partition key. This key needs to align with the grain of
the service, or the natural way that a service's workload can be subdivided with minimal cross-cell
interactions. Examples of partition keys are customer ID, resource ID, or any other parameter easily
accessible in most API calls. A cell routing layer distributes requests to individual cells based on the
partition key and presents a single endpoint to clients.
Implementation steps
When designing a cell-based architecture, there are several design considerations to consider:
1. Partition key: Special consideration should be taken while choosing the partition key.
• It should align with the grain of the service, or the natural way that a service's workload can
be subdivided with minimal cross-cell interactions. Examples are customer ID or resource
ID.
• The partition key must be available in all requests, either directly or in a way that could be
easily inferred deterministically by other parameters.
6. Code deployment: A staggered code deployment strategy should be preferred over deploying
code changes to all cells at the same time.
• This helps minimize potential failure to multiple cells due to a bad deployment or human
error. For more detail, see Automating safe, hands-off deployment.
Resources
Related documents:
Related videos:
• AWS re:Invent 2018: Close Loops and Opening Minds: How to Take Control of Systems, Big and
Small
• AWS re:Invent 2018: How AWS Minimizes the Blast Radius of Failures (ARC338)
• Shuffle-sharding: AWS re:Invent 2019: Introducing The Amazon Builders’ Library (DOP328)
• AWS Summit ANZ 2021 - Everything fails, all the time: Designing for resilience
Related examples:
Benefits of establishing this best practice: Having appropriate monitoring at all layers allows you
to reduce recovery time by reducing time to detection.
Implementation guidance
Identify all workloads that will be reviewed for monitoring. Once you have identified all
components of the workload that need to be monitored, determine the monitoring interval. The
monitoring interval will have a direct impact on how fast recovery can be
initiated based on the time it takes to detect a failure. The mean time to detection (MTTD) is the
amount of time between a failure occurring and when repair operations begin. The list of services
should be extensive and complete.
Monitoring must cover all layers of the application stack including application, platform,
infrastructure, and network.
Your monitoring strategy should consider the impact of gray failures. For more detail on gray
failures, see Gray failures in the Advanced Multi-AZ Resilience Patterns whitepaper.
Implementation steps
• Your monitoring interval is dependent on how quickly you must recover. Your recovery time
is driven by the time it takes to recover, so you must determine the frequency of collection by
accounting for this time and your recovery time objective (RTO).
• Configure detailed monitoring for components and managed services.
• Determine if detailed monitoring for EC2 instances and Auto Scaling is necessary. Detailed
monitoring provides one minute interval metrics, and default monitoring provides five minute
interval metrics.
• Determine if enhanced monitoring for RDS is necessary. Enhanced monitoring uses an agent
on RDS instances to get useful information about different processes or threads.
• Determine the monitoring requirements of critical serverless components for Lambda, API
Gateway, Amazon EKS, Amazon ECS, and all types of load balancers.
• Determine the monitoring requirements of storage components for Amazon S3, Amazon FSx,
Amazon EFS, and Amazon EBS.
Related videos:
Related examples:
• Well-Architected Lab: Level 300: Implementing Health Checks and Managing Dependencies to
Improve Reliability
• One Observability Workshop: Explore X-Ray
Related tools:
• CloudWatch
• AWS X-Ray
If a resource failure occurs, healthy resources should continue to serve requests. For location
impairments (such as Availability Zone or AWS Region), ensure that you have systems in place to
fail over to healthy resources in unimpaired locations.
When designing a service, distribute load across resources, Availability Zones, or Regions.
Therefore, failure of an individual resource or impairment can be mitigated by shifting traffic to
the remaining healthy resources.
Implementation guidance
AWS services, such as Elastic Load Balancing and Amazon EC2 Auto Scaling, help distribute load
across resources and Availability Zones. Therefore, failure of an individual resource (such as an EC2
instance) or impairment of an Availability Zone can be mitigated by shifting traffic to remaining
healthy resources.
For multi-Region workloads, designs are more complicated. For example, cross-Region read replicas
allow you to deploy your data to multiple AWS Regions. However, failover is still required to
promote the read replica to primary and then point your traffic to the new endpoint. Amazon
Route 53, Amazon Application Recovery Controller (ARC), Amazon CloudFront, and AWS Global
Accelerator can help route traffic across AWS Regions.
AWS services, such as Amazon S3, Lambda, API Gateway, Amazon SQS, Amazon SNS, Amazon SES,
Amazon Pinpoint, Amazon ECR, AWS Certificate Manager, EventBridge, or Amazon DynamoDB, are
automatically deployed to multiple Availability Zones by AWS. In case of failure, these AWS services
automatically route traffic to healthy locations. Data is redundantly stored in multiple Availability
Zones and remains available.
For Amazon RDS, Amazon Aurora, Amazon Redshift, Amazon EKS, or Amazon ECS, Multi-AZ is
a configuration option. AWS can direct traffic to the healthy instance if failover is initiated. This
failover action may be taken by AWS or as required by the customer.
For Amazon EC2 instances, Amazon Redshift, Amazon ECS tasks, or Amazon EKS pods, you choose
which Availability Zones to deploy to. For some designs, Elastic Load Balancing provides the
solution to detect instances in unhealthy zones and route traffic to the healthy ones. Elastic Load
Balancing can also route traffic to components in your on-premises data center.
For Multi-Region traffic failover, rerouting can leverage Amazon Route 53, Amazon Application
Recovery Controller, AWS Global Accelerator, Route 53 Private DNS for VPCs, or CloudFront to
provide a way to define internet domains and assign routing policies, including health checks, to
route traffic to healthy Regions. AWS Global Accelerator provides static IP addresses that act as a
fixed entry point to your application, then route to endpoints in AWS Regions of your choosing,
using the AWS global network instead of the internet for better performance and reliability.
Implementation steps
• Create failover designs for all appropriate applications and services. Isolate each architecture
component and create failover designs meeting RTO and RPO for each component.
Related examples:
For self-managed applications and cross-Region healing, recovery designs and automated healing
processes can be pulled from existing best practices.
The ability to restart or remove a resource is an important tool to remediate failures. A best
practice is to make services stateless where possible. This prevents loss of data or availability
on resource restart. In the cloud, you can (and generally should) replace the entire resource (for
example, a compute instance or serverless function) as part of the restart. The restart itself is a
simple and reliable way to recover from failure. Many different types of failures occur in workloads.
Failures can occur in hardware, software, communications, and operations.
reduced capacity while it's recovering a new node. Example services are Mongo, DynamoDB
Accelerator, Amazon Redshift, Amazon EMR, Cassandra, Kafka, MSK-EC2, Couchbase, ELK, and
Amazon OpenSearch Service. Many of these services can be designed with additional auto healing
features. Some cluster technologies must generate an alert upon the loss of a node, triggering an
automated or manual workflow to recreate a new node. This workflow can be automated using
AWS Systems Manager to remediate issues quickly.
Amazon EventBridge can be used to monitor and filter for events such as CloudWatch alarms
or changes in state in other AWS services. Based on event information, it can then invoke AWS
Lambda, Systems Manager Automation, or other targets to run custom remediation logic on your
workload. Amazon EC2 Auto Scaling can be configured to check for EC2 instance health. If the
instance is in any state other than running, or if the system status is impaired, Amazon EC2 Auto
Scaling considers the instance to be unhealthy and launches a replacement instance. For large-
scale replacements (such as the loss of an entire Availability Zone), static stability is preferred for
high availability.
Implementation steps
• Use Auto Scaling groups to deploy tiers in a workload. Auto Scaling can perform self-healing on
stateless applications and add or remove capacity.
• For compute instances noted previously, use load balancing and choose the appropriate type of
load balancer.
• Consider healing for Amazon RDS. With standby instances, configure for automatic failover to the
standby instance. For Amazon RDS read replicas, an automated workflow is required to promote a
read replica to primary.
• Implement automatic recovery on EC2 instances that host applications which cannot be deployed
in multiple locations and that can tolerate rebooting upon failures (see the example after this
list). Automatic recovery can be used to replace failed hardware and restart the instance when the
application is not capable of being deployed in multiple locations. The instance metadata and
associated IP addresses are kept, as well as the EBS volumes and mount points to Amazon Elastic
File System or File Systems for Lustre and Windows. Using AWS OpsWorks, you can configure
automatic healing of EC2 instances at the layer level.
• Implement automated recovery using AWS Step Functions and AWS Lambda when you cannot
use automatic scaling or automatic recovery, or when automatic recovery fails. In these cases, you
can automate healing using AWS Step Functions and AWS Lambda.
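As an example of the automatic recovery approach referenced above, a CloudWatch alarm can recover
an instance when the system status check fails (the instance ID is a placeholder, and the recover
action ARN uses the Region of the instance):

  aws cloudwatch put-metric-alarm \
    --alarm-name recover-web-01 \
    --namespace AWS/EC2 \
    --metric-name StatusCheckFailed_System \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --statistic Minimum \
    --period 60 \
    --evaluation-periods 2 \
    --threshold 1 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --alarm-actions arn:aws:automate:us-east-1:ec2:recover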
Related tools:
• CloudWatch
• AWS X-Ray
REL11-BP04 Rely on the data plane and not the control plane during recovery
Control planes provide the administrative APIs used to create, read and describe, update,
delete, and list (CRUDL) resources, while data planes handle day-to-day service traffic. When
implementing recovery or mitigation responses to potentially resiliency-impacting events, focus on
using a minimal number of control plane operations to recover, rescale, restore, heal, or failover the
service. Data plane action should supersede any activity during these degradation events.
For example, the following are all control plane actions: launching a new compute instance,
creating block storage, and describing queue services. When you launch compute instances, the
control plane has to perform multiple tasks like finding a physical host with capacity, allocating
network interfaces, preparing local block storage volumes, generating credentials, and adding
security rules. Control planes tend to involve complicated orchestration.
Desired outcome: When a resource enters an impaired state, the system is capable of
automatically or manually recovering by shifting traffic from impaired to healthy resources.
Common anti-patterns:
• Relying on extensive, multi-service, multi-API control plane actions to remediate any category of
impairment.
Benefits of establishing this best practice: Increased success rate for automated remediation can
reduce your mean time to recovery and improve availability of the workload.
Implementation steps
For each workload that needs to be restored after a degradation event, evaluate the failover
runbook, high availability design, auto healing design, or HA resource restoration plan. Identify
each action that might be considered a control plane action.
• Auto Scaling (control plane) compared to pre-scaled Amazon EC2 resources (data plane)
• Amazon EC2 instance scaling (control plane) compared to AWS Lambda scaling (data plane)
• Assess any designs using Kubernetes and the nature of the control plane actions. Adding pods
is a data plane action in Kubernetes. Actions should be limited to adding pods and not adding
nodes. Using over-provisioned nodes is the preferred method to limit control plane actions.
Consider alternate approaches that allow for data plane actions to achieve the same remediation.
• Route 53 Record change (control plane) or Amazon Application Recovery Controller (data plane)
• Route 53 Health checks for more automated updates
Consider some services in a secondary Region, if the service is mission critical, to allow for more
control plane and data plane actions in an unaffected Region.
• Amazon EC2 Auto Scaling or Amazon EKS in a primary Region compared to Amazon EC2 Auto
Scaling or Amazon EKS in a secondary Region and routing traffic to secondary Region (control
plane action)
• Promote a read replica in the secondary Region to primary, compared to attempting the same
action in the primary Region (control plane action)
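As an illustration of favoring the data plane, the following sketch changes a routing control state through the Amazon Application Recovery Controller cluster data plane rather than editing Route 53 records (a control plane action). It assumes you already operate an ARC cluster; the endpoint URLs and routing control ARN are placeholders.

# A minimal sketch, assuming an existing Application Recovery Controller cluster.
# Routing control updates are data plane calls made against the cluster's
# regional endpoints; try each endpoint until one succeeds.
import boto3

CLUSTER_ENDPOINTS = [  # placeholder endpoints copied from your cluster details
    {"Endpoint": "https://example.route53-recovery-cluster.us-west-2.amazonaws.com/v1",
     "Region": "us-west-2"},
]
ROUTING_CONTROL_ARN = "arn:aws:route53-recovery-control::123456789012:controlpanel/example/routingcontrol/example"

def turn_traffic_off_for_impaired_cell():
    for endpoint in CLUSTER_ENDPOINTS:
        try:
            client = boto3.client(
                "route53-recovery-cluster",
                region_name=endpoint["Region"],
                endpoint_url=endpoint["Endpoint"],
            )
            client.update_routing_control_state(
                RoutingControlArn=ROUTING_CONTROL_ARN,
                RoutingControlState="Off",  # shift traffic away from the impaired cell
            )
            return endpoint["Region"]
        except Exception:
            continue  # try the next cluster endpoint
    raise RuntimeError("No cluster endpoint reachable")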
Resources
• Availability Definition
• REL11-BP01 Monitor all components of the workload to detect failures
Related documents:
• Amazon CloudWatch
• AWS X-Ray
Workloads should be statically stable and only operate in a single normal mode. Bimodal behavior
is when your workload exhibits different behavior under normal and failure modes.
For example, you might try to recover from an Availability Zone failure by launching new
instances in a different Availability Zone. This can result in a bimodal response during a failure
mode. You should instead build workloads that are statically stable and operate within only one
mode. In this example, those instances should have been provisioned in the second Availability
Zone before the failure. This static stability design verifies that the workload only operates in a
single mode.
Desired outcome: Workloads do not exhibit bimodal behavior during normal and failure modes.
Common anti-patterns:
Benefits of establishing this best practice: Workloads running with statically stable designs are
capable of having predictable outcomes during normal and failure events.
Implementation guidance
Bimodal behavior occurs when your workload exhibits different behavior under normal and failure
modes (for example, relying on launching new instances if an Availability Zone fails). In contrast, a
statically stable Amazon EC2 design provisions enough instances in each Availability Zone to
handle the workload's load if one AZ were removed. Elastic Load Balancing or Amazon Route 53
health checks would then shift load away from the impaired instances. After traffic has shifted, use
AWS Auto Scaling to asynchronously replace instances from the failed zone and launch them in the
healthy zones.
Another example of bimodal behavior is allowing clients to bypass your workload cache when
failures occur. This might seem to be a solution that accommodates client needs but it can
significantly change the demands on your workload and is likely to result in failures.
Assess critical workloads to determine what workloads require this type of resilience design. For
those that are deemed critical, each application component must be reviewed. Example types of
services that require static stability evaluations are:
• Storage: Amazon S3 (Single Zone), Amazon EFS (mounts), Amazon FSx (mounts)
Implementation steps
• Build systems that are statically stable and operate in only one mode. In this case, provision
enough instances in each Availability Zone or Region to handle the workload capacity if one
Availability Zone or Region were removed. A variety of services can be used for routing to
healthy resources, such as:
• Configure database read replicas to account for the loss of a single primary instance or a read
replica. If traffic is being served by read replicas, the quantity in each Availability Zone and each
Region should equate to the overall need in case of the zone or Region failure.
• Store critical data in Amazon S3 using a storage class that is designed to be statically stable in
case of an Availability Zone failure. If the Amazon S3 One Zone-IA storage class is used, this
should not be considered statically stable, as the loss of that zone prevents access to the stored
data.
• Load balancers are sometimes configured incorrectly or by design to service a specific Availability
Zone. In this case, the statically stable design might be to spread a workload across multiple
AZs in a more complex design. The original design may be used to reduce interzone traffic for
security, latency, or cost reasons.
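A quick way to assess static stability is to check how a workload's capacity is spread across Availability Zones. The following is a minimal sketch that counts running EC2 instances per zone for a hypothetical workload tag; interpret the result against the capacity needed if any one zone were removed.

# A minimal sketch (hypothetical tag values) that reports how a workload's
# running EC2 instances are distributed across Availability Zones.
import boto3
from collections import Counter

def az_distribution(tag_key="Workload", tag_value="payments"):
    ec2 = boto3.client("ec2")
    paginator = ec2.get_paginator("describe_instances")
    counts = Counter()
    for page in paginator.paginate(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    ):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                counts[instance["Placement"]["AvailabilityZone"]] += 1
    return counts

if __name__ == "__main__":
    print(az_distribution())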
can detect patterns of problems, including those addressed by auto healing, so that you can
resolve root cause issues.
Resilient systems are designed so that degradation events are immediately communicated to
the appropriate teams. These notifications should be sent through one or many communication
channels.
Desired outcome: Alerts are immediately sent to operations teams when thresholds are breached,
such as error rates, latency, or other critical key performance indicator (KPI) metrics, so that these
issues are resolved as soon as possible and user impact is avoided or minimized.
Common anti-patterns:
Benefits of establishing this best practice: Notifications of recovery make operational and
business teams aware of service degradations so that they can react immediately to minimize both
mean time to detect (MTTD) and mean time to repair (MTTR). Notifications of recovery events also
assure that you don't ignore problems that occur infrequently.
Level of risk exposed if this best practice is not established: Medium. Failure to implement
appropriate monitoring and events notification mechanisms can result in failure to detect patterns
of problems, including those addressed by auto healing. A team will only be made aware of system
degradation when users contact customer service or by chance.
Implementation guidance
When defining a monitoring strategy, a triggered alarm is a common event. This event would
likely contain an identifier for the alarm, the alarm state (such as IN ALARM or OK), and details
of what triggered it. In many cases, an alarm event should be detected and an email notification
sent. This is an example of an action on an alarm. Alarm notification is critical in observability,
as it informs the right people that there is an issue. However, when action on events mature in
Related tools:
• CloudWatch
• AWS X-Ray
REL11-BP07 Architect your product to meet availability targets and uptime service level
agreements (SLAs)
Architect your product to meet availability targets and uptime service level agreements (SLAs). If
you publish or privately agree to availability targets or uptime SLAs, verify that your architecture
and operational processes are designed to support them.
Desired outcome: Each application has a defined target for availability and SLA for performance
metrics, which can be monitored and maintained in order to meet business outcomes.
Common anti-patterns:
Benefits of establishing this best practice: Designing applications based on key resiliency targets
helps you meet business objectives and customer expectations. These objectives help drive the
application design process that evaluates different technologies and considers various tradeoffs.
Resources
Related documents:
Common anti-patterns:
• Planning to deploy a workload without knowing the processes to diagnose issues or respond to
incidents.
• Unplanned decisions about which systems to gather logs and metrics from when investigating an
event.
• Not retaining metrics and events long enough to be able to retrieve the data.
Benefits of establishing this best practice: Capturing playbooks ensures that processes can be
consistently followed. Codifying your playbooks limits the introduction of errors from manual
activity. Automating playbooks shortens the time to respond to an event by reducing the need
for team member intervention, or by providing team members with additional information when
their intervention begins.
Implementation guidance
• Use playbooks to identify issues. Playbooks are documented processes to investigate issues.
Allow consistent and prompt responses to failure scenarios by documenting processes in
playbooks. Playbooks must contain the information and guidance necessary for an adequately
skilled person to gather applicable information, identify potential sources of failure, isolate
faults, and determine contributing factors (perform post-incident analysis).
• Implement playbooks as code. Perform your operations as code by scripting your playbooks
to ensure consistency and reduce errors caused by manual processes. Playbooks can
be composed of multiple scripts representing the different steps that might be necessary to
identify the contributing factors to an issue. Runbook activities can be invoked or performed
as part of playbook activities, or a playbook might be run in response to identified
events.
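As an illustration of a playbook step implemented as code, the following sketch gathers recent error logs for a service so an investigation starts with data in hand. The log group name and filter pattern are placeholders for your workload.

# A minimal sketch of one playbook step: collect recent ERROR log events from a
# CloudWatch Logs log group (placeholder name) for the on-call engineer.
import time
import boto3

logs = boto3.client("logs")

def gather_recent_errors(log_group="/my-workload/app", minutes=15):
    now_ms = int(time.time() * 1000)
    response = logs.filter_log_events(
        logGroupName=log_group,
        startTime=now_ms - minutes * 60 * 1000,
        endTime=now_ms,
        filterPattern="ERROR",
        limit=100,
    )
    return [event["message"] for event in response.get("events", [])]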
• Focus on assigning blame rather than understanding the root cause, creating a culture of fear
and hindering open communication
• Failure to share insights, which keeps incident analysis findings within a small group and
prevents others from benefiting from the lessons learned
• No mechanism to capture institutional knowledge, thereby losing valuable insights by not
preserving the lessons-learned in the form of updated best practices and resulting in repeat
incidents with the same or similar root cause
Benefits of establishing this best practice: Conducting post-incident analysis and sharing
the results permits other workloads to mitigate the risk if they have implemented the same
contributing factors, and allows them to implement the mitigation or automated recovery before
an incident occurs.
Implementation guidance
Good post-incident analysis provides opportunities to propose common solutions for problems
with architecture patterns that are used in other places in your systems.
Encourage a culture that focuses on learning and improvement rather than assigning blame.
Emphasize that the goal is to prevent future incidents, not to penalize individuals.
Develop well-defined procedures for conducting post-incident analyses. These procedures should
outline the steps to be taken, the information to be collected, and the key questions to be
addressed during the analysis. Investigate incidents thoroughly, going beyond immediate causes to
identify root causes and contributing factors. Use techniques like the five whys to delve deep into
the underlying issues.
Maintain a repository of lessons learned from incident analyses. This institutional knowledge can
serve as a reference for future incidents and prevention efforts. Share findings and insights from
post-incident analyses, and consider holding open-invite post-incident review meetings to discuss
lessons learned.
Related videos:
Use techniques such as unit tests and integration tests that validate required functionality.
You achieve the best outcomes when these tests are run automatically as part of build and
deployment actions. For instance, using AWS CodePipeline, developers commit changes to a source
repository where CodePipeline automatically detects the changes. Those changes are built, and
tests are run. After the tests are complete, the built code is deployed to staging servers for testing.
From the staging server, CodePipeline runs more tests, such as integration or load tests. Upon
the successful completion of those tests, CodePipeline deploys the tested and approved code to
production instances.
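The following is a minimal sketch of the kind of unit test such a pipeline stage could run automatically on every commit (using pytest, which is an assumption; any test framework works). The order-total function is a hypothetical example of required functionality.

# A minimal sketch of a unit test a build stage could run on every commit.
# order_total is a hypothetical piece of "required functionality".
def order_total(prices, tax_rate=0.1):
    subtotal = sum(prices)
    return round(subtotal * (1 + tax_rate), 2)

def test_order_total_applies_tax():
    assert order_total([10.00, 5.00]) == 16.50

def test_order_total_empty_cart_is_zero():
    assert order_total([]) == 0.0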
Additionally, experience shows that synthetic transaction testing (also known as canary testing,
but not to be confused with canary deployments) that can run and simulate customer behavior is
among the most important testing processes. Run these tests constantly against your workload
endpoints from diverse remote locations. Amazon CloudWatch Synthetics allows you to create
canaries to monitor your endpoints and APIs.
Implementation guidance
• Test functional requirements. These include unit tests and integration tests that validate required
functionality.
• Use CodePipeline with AWS CodeBuild to test code and run builds
• AWS CodePipeline Adds Support for Unit and Custom Integration Testing with AWS CodeBuild
Resources
Related documents:
Run chaos experiments regularly in environments that are in or as close to production as possible
to understand how your system responds to adverse conditions.
Desired outcome:
The resilience of the workload is regularly verified by applying chaos engineering in the form
of fault injection experiments or injection of unexpected load, in addition to resilience testing
that validates known expected behavior of your workload during an event. Combine both chaos
engineering and resilience testing to gain confidence that your workload can survive component
failure and can recover from unexpected disruptions with minimal to no impact.
Common anti-patterns:
• Designing for resiliency, but not verifying how the workload functions as a whole when faults
occur.
• Never experimenting under real-world conditions and expected load.
• Not treating your experiments as code or maintaining them through the development cycle.
• Not running chaos experiments both as part of your CI/CD pipeline, as well as outside of
deployments.
• Neglecting to use past post-incident analyses when determining which faults to experiment with.
Benefits of establishing this best practice: Injecting faults to verify the resilience of your workload
allows you to gain confidence that the recovery procedures of your resilient design will work in the
case of a real fault.
These faults include networking effects such as latency, dropped messages, and DNS failures,
which could include the inability to resolve a name, reach the DNS service, or establish connections to
dependent services.
AWS Fault Injection Service (AWS FIS) is a fully managed service for running fault injection
experiments that can be used as part of your CD pipeline, or outside of the pipeline. AWS FIS is a
good choice to use during chaos engineering game days. It supports simultaneously introducing
faults across different types of resources including Amazon EC2, Amazon Elastic Container Service
(Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon RDS. These faults
include termination of resources, forcing failovers, stressing CPU or memory, throttling, latency,
and packet loss. Since it is integrated with Amazon CloudWatch Alarms, you can set up stop
conditions as guardrails to roll back an experiment if it causes unexpected impact.
AWS Fault Injection Service integrates with AWS resources to allow you to run fault injection
experiments for your workloads.
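The following is a minimal sketch of starting an AWS FIS experiment from a pipeline or script. It assumes you have already created an experiment template with a CloudWatch alarm configured as a stop condition; the template ID is a placeholder.

# A minimal sketch of starting an AWS FIS experiment from an existing template.
import boto3

fis = boto3.client("fis")

def run_experiment(template_id="EXT1a2b3c4d5e6f7"):  # placeholder template ID
    response = fis.start_experiment(experimentTemplateId=template_id)
    experiment = response["experiment"]
    print(f"Started experiment {experiment['id']} "
          f"(state: {experiment['state']['status']})")
    return experiment["id"]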
There are also several third-party options for fault injection experiments. These include open-
source tools such as Chaos Toolkit, Chaos Mesh, and Litmus Chaos, as well as commercial options
like Gremlin. To expand the scope of faults that can be injected on AWS, AWS FIS integrates
with Chaos Mesh and Litmus Chaos, allowing you to coordinate fault injection workflows among
multiple tools. For example, you can run a stress test on a pod’s CPU using Chaos Mesh or Litmus
faults while terminating a randomly selected percentage of cluster nodes using AWS FIS fault
actions.
Chaos engineering and continuous resilience flywheel, using the scientific method by Adrian
Hornsby.
a. Define steady state as some measurable output of a workload that indicates normal behavior.
Your workload exhibits steady state if it is operating reliably and as expected. Therefore,
validate that your workload is healthy before defining steady state. Steady state does not
necessarily mean no impact to the workload when a fault occurs, as a certain percentage
in faults could be within acceptable limits. The steady state is your baseline that you will
observe during the experiment, which will highlight anomalies if your hypothesis defined in
the next step does not turn out as expected.
For example, a steady state of a payments system can be defined as the processing of 300
TPS with a success rate of 99% and round-trip time of 500 ms.
b. Form a hypothesis about how the workload will react to the fault.
with the experiment. There are several options for injecting the faults. For workloads on
AWS, AWS FIS provides many predefined fault simulations called actions. You can also define
custom actions that run in AWS FIS using AWS Systems Manager documents.
We discourage the use of custom scripts for chaos experiments, unless the scripts have
the capabilities to understand the current state of the workload, are able to emit logs, and
provide mechanisms for rollbacks and stop conditions where possible.
An effective framework or toolset which supports chaos engineering should track the current
state of an experiment, emit logs, and provide rollback mechanisms to support the controlled
running of an experiment. Start with an established service like AWS FIS that allows you
to perform experiments with a clearly defined scope and safety mechanisms that rollback
the experiment if the experiment introduces unexpected turbulence. To learn about a wider
variety of experiments using AWS FIS, also see the Resilient and Well-Architected Apps with
Chaos Engineering lab. Also, AWS Resilience Hub will analyze your workload and create
experiments that you can choose to implement and run in AWS FIS.
Note
For every experiment, clearly understand the scope and its impact. We recommend
that faults should be simulated first on a non-production environment before being
run in production.
Experiments should run in production under real-world load using canary deployments
that spin up both a control and experimental system deployment, where feasible. Running
experiments during off-peak times is a good practice to mitigate potential impact when first
experimenting in production. Also, if using actual customer traffic poses too much risk, you
can run experiments using synthetic traffic on production infrastructure against the control
and experimental deployments. When using production is not possible, run experiments in
pre-production environments that are as close to production as possible.
You must establish and monitor guardrails to ensure the experiment does not impact
production traffic or other systems beyond acceptable limits. Establish stop conditions
to stop an experiment if it reaches a threshold on a guardrail metric that you define. This
should include the metrics for steady state for the workload, as well as the metric against the
components into which you’re injecting the fault. A synthetic monitor (also known as a user
canary) is one metric you should usually include as a user proxy. Stop conditions for AWS FIS
In our two previous examples, we include the steady state metrics of less than 0.01% increase
in server-side (5xx) errors and less than one minute of database read and write errors.
The 5xx errors are a good metric because they are a consequence of the failure mode that
a client of the workload will experience directly. The database errors measurement is good
as a direct consequence of the fault, but should also be supplemented with a client impact
measurement such as failed customer requests or errors surfaced to the client. Additionally,
include a synthetic monitor (also known as a user canary) on any APIs or URIs directly
accessed by the client of your workload.
If steady state was not maintained, then investigate how the workload design can be
improved to mitigate the fault, applying the best practices of the AWS Well-Architected
Reliability pillar. Additional guidance and resources can be found in the AWS Builder’s Library,
which hosts articles about how to improve your health checks or employ retries with backoff
in your application code, among others.
After these changes have been implemented, run the experiment again (shown by the dotted
line in the chaos engineering flywheel) to determine their effectiveness. If the verify step
indicates the hypothesis holds true, then the workload will be in steady state, and the cycle
continues.
A chaos experiment is a cycle, and experiments should be run regularly as part of chaos
engineering. After a workload meets the experiment’s hypothesis, the experiment should be
automated to run continually as a regression part of your CI/CD pipeline. To learn how to do
this, see this blog on how to run AWS FIS experiments using AWS CodePipeline. This lab on
recurrent AWS FIS experiments in a CI/CD pipeline allows you to work hands-on.
Fault injection experiments are also a part of game days (see REL12-BP06 Conduct game
days regularly). Game days simulate a failure or event to verify systems, processes, and team
responses. The purpose is to actually perform the actions the team would perform as if an
exceptional event happened.
Results for fault injection experiments must be captured and persisted. Include all necessary
data (such as time, workload, and conditions) to be able to later analyze experiment results and trends.
Related tools:
Use game days to regularly exercise your procedures for responding to events and failures as close
to production as possible (including in production environments) with the people who will be
involved in actual failure scenarios. Game days enforce measures to ensure that production events
do not impact users.
Game days simulate a failure or event to test systems, processes, and team responses. The
purpose is to actually perform the actions the team would perform as if an exceptional event
happened. This will help you understand where improvements can be made and can help develop
organizational experience in dealing with events. These should be conducted regularly so that your
team builds muscle memory on how to respond.
After your design for resiliency is in place and has been tested in non-production environments,
a game day is the way to ensure that everything works as planned in production. A game day,
especially the first one, is an “all hands on deck” activity where engineers and operations are
all informed when it will happen, and what will occur. Runbooks are in place. Simulated events
are run, including possible failure events, in the production systems in the prescribed manner,
and impact is assessed. If all systems operate as designed, detection and self-healing will occur
with little to no impact. However, if negative impact is observed, the test is rolled back and the
workload issues are remedied, manually if necessary (using the runbook). Since game days often
take place in production, all precautions should be taken to ensure that there is no impact on
availability to your customers.
Common anti-patterns:
Best practices
• REL13-BP01 Define recovery objectives for downtime and data loss
The workload has a recovery time objective (RTO) and recovery point objective (RPO).
Recovery Time Objective (RTO) is the maximum acceptable delay between the interruption of
service and restoration of service. This determines what is considered an acceptable time window
when service is unavailable.
Recovery Point Objective (RPO) is the maximum acceptable amount of time since the last data
recovery point. This determines what is considered an acceptable loss of data between the last
recovery point and the interruption of service.
RTO and RPO values are important considerations when selecting an appropriate Disaster Recovery
(DR) strategy for your workload. These objectives are determined by the business, and then used by
technical teams to select and implement a DR strategy.
Desired Outcome:
Every workload has an assigned RTO and RPO, defined based on business impact. The workload
is assigned to a predefined tier, defining service availability and acceptable loss of data, with
an associated RTO and RPO. If such tiering is not possible then this can be assigned bespoke
per workload, with the intent to create tiers later. RTO and RPO are used as one of the primary
considerations for selection of a disaster recovery strategy implementation for the workload.
Additional considerations in picking a DR strategy are cost constraints, workload dependencies, and
operational requirements.
For RTO, understand impact based on duration of an outage. Is it linear, or are there nonlinear
implications? (For example, after four hours, you shut down a manufacturing line until the start of
the next shift).
Implementation guidance
For the given workload, you must understand the impact of downtime and lost data on your
business. The impact generally grows larger with greater downtime or data loss, but the shape
of this growth can differ based on the workload type. For example, you may be able to tolerate
downtime for up to an hour with little impact, but after that impact quickly rises. Impact to
business manifests in many forms including monetary cost (such as lost revenue), customer trust
(and impact to reputation), operational issues (such as missing payroll or decreased productivity),
and regulatory risk. Use the following steps to understand these impacts, and set RTO and RPO for
your workload.
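The following is a minimal sketch of how criticality tiers might be recorded so that workloads inherit their RTO and RPO from an assigned tier. The tier names and values are illustrative placeholders, not recommendations; your business stakeholders set the real numbers.

# A minimal sketch of tier-based recovery objectives (illustrative values only).
from datetime import timedelta

TIERS = {
    "critical": {"rto": timedelta(minutes=15), "rpo": timedelta(minutes=5)},
    "high":     {"rto": timedelta(hours=1),    "rpo": timedelta(minutes=15)},
    "medium":   {"rto": timedelta(hours=4),    "rpo": timedelta(hours=1)},
    "low":      {"rto": timedelta(hours=24),   "rpo": timedelta(hours=4)},
}

def objectives_for(workload_tier):
    tier = TIERS[workload_tier]
    return f"RTO={tier['rto']}, RPO={tier['rpo']}"

print(objectives_for("high"))  # RTO=1:00:00, RPO=0:15:00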
Implementation Steps
1. Determine your business stakeholders for this workload, and engage with them to implement
these steps. Recovery objectives for a workload are a business decision. Technical teams then
work with business stakeholders to use these objectives to select a DR strategy.
Note
For steps 2 and 3, you can use the section called “Implementation worksheet”.
2. Gather the necessary information to make a decision by answering the questions below.
3. Do you have categories or tiers of criticality for workload impact in your organization?
b. If no, then establish these categories. Create five or fewer categories and refine the range of
your recovery time objective for each one. Example categories include: critical, high, medium,
low. To understand how workloads map to categories, consider whether the workload is
mission critical, business important, or non-business driving.
c. Set workload RTO and RPO based on category. Always choose a category more strict (lower
RTO and RPO) than the raw values calculated entering this step. If this results in an unsuitably
large change in value, then consider creating a new category.
4. Based on these answers, assign RTO and RPO values to the workload. This can be done directly,
or by assigning the workload to a predefined tier of service.
5. Document the disaster recovery plan (DRP) for this workload, which is a part of your
organization’s business continuity plan (BCP), in a location accessible to the workload team and
stakeholders.
b. Choose recovery objectives that are achievable given the recovery capabilities of downstream
dependencies. Non-critical downstream dependencies (ones you can “work around”) can
be excluded. Or, work with critical downstream dependencies to improve their recovery
capabilities where necessary.
Additional questions
Consider these questions, and how they may apply to this workload:
4. Do you have different RTO and RPO depending on the type of outage (Region vs. AZ, etc.)?
5. Is there a specific time (seasonality, sales events, product launches) when your RTO/RPO may
change? If so, what is the different measurement and time boundary?
8. What other operational impacts may occur if workload is disrupted? For example, impact to
employee productivity if email systems are unavailable, or if Payroll systems are unable to
submit transactions.
9. How do workload RTO and RPO align with Line of Business and Organizational DR Strategy?
10.Are there internal contractual obligations for providing a service? Are there penalties for not
meeting them?
Implementation worksheet
You can use this worksheet for implementation steps 2 and 3. You may adjust this worksheet to
suit your specific needs, such as adding additional questions.
Related videos:
• AWS re:Invent 2018: Architecture Patterns for Multi-Region Active-Active Applications (ARC209-
R2)
Define a disaster recovery (DR) strategy that meets your workload's recovery objectives. Choose a
strategy such as backup and restore, standby (active/passive), or active/active.
Desired outcome: For each workload, there is a defined and implemented DR strategy that allows
the workload to achieve DR objectives. DR strategies between workloads make use of reusable
patterns (such as the strategies previously described).
Benefits of establishing this best practice:
• Using defined recovery strategies allows you to use common tooling and test procedures.
• Using defined recovery strategies improves knowledge sharing between teams and
implementation of DR on the workloads they own.
Level of risk exposed if this best practice is not established: High. Without a planned,
implemented, and tested DR strategy, you are unlikely to achieve recovery objectives in the event
of a disaster.
• Pilot light (RPO in minutes, RTO in tens of minutes): Provision a copy of your core workload
infrastructure in the recovery Region. Replicate your data into the recovery Region and create
backups of it there. Resources required to support data replication and backup, such as
databases and object storage, are always on. Other elements such as application servers or
serverless compute are not deployed, but can be created when needed with the necessary
configuration and application code.
• Warm standby (RPO in seconds, RTO in minutes): Maintain a scaled-down but fully functional
version of your workload always running in the recovery Region. Business-critical systems are
fully duplicated and are always on, but with a scaled down fleet. Data is replicated and live in the
recovery Region. When the time comes for recovery, the system is scaled up quickly to handle
the production load. The more scaled-up the warm standby is, the lower RTO and control plane
reliance will be. When fully scaled, this is known as hot standby.
• Multi-Region (multi-site) active-active (RPO near zero, RTO potentially zero): Your workload is
deployed to, and actively serving traffic from, multiple AWS Regions. This strategy requires you
to synchronize data across Regions. Possible conflicts caused by writes to the same record in two
different regional replicas must be avoided or handled, which can be complex. Data replication is
useful for data synchronization and will protect you against some types of disaster, but it will not
protect you against data corruption or destruction unless your solution also includes options for
point-in-time recovery.
Note
The difference between pilot light and warm standby can sometimes be difficult to
understand. Both include an environment in your recovery Region with copies of your
primary region assets. The distinction is that pilot light cannot process requests without
additional action taken first, while warm standby can handle traffic (at reduced capacity
levels) immediately. Pilot light will require you to turn on servers, possibly deploy
additional (non-core) infrastructure, and scale up, while warm standby only requires you
to scale up (everything is already deployed and running). Choose between these based on
your RTO and RPO needs.
When cost is a concern, and you wish to achieve similar RPO and RTO objectives to those
defined in the warm standby strategy, you could consider cloud-native solutions, like AWS
Elastic Disaster Recovery, that take the pilot light approach and offer improved RPO and
RTO targets.
Backup and restore is the least complex strategy to implement, but will require more time and
effort to restore the workload, leading to higher RTO and RPO. It is a good practice to always
make backups of your data, and copy these to another site (such as another AWS Region).
For more details on this strategy see Disaster Recovery (DR) Architecture on AWS, Part II: Backup
and Restore with Rapid Recovery.
Pilot light
With the pilot light approach, you replicate your data from your primary Region to your recovery
Region. Core resources used for the workload infrastructure are deployed in the recovery Region;
however, additional resources and any dependencies are still needed to make this a functional
stack. For example, in Figure 20, no compute instances are deployed.
Using warm standby or pilot light requires scaling up resources in the recovery Region. To verify
capacity is available when needed, consider the use of capacity reservations for EC2 instances.
If using AWS Lambda, then provisioned concurrency can provide runtime environments so that
they are prepared to respond immediately to your function's invocations.
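As a sketch of that preparation, the following call pre-warms a Lambda function in the recovery Region with provisioned concurrency. The Region, function name, alias, and concurrency value are placeholders.

# A minimal sketch of configuring provisioned concurrency in the recovery Region.
import boto3

lambda_recovery = boto3.client("lambda", region_name="us-west-2")  # placeholder recovery Region

def prewarm(function_name="checkout-api", alias="live", concurrency=100):
    lambda_recovery.put_provisioned_concurrency_config(
        FunctionName=function_name,
        Qualifier=alias,
        ProvisionedConcurrentExecutions=concurrency,
    )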
For more details on this strategy, see Disaster Recovery (DR) Architecture on AWS, Part III: Pilot
Light and Warm Standby.
Multi-site active/active
You can run your workload simultaneously in multiple Regions as part of a multi-site active/
active strategy. Multi-site active/active serves traffic from all regions to which it is deployed.
Customers may select this strategy for reasons other than DR. It can be used to increase
availability, or when deploying a workload to a global audience (to put the endpoint closer to
users and/or to deploy stacks localized to the audience in that region). As a DR strategy, if the
workload cannot be supported in one of the AWS Regions to which it is deployed, then that
Region is evacuated, and the remaining Regions are used to maintain availability. Multi-site
active/active is the most operationally complex of the DR strategies, and should only be selected
when business requirements necessitate it.
With all strategies, you must also mitigate against a data disaster. Continuous data replication
protects you against some types of disaster, but it may not protect you against data corruption
or destruction unless your strategy also includes versioning of stored data or options for point-
in-time recovery. You must also back up the replicated data in the recovery site to create point-
in-time backups in addition to the replicas.
When using multiple AZs within a single Region, your DR implementation uses multiple
elements of the above strategies. First you must create a high-availability (HA) architecture,
using multiple AZs as shown in Figure 23. This architecture makes use of a multi-site active/
active approach, as the Amazon EC2 instances and the Elastic Load Balancer have resources
deployed in multiple AZs, actively handling requests. The architecture also demonstrates hot
3. Assess the resources of your workload, and what their configuration will be in the recovery
Region prior to failover (during normal operation).
For infrastructure and AWS resources use infrastructure as code such as AWS CloudFormation
or third-party tools like Hashicorp Terraform. To deploy across multiple accounts and Regions
with a single operation you can use AWS CloudFormation StackSets. For Multi-site active/
active and Hot Standby strategies, the deployed infrastructure in your recovery Region has
the same resources as your primary Region. For Pilot Light and Warm Standby strategies, the
deployed infrastructure will require additional actions to become production ready. Using
CloudFormation parameters and conditional logic, you can control whether a deployed stack is
active or standby with a single template. When using Elastic Disaster Recovery, the service will
replicate and orchestrate the restoration of application configurations and compute resources.
All DR strategies require that data sources are backed up within the AWS Region, and then those
backups are copied to the recovery Region. AWS Backup provides a centralized view where
you can configure, schedule, and monitor backups for these resources. For Pilot Light, Warm
Standby, and Multi-site active/active, you should also replicate data from the primary Region
to data resources in the recovery Region, such as Amazon Relational Database Service (Amazon
RDS) DB instances or Amazon DynamoDB tables. These data resources are therefore live and
ready to serve requests in the recovery Region.
To learn more about how AWS services operate across Regions, see this blog series on Creating a
Multi-Region Application with AWS Services.
4. Determine and implement how you will make your recovery Region ready for failover when
needed (during a disaster event).
For multi-site active/active, failover means evacuating a Region, and relying on the remaining
active Regions. In general, those Regions are ready to accept traffic. For Pilot Light and Warm
Standby strategies, your recovery actions will need to deploy the missing resources, such as the
EC2 instances in Figure 20, plus any other missing resources.
For all of the above strategies you may need to promote read-only instances of databases to
become the primary read/write instance.
For backup and restore, restoring data from backup creates resources for that data such as EBS
volumes, RDS DB instances, and DynamoDB tables. You also need to restore the infrastructure
and deploy code. You can use AWS Backup to restore data in the recovery Region. See REL09-
BP01 Identify and back up all data that needs to be backed up, or reproduce the data from
is maintained. Therefore, the former read/write instance in the primary Region will become a
replica and receive updates from the recovery Region.
In cases where this is not automatic, you will need to re-establish the database in the primary
Region as a replica of the database in the recovery Region. In many cases this will involve
deleting the old primary database, and creating new replicas.
After a failover, if you can continue running in your recovery Region, consider making this the
new primary Region. You would still do all the above steps to make the former primary Region
into a recovery Region. Some organizations do a scheduled rotation, swapping their primary and
recovery Regions periodically (for example every three months).
All of the steps required to fail over and fail back should be maintained in a playbook that is
available to all members of the team, and is periodically reviewed.
When using Elastic Disaster Recovery, the service will assist in orchestrating and automating the
failback process. For more details, see Performing a failback.
Resources
• the section called “REL09-BP01 Identify and back up all data that needs to be backed up, or
reproduce the data from sources”
• the section called “REL11-BP04 Rely on the data plane and not the control plane during
recovery”
• the section called “REL13-BP01 Define recovery objectives for downtime and data loss”
Related documents:
Implementation guidance
A pattern to avoid is developing recovery paths that are rarely exercised. For example, you might
have a secondary data store that is used for read-only queries. When you write to a data store and
the primary fails, you might want to fail over to the secondary data store. If you don’t frequently
test this failover, you might find that your assumptions about the capabilities of the secondary
data store are incorrect. The capacity of the secondary, which might have been sufficient when you
last tested, might no longer be able to tolerate the load under this scenario. Our experience has
shown that the only error recovery that works is the path you test frequently. This is why having
a small number of recovery paths is best. You can establish recovery patterns and regularly test
them. If you have a complex or critical recovery path, you still need to regularly exercise that failure
in production to convince yourself that the recovery path works. In the example we just discussed,
you should fail over to the standby regularly, regardless of need.
Implementation steps
1. Engineer your workloads for recovery. Regularly test your recovery paths. Recovery-oriented
computing identifies the characteristics in systems that enhance recovery: isolation and
redundancy, system-wide ability to roll back changes, ability to monitor and determine health,
ability to provide diagnostics, automated recovery, modular design, and ability to restart.
Exercise the recovery path to verify that you can accomplish the recovery in the specified time
to the specified state. Use your runbooks during this recovery to document problems and find
solutions for them before the next test.
2. For Amazon EC2-based workloads, use AWS Elastic Disaster Recovery to implement and launch
drill instances for your DR strategy. AWS Elastic Disaster Recovery provides the ability to
efficiently run drills, which helps you prepare for a failover event. You can also frequently launch
your instances using Elastic Disaster Recovery for test and drill purposes without redirecting
the traffic.
Resources
Related documents:
Implementation guidance
• Ensure that your delivery pipelines deliver to both your primary and backup sites. Delivery
pipelines for deploying applications into production must distribute to all the specified disaster
recovery strategy locations, including dev and test environments.
• Permit AWS Config to track potential drift locations. Use AWS Config rules to create systems that
enforce your disaster recovery strategies and generate alerts when they detect drift.
• Remediating Noncompliant AWS Resources by AWS Config Rules
• AWS Systems Manager Automation
• Use AWS CloudFormation to deploy your infrastructure. AWS CloudFormation can detect drift
between what your CloudFormation templates specify and what is actually deployed.
• AWS CloudFormation: Detect Drift on an Entire CloudFormation Stack
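The following is a minimal sketch of kicking off drift detection for a stack and polling the result, so that drift in a disaster recovery Region can feed your alerting. The stack name is a placeholder.

# A minimal sketch of running CloudFormation drift detection and reading the result.
import time
import boto3

cfn = boto3.client("cloudformation")

def detect_drift(stack_name="dr-recovery-stack"):
    detection_id = cfn.detect_stack_drift(StackName=stack_name)["StackDriftDetectionId"]
    while True:
        status = cfn.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
            return status.get("StackDriftStatus", "UNKNOWN")  # e.g. IN_SYNC or DRIFTED
        time.sleep(5)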
Resources
Related documents:
Related videos:
• AWS re:Invent 2018: Architecture Patterns for Multi-Region Active-Active Applications (ARC209-
R2)
Use AWS or third-party tools to automate system recovery and route traffic to the DR site or
Region.
Related videos:
• AWS re:Invent 2018: Architecture Patterns for Multi-Region Active-Active Applications (ARC209-
R2)
Performance efficiency
The Performance Efficiency pillar includes the ability to use cloud resources efficiently to meet
performance requirements, and to maintain that efficiency as demand changes and technologies
evolve. You can find prescriptive guidance on implementation in the Performance Efficiency Pillar
whitepaper.
• Architecture selection
• Data management
Architecture selection
Questions
• PERF 1. How do you select appropriate cloud resources and architecture for your workload?
Implementation guidance
AWS continually releases new services and features that can improve performance and reduce
the cost of cloud workloads. Staying up-to-date with these new services and features is crucial for
maintaining performance efficiency in the cloud. Modernizing your workload architecture also helps
you accelerate productivity, drive innovation, and unlock more growth opportunities.
Implementation steps
• Inventory your workload software and architecture for related services. Decide which category of
products to learn more about.
• Explore AWS offerings to identify and learn about the relevant services and configuration
options that can help you improve performance and reduce cost and operational complexity.
• Amazon Web Services Cloud
• AWS Academy
• What’s New with AWS?
• AWS Blog
• AWS Skill Builder
• AWS Events and Webinars
• AWS Training and Certifications
• AWS YouTube Channel
• AWS Workshops
• AWS Communities
• Use Amazon Q to get relevant information and advice about services.
• Use sandbox (non-production) environments to learn and experiment with new services without
incurring extra cost.
• Continually learn about new cloud services and features.
Resources
Related documents:
Benefits of establishing this best practice: Using guidance from a cloud provider or an
appropriate partner can help you to make the right architectural choices for your workload and
give you confidence in your decisions.
Implementation guidance
AWS offers a wide range of guidance, documentation, and resources that can help you build
and manage efficient cloud workloads. AWS documentation provides code samples, tutorials,
and detailed service explanations. In addition to documentation, AWS provides training and
certification programs, solutions architects, and professional services that can help customers
explore different aspects of cloud services and implement efficient cloud architecture on AWS.
Leverage these resources to gain valuable knowledge and insights into best practices, save time,
and achieve better outcomes in the AWS Cloud.
Implementation steps
• Review AWS documentation and guidance and follow the best practices. These resources can
help you effectively choose and configure services and achieve better performance.
• AWS documentation (like user guides and whitepapers)
• AWS Blog
• AWS Training and Certifications
• AWS YouTube Channel
• Join AWS partner events (like AWS Global Summits, AWS re:Invent, user groups, and workshops)
to learn from AWS experts about best practices for using AWS services.
• Learn step-by-step with an AWS Partner Learning Plan
• AWS Events and Webinars
• AWS Workshops
• AWS Communities
• Reach out to AWS for assistance when you need additional guidance or product information.
AWS Solutions Architects and AWS Professional Services provide guidance for solution
implementation. AWS Partners provide AWS expertise to help you unlock agility and innovation
for your business.
• Use AWS Support if you need technical support to use a service effectively. Our Support plans
are designed to give you the right mix of tools and access to expertise so that you can be
Benefits of establishing this best practice: Factoring cost into your decision making allows you to
use more efficient resources and explore other investments.
Implementation guidance
Optimizing workloads for cost can improve resource utilization and avoid waste in a cloud
workload. Factoring cost into architectural decisions usually includes right-sizing workload
components and enabling elasticity, which results in improved cloud workload performance
efficiency.
Implementation steps
• Establish cost objectives like budget limits for your cloud workload.
• Identify the key components (like instances and storage) that drive cost of your workload.
You can use AWS Pricing Calculator and AWS Cost Explorer to identify key cost drivers in your
workload.
• Understand pricing models in the cloud, such as On-Demand, Reserved Instances, Savings Plans,
and Spot Instances.
• Use Well-Architected cost optimization best practices to optimize these key components for cost.
• Continually monitor and analyze cost to identify cost optimization opportunities in your
workload.
• Use AWS Budgets to get alerts for unacceptable costs.
• Use AWS Compute Optimizer or AWS Trusted Advisor to get cost optimization
recommendations.
• Use AWS Cost Anomaly Detection to get automated cost anomaly detection and root cause
analysis.
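The following is a minimal sketch of identifying last month's top cost-driving services with the Cost Explorer API, the kind of data you would compare against your cost objectives. It assumes Cost Explorer is enabled for the account.

# A minimal sketch of listing last month's top cost drivers by service.
import boto3
from datetime import date, timedelta

ce = boto3.client("ce")

def top_cost_drivers(n=5):
    end = date.today().replace(day=1)                 # first day of this month
    start = (end - timedelta(days=1)).replace(day=1)  # first day of last month
    result = ce.get_cost_and_usage(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    groups = result["ResultsByTime"][0]["Groups"]
    costs = [(g["Keys"][0], float(g["Metrics"]["UnblendedCost"]["Amount"])) for g in groups]
    return sorted(costs, key=lambda item: item[1], reverse=True)[:n]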
Resources
Related documents:
Common anti-patterns:
• You assume that all performance gains should be implemented, even if there are tradeoffs for
implementation.
• You only evaluate changes to workloads when a performance issue has reached a critical point.
Benefits of establishing this best practice: When you are evaluating potential performance-
related improvements, you must decide if the tradeoffs for the changes are acceptable with
the workload requirements. In some cases, you may have to implement additional controls to
compensate for the tradeoffs.
Implementation guidance
Identify critical areas in your architecture in terms of performance and customer impact. Determine
how you can make improvements, what trade-offs those improvements bring, and how they
impact the system and the user experience. For example, implementing caching can
dramatically improve performance, but it requires a clear strategy for how and when to update or
invalidate cached data to prevent incorrect system behavior.
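The following is a minimal sketch of that tradeoff: a read-through cache with a time-to-live plus explicit invalidation, so staleness is bounded and writes can force a refresh. The loader function is a placeholder for a call to your data source.

# A minimal sketch of a read-through cache with TTL and explicit invalidation.
import time

class TTLCache:
    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expires_at)

    def get(self, key, loader):
        value, expires_at = self._store.get(key, (None, 0))
        if time.time() < expires_at:
            return value          # fast path: served from cache
        value = loader(key)       # slow path: read through to the source
        self._store[key] = (value, time.time() + self.ttl)
        return value

    def invalidate(self, key):
        # Call this on writes so readers do not see stale data for a full TTL.
        self._store.pop(key, None)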
Implementation steps
Resources
Related documents:
Deploy your workload using policies or reference architectures. Integrate the services into your
cloud deployment, then use your performance tests to verify that you can continue to meet your
performance requirements.
Implementation steps
Resources
Related documents:
Related videos:
• This is my Architecture
• AWS re:Invent 2022 - Accelerate value for your business with SAP & AWS reference architecture
Related examples:
• AWS Samples
• AWS SDK Examples
• Define the objectives, baseline, testing scenarios, metrics (like CPU utilization, latency, or
throughput), and KPIs for your benchmark.
• Focus on user requirements in terms of user experience and factors such as response time and
accessibility.
• Identify a benchmarking tool that is suitable for your workload. You can use AWS services like
Amazon CloudWatch or a third-party tool that is compatible with your workload.
• Configure and instrument:
• Set up your environment and configure your resources.
• Implement monitoring and logging to capture testing results.
• Benchmark and monitor:
• Perform your benchmark tests and monitor the metrics during the test.
• Analyze and document:
• Document your benchmarking process and findings.
• Analyze the results to identify bottlenecks, trends, and areas of improvement.
• Use test results to make architectural decisions and adjust your workload. This may include
changing services or adopting new features.
• Optimize and repeat:
• Adjust resource configurations and allocations based on your benchmarks.
• Retest your workload after the adjustment to validate your improvements.
• Document your learnings, and repeat the process to identify other areas of improvement.
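The following is a minimal sketch of the benchmark-and-monitor step above: run a request function repeatedly and report latency percentiles to compare against your KPIs. The make_request function is a placeholder for a call against your workload.

# A minimal sketch of a latency benchmark reporting p50/p90/p99 in milliseconds.
import time
import statistics

def benchmark(make_request, iterations=200):
    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        make_request()
        samples_ms.append((time.perf_counter() - start) * 1000)
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    return {
        "p50_ms": round(cuts[49], 1),
        "p90_ms": round(cuts[89], 1),
        "p99_ms": round(cuts[98], 1),
    }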
Resources
Related documents:
Implementation guidance
Use internal experience and knowledge of the cloud or external resources such as published use
cases or whitepapers to choose resources and services in your architecture. You should have a well-
defined process that encourages experimentation and benchmarking with the services that could
be used in your workload.
Backlogs for critical workloads should consist of not just user stories which deliver functionality
relevant to business and users, but also technical stories which form an architecture runway for
the workload. This runway is informed by new advancements in technology and new services and
adopts them based on data and proper justification. This verifies that the architecture remains
future-proof and does not stagnate.
Implementation steps
• Create an architecture runway or a technology backlog which is prioritized along with the
functional backlog.
• Evaluate and assess different cloud services (for more detail, see PERF01-BP01 Learn about and
understand available cloud services and features).
• Explore different architectural patterns, like microservices or serverless, that meet your
performance requirements (for more detail, see PERF01-BP02 Use guidance from your cloud
provider or an appropriate partner to learn about architecture patterns and best practices).
• Consult other teams, architecture diagrams, and resources, such as AWS Solution Architects, AWS
Architecture Center, and AWS Partner Network, to help you choose the right architecture for
your workload.
• Define performance metrics like throughput and response time that can help you evaluate the
performance of your workload.
• Experiment and use defined metrics to validate the performance of the selected architecture.
• Continually monitor and make adjustments as needed to maintain the optimal performance of
your architecture.
components and allow different features to improve performance. Selecting the wrong compute
choice for an architecture can lead to lower performance efficiency.
Best practices
• PERF02-BP01 Select the best compute options for your workload
• PERF02-BP02 Understand the available compute configuration and features
• PERF02-BP03 Collect compute-related metrics
• PERF02-BP04 Configure and right-size compute resources
• PERF02-BP05 Scale your compute resources dynamically
• PERF02-BP06 Use optimized hardware-based compute accelerators
Selecting the most appropriate compute option for your workload allows you to improve
performance, reduce unnecessary infrastructure costs, and lower the operational efforts required
to maintain your workload.
Common anti-patterns:
• You use the same compute option that was used on premises.
• You lack awareness of the cloud compute options, features, and solutions, and how those
solutions might improve your compute performance.
• You over-provision an existing compute option to meet scaling or performance requirements
when an alternative compute option would align to your workload characteristics more precisely.
Benefits of establishing this best practice: By identifying the compute requirements and
evaluating against the options available, you can make your workload more resource efficient.
Implementation guidance
To optimize your cloud workloads for performance efficiency, it is important to select the most
appropriate compute options for your use case and performance requirements. AWS provides a
variety of compute options that cater to different workloads in the cloud. For instance, you can
use Amazon EC2 to launch and manage virtual servers, AWS Lambda to run code without having
to provision or manage servers, Amazon ECS or Amazon EKS to run and manage containers, or
• Evaluate cost (like hourly charge or data transfer) and management overhead (like patching and
scaling) associated with each compute option.
• Perform experiments and benchmarking in a non-production environment to identify which
compute option can best address your workload requirements.
• Once you have experimented and identified your new compute solution, plan your migration and
validate your performance metrics.
• Use AWS monitoring tools like Amazon CloudWatch and optimization services like AWS Compute
Optimizer to continually optimize your compute resources based on real-world usage patterns.
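The following is a minimal sketch of pulling AWS Compute Optimizer findings so right-sizing opportunities are reviewed regularly rather than ad hoc. It assumes Compute Optimizer is opted in for the account.

# A minimal sketch of summarizing Compute Optimizer EC2 recommendations.
import boto3

co = boto3.client("compute-optimizer")

def rightsizing_findings():
    findings = []
    response = co.get_ec2_instance_recommendations()
    for rec in response.get("instanceRecommendations", []):
        options = rec.get("recommendationOptions", [])
        top_option = options[0] if options else {}
        findings.append({
            "instance": rec["instanceArn"],
            "finding": rec["finding"],  # e.g. Overprovisioned, Optimized
            "suggested_type": top_option.get("instanceType"),
        })
    return findings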
Resources
Related documents:
• You do not evaluate compute options or available instance families against workload
characteristics.
• You over-provision compute resources to meet peak-demand requirements.
Benefits of establishing this best practice: Be familiar with AWS compute features and
configurations so that you can use a compute solution optimized to meet your workload
characteristics and needs.
Implementation guidance
Each compute solution has unique configurations and features available to support different
workload characteristics and requirements. Learn how these options complement your workload,
and determine which configuration options are best for your application. Examples of these
options include instance family, sizes, features (GPU, I/O), bursting, time-outs, function sizes,
container instances, and concurrency. If your workload has been using the same compute option
for more than four weeks and you anticipate that the characteristics will remain the same in the
future, you can use AWS Compute Optimizer to find out if your current compute option is suitable
for the workload from a CPU and memory perspective.
Implementation steps
• Review AWS documentation and best practices to learn about recommended configuration
options that can help improve compute performance. Here are some key configuration options
to consider:
Resources
Related documents:
Related videos:
• AWS re:Invent 2023 – AWS Graviton: The best price performance for your AWS workloads
• AWS re:Invent 2023 – New Amazon EC2 generative AI capabilities in AWS Management Console
• AWS re:Invent 2023 – What's new with Amazon EC2
• AWS re:Invent 2023 – Smart savings: Amazon EC2 cost-optimization strategies
• AWS re:Invent 2021 – Powering next-gen Amazon EC2: Deep dive on the Nitro System
• AWS re:Invent 2019 – Amazon EC2 foundations
• AWS re:Invent 2022 – Optimizing Amazon EKS for performance and cost on AWS
Related examples:
into utilization levels or performance bottlenecks. Use these metrics as part of a data-driven
approach to actively tune and optimize your workload's resources. In an ideal case, you should
collect all metrics related to your compute resources in a single platform with retention policies
implemented to support cost and operational goals.
Implementation steps
• Identify which performance-related metrics are relevant to your workload. You should collect
metrics around resource utilization and the way your cloud workload is operating (like response
time and throughput).
• Choose and set up the right logging and monitoring solution for your workload.
• Define the required filter and aggregation for the metrics based on your workload requirements.
• Quantify custom application metrics with Amazon CloudWatch Logs and metric filters
• If required, create alarms and notifications for your metrics to help you proactively respond to
performance-related issues.
• Create alarms for custom metrics using Amazon CloudWatch anomaly detection
• Create metrics and alarms for specific web pages with Amazon CloudWatch RUM
• OpenTelemetry Collector
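As a minimal sketch of the custom-metric and alarm steps in the list above (using Boto3), the following publishes an application latency metric and creates an alarm on it; the namespace, metric name, dimensions, and threshold are hypothetical examples.

```python
# Minimal sketch: publish a custom application metric and alarm on it with Boto3.
# The namespace, metric name, dimensions, and threshold are hypothetical examples.
import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish a latency measurement recorded by the application.
cloudwatch.put_metric_data(
    Namespace="MyApp",                      # hypothetical namespace
    MetricData=[{
        "MetricName": "CheckoutLatency",
        "Dimensions": [{"Name": "Service", "Value": "checkout"}],
        "Value": 412.0,
        "Unit": "Milliseconds",
    }],
)

# Alarm when average latency breaches the example threshold for three periods.
cloudwatch.put_metric_alarm(
    AlarmName="checkout-latency-high",
    Namespace="MyApp",
    MetricName="CheckoutLatency",
    Dimensions=[{"Name": "Service", "Value": "checkout"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    Threshold=500.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    # AlarmActions=["arn:aws:sns:..."],  # add an SNS topic ARN for notifications
)
```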
Configure and right-size compute resources to match your workload’s performance requirements
and avoid under- or over-utilized resources.
Common anti-patterns:
Benefits of establishing this best practice: Right-sizing compute resources ensures optimal
operation in the cloud by avoiding over-provisioning and under-provisioning resources. Properly
sizing compute resources typically results in better performance and enhanced customer
experience, while also lowering cost.
Implementation guidance
Right-sizing allows organizations to operate their cloud infrastructure in an efficient and cost-
effective manner while addressing their business needs. Over-provisioning cloud resources can lead
to extra costs, while under-provisioning can result in poor performance and a negative customer
experience. AWS provides tools such as AWS Compute Optimizer and AWS Trusted Advisor that use
historical data to provide recommendations to right-size your compute resources.
Implementation steps
Related examples:
Use the elasticity of the cloud to scale your compute resources up or down dynamically to match
your needs and avoid over- or under-provisioning capacity for your workload.
Common anti-patterns:
Benefits of establishing this best practice: Configuring and testing the elasticity of compute
resources can help you save money, maintain performance benchmarks, and improve reliability as
traffic changes.
Implementation guidance
AWS provides the flexibility to scale your resources up or down dynamically through a variety of
scaling mechanisms in order to meet changes in demand. Combined with compute-related metrics,
dynamic scaling allows a workload to automatically respond to changes and use the optimal set of
compute resources to achieve its goal.
You can use a number of different approaches to match supply of resources with demand.
• Target-tracking approach: Monitor your scaling metric and automatically increase or decrease
capacity as you need it (a minimal sketch follows this list).
• Predictive scaling: Scale in anticipation of daily and weekly trends.
• Schedule-based approach: Set your own scaling schedule according to predictable load changes.
• Verify that workload deployments can handle both scaling events (up and down). As an example,
you can use Activity history to verify a scaling activity for an Auto Scaling group.
• Evaluate your workload for predictable patterns and proactively scale as you anticipate predicted
and planned changes in demand. With predictive scaling, you can eliminate the need to
overprovision capacity. For more detail, see Predictive Scaling with Amazon EC2 Auto Scaling.
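As a minimal sketch of the target-tracking approach in the list above (using Boto3), the following attaches a CPU-based target-tracking policy to an Auto Scaling group; the group name and target value are hypothetical examples.

```python
# Minimal sketch: attach a target-tracking scaling policy to an Auto Scaling group
# with Boto3. The group name and target value are hypothetical examples.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",          # hypothetical Auto Scaling group
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 50.0,                 # keep average CPU near 50%
    },
)
```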
Resources
Related documents:
Related videos:
• AWS re:Invent 2023 – AWS Graviton: The best price performance for your AWS workloads
• AWS re:Invent 2023 – New Amazon EC2 generative AI capabilities in AWS Management Console
• AWS re:Invent 2023 – What’s new with Amazon EC2
• AWS re:Invent 2023 – Smart savings: Amazon EC2 cost-optimization strategies
• AWS re:Invent 2021 – Powering next-gen Amazon EC2: Deep dive on the Nitro System
• AWS re:Invent 2019 – Amazon EC2 foundations
Related examples:
• Optimize the code, network operation, and settings of hardware accelerators to make sure that
underlying hardware is fully utilized.
• Optimize GPU settings
• Optimizing I/O for GPU performance tuning of deep learning training in Amazon SageMaker
Resources
Related documents:
• Accelerated Computing
• How do I choose the appropriate Amazon EC2 instance type for my workload?
• Choose the best AI accelerator and model compilation for computer vision inference with
Amazon SageMaker
Related videos:
• AWS re:Invent 2021 - How to select Amazon Elastic Compute Cloud GPU instances for deep
learning
• AWS re:Invent 2022 - [NEW LAUNCH!] Introducing AWS Inferentia2-based Amazon EC2 Inf2
instances
• AWS re:Invent 2022 - Accelerate deep learning and innovate faster with AWS Trainium
• AWS re:Invent 2022 - Deep learning on AWS with NVIDIA: From training to deployment
Common anti-patterns:
• You stick to one data store because there is internal experience and knowledge of one particular
type of database solution.
• You assume that all workloads have similar data storage and access requirements.
• You have not implemented a data catalog to inventory your data assets.
Benefits of establishing this best practice: Understanding data characteristics and requirements
allows you to determine the most efficient and performant storage technology appropriate for
your workload needs.
Implementation guidance
When selecting and implementing data storage, make sure that the querying, scaling, and storage
characteristics support the workload data requirements. AWS provides numerous data storage
and database technologies including block storage, object storage, streaming storage, file system,
relational, key-value, document, in-memory, graph, time series, and ledger databases. Each data
management solution has options and configurations available to you to support your use-cases
and data models. By understanding data characteristics and requirements, you can break away
from monolithic storage technology and restrictive, one-size-fits-all approaches to focus on
managing data appropriately.
Implementation steps
• Conduct an inventory of the various data types that exist in your workload.
• Understand and document data characteristics and requirements, including:
• Data type (unstructured, semi-structured, relational)
• Data volume and growth
• Data durability: persistent, ephemeral, transient
• ACID (atomicity, consistency, isolation, durability) requirements
• Data access patterns (read-heavy or write-heavy)
• Latency
• Throughput
• IOPS (input/output operations per second)
How will the storage requirements change over time? How does this impact scalability?
• Serverless databases such as DynamoDB and Amazon Quantum Ledger Database (Amazon QLDB) will scale dynamically.
• Relational databases have upper bounds on provisioned storage, and often must be horizontally partitioned using mechanisms such as sharding once they reach these limits.
What is the proportion of read queries in relation to write queries? Would caching be likely to improve performance?
• Read-heavy workloads can benefit from a caching layer, like ElastiCache or DAX if the database is DynamoDB.
• Reads can also be offloaded to read replicas with relational databases such as Amazon RDS.
What is the operational expectation for the database? Is moving to managed services a primary concern?
• Leveraging Amazon RDS instead of Amazon EC2, and DynamoDB or Amazon DocumentDB instead of self-hosting a NoSQL database, can reduce operational overhead.
Resources
Related documents:
Understand and evaluate the various features and configuration options available for your data
stores to optimize storage space and performance for your workload.
Common anti-patterns:
• You only use one storage type, such as Amazon EBS, for all workloads.
• You use provisioned IOPS for all workloads without real-world testing against all storage tiers.
• You are not aware of the configuration options of your chosen data management solution.
• You rely solely on increasing instance size without looking at other available configuration
options.
• You are not testing the scaling characteristics of your data store.
Benefits of establishing this best practice: By exploring and experimenting with the data store
configurations, you may be able to reduce the cost of infrastructure, improve performance, and
lower the effort required to maintain your workloads.
Implementation guidance
A workload could have one or more data stores used based on data storage and access
requirements. To optimize your performance efficiency and cost, you must evaluate data access
patterns to determine the appropriate data store configurations. While you explore data store
options, take into consideration various aspects such as the storage options, memory, compute,
read replica, consistency requirements, connection pooling, and caching options. Experiment with
these various configuration options to improve performance efficiency metrics.
Implementation steps
• Understand the current configurations (like instance type, storage size, or database engine
version) of your data store.
• Review AWS documentation and best practices to learn about recommended configuration
options that can help improve the performance of your data store. Key data store options to
consider are the following:
Scaling writes (like partition key sharding or introducing a queue)
• For relational databases, you can increase the size of the instance to accommodate an increased workload or increase the provisioned IOPS to allow for an increased throughput to the underlying storage.
• You can also introduce a queue in front of your database rather than writing directly to the database. This pattern allows you to decouple the ingestion from the database and control the flow rate so the database does not get overwhelmed.
• Batching your write requests rather than creating many short-lived transactions can help improve throughput in high-write volume relational databases.
• Serverless databases like DynamoDB can scale the write throughput automatically or by adjusting the provisioned write capacity units (WCU), depending on the capacity mode.
• You can still run into issues with hot partitions when you reach the throughput limits for a given partition key. This can be mitigated by choosing a more evenly distributed partition key or by write-sharding the partition key.
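As a minimal sketch of the write-sharding technique described above (using Boto3), the following appends a random suffix to a hot partition key so that writes spread across multiple DynamoDB partitions. The table name, attribute names, and shard count are hypothetical, and readers must query all suffixes and merge the results.

```python
# Minimal sketch of write sharding for DynamoDB: append a random suffix to a hot
# partition key so writes spread across several partitions. Table and attribute
# names are hypothetical placeholders.
import random
import boto3

NUM_SHARDS = 10
table = boto3.resource("dynamodb").Table("events")   # hypothetical table

def put_event(device_id: str, timestamp: str, payload: dict) -> None:
    shard = random.randint(0, NUM_SHARDS - 1)
    table.put_item(Item={
        "pk": f"{device_id}#{shard}",   # sharded partition key
        "sk": timestamp,                # sort key
        **payload,
    })
```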
Related videos:
• AWS re:Invent 2023: Improve Amazon Elastic Block Store efficiency and be more cost-efficient
• AWS re:Invent 2023: Optimize storage price and performance with Amazon Simple Storage
Service
• AWS re:Invent 2023: Building and optimizing a data lake on Amazon Simple Storage Service
• AWS re:Invent 2023: What's new with AWS file storage
• AWS re:Invent 2023: Dive deep into Amazon DynamoDB
Related examples:
disk storage, disk I/O, cache hit ratio, and network inbound and outbound metrics, while the data
store metrics might include transactions per second, top queries, average query rates, response
times, index usage, table locks, query timeouts, and number of connections open. This data is
crucial to understand how the workload is performing and how the data management solution is
used. Use these metrics as part of a data-driven approach to tune and optimize your workload's
resources.
Use tools, libraries, and systems that record performance measurements related to database
performance.
Implementation steps
• Identify the key performance metrics for your data store to track.
• Use an approved logging and monitoring solution to collect these metrics. Amazon CloudWatch
can collect metrics across the resources in your architecture. You can also collect and publish
custom metrics to surface business or derived metrics. Use CloudWatch or third-party solutions
to set alarms that indicate when thresholds are breached.
• Check if data store monitoring can benefit from a machine learning solution that detects
performance anomalies.
• Amazon DevOps Guru for Amazon RDS provides visibility into performance issues and makes
recommendations for corrective actions.
Implement strategies to optimize data and improve data queries to enable more scalability and
efficient performance for your workload.
Common anti-patterns:
Benefits of establishing this best practice: Optimizing data and query performance results in
more efficiency, lower cost, and improved user experience.
Implementation guidance
Data optimization and query tuning are critical aspects of performance efficiency in a data store,
as they impact the performance and responsiveness of the entire cloud workload. Unoptimized
queries can result in greater resource usage and bottlenecks, which reduce the overall efficiency of
a data store.
Data optimization includes several techniques to ensure efficient data storage and access, which
also helps to improve query performance in a data store. Key strategies include data partitioning,
data compression, and data denormalization, which optimize data for both storage and access.
Implementation steps
• Understand and analyze the critical data queries which are performed in your data store.
• Identify the slow-running queries in your data store and use query plans to understand their
current state.
• Analyzing the query plan in Amazon Redshift
Related examples:
Implement access patterns that can benefit from caching data for fast retrieval of frequently
accessed data.
Common anti-patterns:
Benefits of establishing this best practice: Storing data in a cache can improve read latency, read
throughput, user experience, and overall efficiency, as well as reduce costs.
Implementation guidance
A cache is a software or hardware component aimed at storing data so that future requests for the
same data can be served faster or more efficiently. The data stored in a cache can be reconstructed
if lost by repeating an earlier calculation or fetching it from another data store.
Data caching can be one of the most effective strategies to improve your overall application
performance and reduce burden on your underlying primary data sources. Data can be cached
at multiple levels in the application, such as within the application making remote calls, known
as client-side caching, or by using a fast secondary service for storing the data, known as remote
caching.
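As a minimal sketch of the remote caching (cache-aside) pattern described above, the following uses the redis-py client against an ElastiCache (Redis OSS) endpoint; the endpoint, key format, TTL, and the primary-store lookup are hypothetical placeholders.

```python
# Minimal sketch of the cache-aside pattern against an ElastiCache (Redis OSS)
# endpoint using redis-py. The endpoint, key format, TTL, and the primary-store
# lookup below are hypothetical placeholders.
import json
import redis

cache = redis.Redis(host="my-cache.example.cache.amazonaws.com", port=6379)

def load_product_from_db(product_id: str) -> dict:
    # Placeholder for a query against the primary data store (for example, Amazon RDS).
    return {"id": product_id, "name": "example"}

def get_product(product_id: str) -> dict:
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:                      # cache hit: skip the primary data store
        return json.loads(cached)

    product = load_product_from_db(product_id)  # cache miss: read from the primary store
    cache.set(key, json.dumps(product), ex=300) # store with a 5-minute TTL
    return product
```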
Client-side caching
• Enable features such as automatic connection retries, exponential backoff, client-side timeouts,
and connection pooling in the client, if available, as they can improve performance and
reliability.
• Best practices: Redis clients and Amazon ElastiCache (Redis OSS)
• Monitor cache hit rate with a goal of 80% or higher. Lower values may indicate insufficient cache
size or an access pattern that does not benefit from caching.
• Which metrics should I monitor?
• Best practices for monitoring Redis workloads on Amazon ElastiCache
• Monitoring best practices with Amazon ElastiCache (Redis OSS) using Amazon CloudWatch
• Implement data replication to offload reads to multiple instances and improve data read
performance and availability.
Resources
Related documents:
Related videos:
Related examples:
• You use on-premises concepts and strategies for networking solutions in the cloud.
Benefits of establishing this best practice: Understanding how networking impacts workload
performance helps you identify potential bottlenecks, improve user experience, increase reliability,
and lower operational maintenance as the workload changes.
Implementation guidance
The network is responsible for the connectivity between application components, cloud services,
edge networks, and on-premises data, and therefore it can heavily impact workload performance.
In addition to workload performance, user experience can be also impacted by network latency,
bandwidth, protocols, location, network congestion, jitter, throughput, and routing rules.
Have a documented list of networking requirements from the workload including latency, packet
size, routing rules, protocols, and supporting traffic patterns. Review the available networking
solutions and identify which service meets your workload networking characteristics. Cloud-based
networks can be quickly rebuilt, so evolving your network architecture over time is necessary to
improve performance efficiency.
Implementation steps
• Define and document networking performance requirements, including metrics such as network
latency, bandwidth, protocols, locations, traffic patterns (spikes and frequency), throughput,
encryption, inspection, and routing rules.
• Learn about key AWS networking services like VPCs, AWS Direct Connect, Elastic Load Balancing
(ELB), and Amazon Route 53.
• Capture the following key networking characteristics:
Related videos:
Related examples:
Evaluate networking features in the cloud that may increase performance. Measure the impact of
these features through testing, metrics, and analysis. For example, take advantage of network-level
features that are available to reduce latency, network distance, or jitter.
• Use an existing configuration management database (CMDB) tool or a service such as AWS
Config to create an inventory of your workload and how it’s configured.
• If this is an existing workload, identify and document the benchmark for your performance
metrics, focusing on the bottlenecks and areas to improve. Performance-related networking
metrics will differ per workload based on business requirements and workload characteristics.
As a start, these metrics might be important to review for your workload: bandwidth, latency,
packet loss, jitter, and retransmits.
• If this is a new workload, perform load tests to identify performance bottlenecks.
• For the performance bottlenecks you identify, review the configuration options for your
solutions to identify performance improvement opportunities. Check out the following key
networking options and features:
Related videos:
• AWS re:Invent 2023 – Ready for what's next? Designing networks for growth and flexibility
• AWS re:Invent 2023 – Advanced VPC designs and new capabilities
• AWS re:Invent 2023 – A developer's guide to cloud networking
• AWS re:Invent 2022 – Dive deep on AWS networking infrastructure
• AWS re:Invent 2019 – Connectivity to AWS and hybrid AWS network architectures
• AWS re:Invent 2018 – Optimizing Network Performance for Amazon EC2 Instances
• AWS Global Accelerator
Related examples:
When hybrid connectivity is required to connect on-premises and cloud resources, provision
adequate bandwidth to meet your performance requirements. Estimate the bandwidth and latency
requirements for your hybrid workload. These numbers will drive your sizing requirements.
Common anti-patterns:
• AWS Direct Connect provides dedicated connectivity to the AWS environment, from 50 Mbps
up to 100 Gbps, using either dedicated connections or hosted connections. This gives you
managed and controlled latency and provisioned bandwidth so your workload can connect
efficiently to other environments. Using AWS Direct Connect partners, you can have end-to-
end connectivity from multiple environments, providing an extended network with consistent
performance. AWS supports scaling Direct Connect connection bandwidth using native 100
Gbps connections, link aggregation groups (LAG), or BGP equal-cost multipath (ECMP).
• The AWS Site-to-Site VPN provides a managed VPN service supporting internet protocol
security (IPsec). When a VPN connection is created, each VPN connection includes two tunnels
for high availability.
• If you decide to use AWS Direct Connect, select the appropriate bandwidth for your
connectivity.
• If you are using an AWS Site-to-Site VPN across multiple locations to connect to an AWS
Region, use an accelerated Site-to-Site VPN connection for the opportunity to improve
network performance.
• If your network design consists of IPSec VPN connection over AWS Direct Connect, consider
using Private IP VPN to improve security and achieve segmentation. AWS Site-to-Site Private
IP VPN is deployed on top of a transit virtual interface (VIF).
• AWS Direct Connect SiteLink allows creating low-latency and redundant connections between
your data centers worldwide by sending data over the fastest path between AWS Direct
Connect locations, bypassing AWS Regions.
• Validate your connectivity setup before deploying to production. Perform security and
performance testing to ensure it meets your bandwidth, reliability, latency, and compliance
requirements.
• Regularly monitor your connectivity performance and usage and optimize if required.
Distribute traffic across multiple resources or services to allow your workload to take advantage
of the elasticity that the cloud provides. You can also use load balancing to offload encryption
termination, improve performance and reliability, and manage and route traffic effectively.
Common anti-patterns:
• You don’t consider your workload requirements when choosing the load balancer type.
• You don’t leverage the load balancer features for performance optimization.
• The workload is exposed directly to the internet without a load balancer.
• You route all internet traffic through existing load balancers.
• You use generic TCP load balancing and make each compute node handle SSL encryption.
Benefits of establishing this best practice: A load balancer handles the varying load of your
application traffic in a single Availability Zone or across multiple Availability Zones and enables
high availability, automatic scaling, and better utilization for your workload.
Implementation guidance
Load balancers act as the entry point for your workload, from which point they distribute the
traffic to your backend targets, such as compute instances or containers, to improve utilization.
Choosing the right load balancer type is the first step to optimize your architecture. Start by listing
your workload characteristics, such as protocol (like TCP, HTTP, TLS, or WebSockets), target type
(like instances, containers, or serverless), application requirements (like long running connections,
user authentication, or stickiness), and placement (like Region, Local Zone, Outpost, or zonal
isolation).
AWS provides multiple models for your applications to use load balancing. Application Load
Balancer is best suited for load balancing of HTTP and HTTPS traffic and provides advanced
request routing targeted at the delivery of modern application architectures, including
microservices and containers.
Implementation steps
• Define your load balancing requirements including traffic volume, availability and application
scalability.
• Use Network Load Balancer for non-HTTP workloads that run on TCP or UDP.
• Use a combination of both (ALB as a target of NLB) if you want to leverage features of both
products. For example, you can do this if you want to use the static IPs of NLB together with
HTTP header based routing from ALB, or if you want to expose your HTTP workload to an AWS
PrivateLink.
• Configure HTTPS/TLS listeners with both Application Load Balancer and Network Load
Balancer integrated with AWS Certificate Manager.
• Note that some workloads may require end-to-end encryption for compliance reasons. In this
case, it is a requirement to allow encryption at the targets.
• Least outstanding requests: Use to achieve a better load distribution to your backend targets
for cases when the requests for your application vary in complexity or your targets vary in
processing capability.
• Round robin: Use when the requests and targets are similar, or if you need to distribute
requests equally among targets.
• Turn off cross-zone load balancing (zonal isolation) for latency improvements and isolated zonal
failure domains. Cross-zone load balancing is turned off by default on Network Load Balancer,
and on Application Load Balancer you can turn it off per target group (a minimal sketch follows
this list).
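As a minimal sketch of the HTTPS listener and cross-zone settings from the steps above (using Boto3), the following creates an HTTPS listener backed by an AWS Certificate Manager certificate and turns cross-zone load balancing off at the target group level; all ARNs are hypothetical placeholders.

```python
# Minimal sketch: create an HTTPS listener backed by an ACM certificate and turn
# cross-zone load balancing off for a target group with Boto3. All ARNs are
# hypothetical placeholders for your own resources.
import boto3

elbv2 = boto3.client("elbv2")

elbv2.create_listener(
    LoadBalancerArn="arn:aws:elasticloadbalancing:region:acct:loadbalancer/app/my-alb/abc",  # placeholder
    Protocol="HTTPS",
    Port=443,
    Certificates=[{"CertificateArn": "arn:aws:acm:region:acct:certificate/example"}],        # placeholder
    DefaultActions=[{
        "Type": "forward",
        "TargetGroupArn": "arn:aws:elasticloadbalancing:region:acct:targetgroup/my-tg/abc",  # placeholder
    }],
)

# Zonal isolation: disable cross-zone load balancing at the target group level.
elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:region:acct:targetgroup/my-tg/abc",          # placeholder
    Attributes=[{"Key": "load_balancing.cross_zone.enabled", "Value": "false"}],
)
```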
Related videos:
• AWS re:Invent 2018: Elastic Load Balancing: Deep Dive and Best Practices
• AWS re:Invent 2021 - How to choose the right load balancer for your AWS workloads
• AWS re:Invent 2019: Get the most from Elastic Load Balancing for different workloads
Related examples:
Make decisions about protocols for communication between systems and networks based on the
impact to the workload’s performance.
Latency and bandwidth together determine achievable throughput. If your file transfer
is using Transmission Control Protocol (TCP), higher latencies will most likely reduce overall
throughput. There are approaches to fix this with TCP tuning and optimized transfer protocols, but
one solution is to use User Datagram Protocol (UDP).
Common anti-patterns:
Benefits of establishing this best practice: Verifying that an appropriate protocol is used for
communication between users and workload components helps improve overall user experience
for your applications. For instance, connection-less UDP allows for high speed, but it doesn't offer
retransmission or high reliability. TCP is a full featured protocol, but it requires greater overhead
for processing the packets.
Implementation guidance
If you have the ability to choose different protocols for your application and you have expertise in
this area, optimize your application and end-user experience by using a different protocol. Note
that this approach comes with significant difficulty and should only be attempted if you have
optimized your application in other ways first.
lower latency between your client devices and your workload on AWS. With AWS Transfer
Family, you can use TCP-based protocols such as Secure Shell File Transfer Protocol (SFTP) and
File Transfer Protocol over SSL (FTPS) to securely scale and manage your file transfers to AWS
storage services.
• Use network latency to determine if TCP is appropriate for communication between workload
components. If the network latency between your client application and server is high, then
the TCP three-way handshake can take some time, thereby impacting the responsiveness
of your application. Metrics such as time to first byte (TTFB) and round-trip time (RTT) can be
used to measure network latency. If your workload serves dynamic content to users, consider
using Amazon CloudFront, which establishes a persistent connection to each origin for dynamic
content to remove the connection setup time that would otherwise slow down each client
request.
• Using TLS with TCP or UDP can result in increased latency and reduced throughput for your
workload due to the impact of encryption and decryption. For such workloads, consider SSL/
TLS offloading on Elastic Load Balancing to improve workload performance by allowing the
load balancer to handle SSL/TLS encryption and decryption process instead of having backend
instances do it. This can help reduce the CPU utilization on the backend instances, which can
improve performance and increase capacity.
• Use the Network Load Balancer (NLB) to deploy services that rely on the UDP protocol, such
as authentication and authorization, logging, DNS, IoT, and streaming media, to improve the
performance and reliability of your workload. The NLB distributes incoming UDP traffic across
multiple targets, allowing you to scale your workload horizontally, increase capacity, and reduce
the overhead of a single target.
• For your High Performance Computing (HPC) workloads, consider using the Elastic Network
Adapter (ENA) Express functionality that uses the SRD protocol to improve network performance
by providing a higher single flow bandwidth (25 Gbps) and lower tail latency (99.9 percentile) for
network traffic between EC2 instances.
• Use the Application Load Balancer (ALB) to route and load balance your gRPC (Remote Procedure
Calls) traffic between workload components or between gRPC clients and services. gRPC uses the
TCP-based HTTP/2 protocol for transport and it provides performance benefits such as lighter
network footprint, compression, efficient binary serialization, support for numerous languages,
and bi-directional streaming.
Resources
Related documents:
Implementation guidance
Resources, such as Amazon EC2 instances, are placed into Availability Zones within AWS Regions,
AWS Local Zones, AWS Outposts, or AWS Wavelength zones. Selection of this location influences
network latency and throughput from a given user location. Edge services like Amazon CloudFront
and AWS Global Accelerator can also be used to improve network performance by either caching
content at edge locations or providing users with an optimal path to the workload through the
AWS global network.
Amazon EC2 provides placement groups for networking. A placement group is a logical grouping
of instances to decrease latency. Using placement groups with supported instance types and an
Elastic Network Adapter (ENA) enables workloads to participate in a low-latency, reduced jitter 25
Gbps network. Placement groups are recommended for workloads that benefit from low network
latency, high network throughput, or both.
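As a minimal sketch of using placement groups (with Boto3), the following creates a cluster placement group and launches instances into it; the AMI ID and instance type are hypothetical, and you should choose an ENA-capable instance type that supports cluster placement.

```python
# Minimal sketch: create a cluster placement group and launch instances into it with
# Boto3. The AMI ID and instance type are hypothetical placeholders; use a current
# AMI and an ENA-capable instance type that supports cluster placement.
import boto3

ec2 = boto3.client("ec2")

ec2.create_placement_group(GroupName="low-latency-pg", Strategy="cluster")

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",     # placeholder AMI
    InstanceType="c5n.9xlarge",          # example ENA-capable instance type
    MinCount=2,
    MaxCount=2,
    Placement={"GroupName": "low-latency-pg"},
)
```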
Latency-sensitive services are delivered at edge locations using AWS global network, such as
Amazon CloudFront. These edge locations commonly provide services like content delivery
network (CDN) and domain name system (DNS). By having these services at the edge, workloads
can respond with low latency to requests for content or DNS resolution. These services also provide
geographic services, such as geotargeting of content (providing different content based on the
end users’ location) or latency-based routing to direct end users to the nearest Region (minimum
latency).
Use edge services to reduce latency and to enable content caching. Configure cache control
correctly for both DNS and HTTP/HTTPS to gain the most benefit from these approaches.
Implementation steps
• Capture information about the IP traffic going to and from network interfaces.
• Logging IP traffic using VPC Flow Logs
• How the client IP address is preserved in AWS Global Accelerator
• Analyze network access patterns in your workload to identify how users use your application.
• Use monitoring tools, such as Amazon CloudWatch and AWS CloudTrail, to gather data on
network activities.
• Analyze the data to identify the network access pattern.
• Select Regions for your workload deployment based on the following key elements:
Amazon CloudFront Functions: Use for simple use cases like HTTP(s) requests or response
manipulations that can be initiated by short-lived functions.
• Some applications require fixed entry points or higher performance by reducing first byte latency
and jitter, and increasing throughput. These applications can benefit from networking services
that provide static anycast IP addresses and TCP termination at edge locations. AWS Global
Accelerator can improve performance for your applications by up to 60% and provide quick
failover for multi-region architectures. AWS Global Accelerator provides you with static anycast
IP addresses that serve as a fixed entry point for your applications hosted in one or more AWS
Regions. These IP addresses permit traffic to ingress onto the AWS global network as close to
your users as possible. AWS Global Accelerator reduces the initial connection setup time by
establishing a TCP connection between the client and the AWS edge location closest to the
client. Review the use of AWS Global Accelerator to improve the performance of your TCP/UDP
workloads and provide quick failover for multi-Region architectures.
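As a minimal sketch of setting up AWS Global Accelerator (with Boto3), the following creates an accelerator and a TCP listener; the name and ports are example values, and the Global Accelerator API is called in the us-west-2 Region.

```python
# Minimal sketch: create an accelerator and a TCP listener with Boto3. The Global
# Accelerator API is served from the us-west-2 Region; the name and ports are
# example values only.
import boto3

ga = boto3.client("globalaccelerator", region_name="us-west-2")

accelerator = ga.create_accelerator(
    Name="my-app-accelerator",     # hypothetical name
    IpAddressType="IPV4",
    Enabled=True,
)

ga.create_listener(
    AcceleratorArn=accelerator["Accelerator"]["AcceleratorArn"],
    Protocol="TCP",
    PortRanges=[{"FromPort": 443, "ToPort": 443}],
)
```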
Resources
• SUS01-BP01 Choose Region based on both business requirements and sustainability goals
Related documents:
Use collected and analyzed data to make informed decisions about optimizing your network
configuration.
Common anti-patterns:
Benefits of establishing this best practice: Collecting necessary metrics of your AWS network
and implementing network monitoring tools allows you to understand network performance and
optimize network configurations.
Implementation guidance
Monitoring traffic to and from VPCs, subnets, or network interfaces is crucial to understand how to
utilize AWS network resources and optimize network configurations. By using the following AWS
networking tools, you can further inspect information about the traffic usage, network access and
logs.
Implementation steps
• Identify the key performance metrics such as latency or packet loss to collect. AWS provides
several tools that can help you to collect these metrics. By using the following tools, you can
further inspect information about the traffic usage, network access, and logs:
Amazon VPC IP Address Manager (IPAM): Use IPAM to plan, track, and monitor IP addresses for
your AWS and on-premises workloads. This is a best practice to optimize IP address usage and
allocation.
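As a minimal sketch of capturing traffic information with VPC Flow Logs (using Boto3), the following enables flow logs for a VPC with delivery to CloudWatch Logs; the VPC ID, log group, and IAM role are hypothetical placeholders.

```python
# Minimal sketch: enable VPC Flow Logs for a VPC, delivered to CloudWatch Logs, with
# Boto3. The VPC ID, log group name, and IAM role ARN are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2")

ec2.create_flow_logs(
    ResourceType="VPC",
    ResourceIds=["vpc-0123456789abcdef0"],                     # placeholder VPC ID
    TrafficType="ALL",                                         # ACCEPT, REJECT, or ALL
    LogDestinationType="cloud-watch-logs",
    LogGroupName="/vpc/flow-logs/my-workload",                 # placeholder log group
    DeliverLogsPermissionArn="arn:aws:iam::111122223333:role/flow-logs-role",  # placeholder role
)
```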
• Optimize performance and reduce costs for network analytics with VPC Flow Logs in Apache
Parquet format
• Monitoring your global and core networks with Amazon CloudWatch metrics
• Continuously monitor network traffic and resources
Related videos:
Related examples:
When architecting workloads, there are principles and practices that you can adopt to help you
run efficient, high-performing cloud workloads. To adopt a culture that fosters performance
efficiency of cloud workloads, consider these key principles and practices:
might use page load time as an indication of overall performance. This metric would be one of
multiple data points that measures user experience. In addition to identifying the page load time
thresholds, you should document the expected outcome or business risk if ideal performance is not
met. A long page load time affects your end users directly, decreases their user experience rating,
and can lead to a loss of customers. When you define your KPI thresholds, combine both industry
benchmarks and your end user expectations. For example, if the current industry benchmark is a
webpage loading within a two-second time period, but your end users expect a webpage to load
within a one-second time period, then you should take both of these data points into consideration
when establishing the KPI.
Your team must evaluate your workload KPIs using real-time granular data and historical data for
reference and create dashboards that perform metric math on your KPI data to derive operational
and utilization insights. KPIs should be documented and include thresholds that support business
goals and strategies, and should be mapped to metrics being monitored. KPIs should be revisited
when business goals, strategies, or end user requirements change.
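As a minimal sketch of deriving a KPI with CloudWatch metric math (using Boto3), the following computes an error-rate percentage from two underlying load balancer metrics; the namespace, metric names, and dimension value are hypothetical examples.

```python
# Minimal sketch: use CloudWatch metric math to derive a KPI (error rate as a
# percentage of requests) from two underlying metrics. The namespace, metric names,
# and load balancer dimension value are hypothetical examples.
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")

response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {"Id": "errors", "MetricStat": {
            "Metric": {"Namespace": "AWS/ApplicationELB",
                       "MetricName": "HTTPCode_Target_5XX_Count",
                       "Dimensions": [{"Name": "LoadBalancer", "Value": "app/my-alb/123"}]},
            "Period": 300, "Stat": "Sum"}, "ReturnData": False},
        {"Id": "requests", "MetricStat": {
            "Metric": {"Namespace": "AWS/ApplicationELB",
                       "MetricName": "RequestCount",
                       "Dimensions": [{"Name": "LoadBalancer", "Value": "app/my-alb/123"}]},
            "Period": 300, "Stat": "Sum"}, "ReturnData": False},
        # Metric math expression: the derived KPI is returned as a single series.
        {"Id": "error_rate", "Expression": "100 * errors / requests",
         "Label": "ErrorRatePercent"},
    ],
    StartTime=datetime.datetime.utcnow() - datetime.timedelta(hours=3),
    EndTime=datetime.datetime.utcnow(),
)
print(response["MetricDataResults"])
```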
Implementation steps
• Identify stakeholders: Identify and document key business stakeholders, including development
and operation teams.
• Define objectives: Work with these stakeholders to define and document objectives of your
workload. Consider the critical performance aspects of your workloads, such as throughput,
response time, and cost, as well as business goals, such as user satisfaction.
• Review industry best practices: Review industry best practices to identify relevant KPIs aligned
with your workload objectives.
• Identify metrics: Identify metrics that are aligned with your workload objectives and can help
you measure performance and business goals. Establish KPIs based on these metrics. Example
metrics are measurements like average response time or number of concurrent users.
• Define and document KPIs: Use industry best practices and your workload objectives to set
targets for your workload KPI. Use this information to set KPI thresholds for severity or alarm
level. Identify and document the risk and impact if a KPI is not met.
• Implement monitoring: Use monitoring tools such as Amazon CloudWatch or AWS Config to
collect metrics and measure KPIs.
• Visually communicate KPIs: Use dashboard tools like Amazon QuickSight to visualize and
communicate KPIs with stakeholders.
PERF05-BP02 Use monitoring solutions to understand the areas where performance is most
critical
Understand and identify areas where increasing the performance of your workload will have a
positive impact on efficiency or customer experience. For example, a website that has a large
amount of customer interaction can benefit from using edge services to move content delivery
closer to customers.
Common anti-patterns:
• You assume that standard compute metrics such as CPU utilization or memory pressure are
enough to catch performance issues.
• You only use the default metrics recorded by your selected monitoring software.
Benefits of establishing this best practice: Understanding critical areas of performance helps
workload owners monitor KPIs and prioritize high-impact improvements.
Implementation guidance
Set up end-to-end tracing to identify traffic patterns, latency, and critical performance areas.
Monitor your data access patterns for slow queries or poorly fragmented and partitioned data.
Identify the constrained areas of the workload using load testing or monitoring.
Increase performance efficiency by understanding your architecture, traffic patterns, and data
access patterns, and identify your latency and processing times. Identify the potential bottlenecks
that might affect the customer experience as the workload grows. After investigating these areas,
look at which solution you could deploy to remove those performance concerns.
Implementation steps
• Set up end-to-end monitoring to capture all workload components and metrics. Here are
examples of monitoring solutions on AWS.
Resources
Related documents:
Related videos:
Related examples:
Define a process to evaluate new services, design patterns, resource types, and configurations as
they become available. For example, run existing performance tests on new instance offerings to
determine their potential to improve your workload.
• Revisit and refine: Regularly review your performance improvement process to identify areas for
enhancement.
Resources
Related documents:
• AWS Blog
• What's New with AWS
• AWS Skill Builder
Related videos:
Related examples:
• AWS Github
Load test your workload to verify it can handle production load and identify any performance
bottlenecks.
Common anti-patterns:
• You load test individual parts of your workload but not your entire workload.
• You load test on infrastructure that is not the same as your production environment.
• You only conduct load testing up to your expected load and not beyond, which prevents you
from foreseeing where you may have future problems.
• You perform load testing without consulting the Amazon EC2 Testing Policy and submitting
a Simulated Event Submissions Form. This results in your test failing to run, as it looks like a
denial-of-service event.
• Continually iterate: Load testing should be performed at a regular cadence, especially after a
system change or update.
Resources
Related documents:
Related videos:
• AWS Summit ANZ 2023: Accelerate with confidence through AWS Distributed Load Testing
• AWS re:Invent 2022 - Scaling on AWS for your first 10 million users
• AWS re:Invent 2021 - Optimize applications through end user insights with Amazon CloudWatch
RUM
Related examples:
Use key performance indicators (KPIs), combined with monitoring and alerting systems, to
proactively address performance-related issues.
Common anti-patterns:
• You only allow operations staff the ability to make operational changes to the workload.
• You let all alarms filter to the operations team with no proactive remediation.
• Review and refine: Regularly assess the effectiveness of the automated remediation workflow.
Adjust initiation events and remediation logic if necessary.
Resources
Related documents:
• CloudWatch Documentation
• X-Ray Documentation
• Build a Cloud Automation Practice for Operational Excellence: Best Practices from AWS Managed
Services
• Automate your Amazon Redshift performance tuning with automatic table optimization
Related videos:
• AWS re:Invent 2023 - Strategies for automated scaling, remediation, and smart self-healing
• AWS re:Invent 2022 - Automating patch management and compliance using AWS
• AWS re:Invent 2022 - How Amazon uses better metrics for improved website performance
• AWS re:Invent 2023 - Take a load off: Diagnose & resolve performance issues with Amazon RDS
• AWS re:Invent 2021 - {New Launch} Automatically detect and resolve issues with Amazon
DevOps Guru
Related examples:
instances are running the software and configurations required by your software policy and
which instances need to be updated.
• Assess the new update: Understand how to update the components of your workload. Take
advantage of agility in the cloud to quickly test how new features can improve your workload to
gain performance efficiency.
• Use automation: Use automation for the update process to reduce the level of effort to deploy
new features and limit errors caused by manual processes.
• You can use CI/CD to automatically update AMIs, container images, and other artifacts related
to your cloud application.
• You can use tools such as AWS Systems Manager Patch Manager to automate the process of
system updates, and schedule the activity using AWS Systems Manager Maintenance Windows.
• Document the process: Document your process for evaluating updates and new services. Provide
your owners the time and space needed to research, test, experiment, and validate updates and
new services. Refer back to the documented business requirements and KPIs to help prioritize
which update will make a positive business impact.
Resources
Related documents:
• AWS Blog
Related videos:
• AWS re:Inforce 2022 - Automating patch management and compliance using AWS
Related examples:
• Identify corrective actions: Use your analysis to identify corrective actions. This may include
parameter tuning, fixing bugs, and scaling resources.
• Document findings: Document your findings, including identified issues, root causes, and
corrective actions.
• Iterate and improve: Continually assess and improve the metrics review process. Use the lesson
learned from previous review to enhance the process over time.
Resources
Related documents:
• CloudWatch Documentation
• Collect metrics and logs from Amazon EC2 Instances and on-premises servers with the
CloudWatch Agent
• X-Ray Documentation
Related videos:
• AWS re:Invent 2022 - How Amazon uses better metrics for improved website performance
• AWS re:Invent 2023 - Take a load off: Diagnose & resolve performance issues with Amazon RDS
Related examples:
• CloudWatch Dashboards
Create a team (Cloud Business Office, Cloud Center of Excellence, or FinOps team) that is
responsible for establishing and maintaining cost awareness across your organization. The owner
of cost optimization can be an individual or a team (requires people from finance, technology, and
business teams) that understands the entire organization and cloud finance.
Implementation guidance
This is the introduction of a Cloud Business Office (CBO) or Cloud Center of Excellence (CCOE)
function or team that is responsible for establishing and maintaining a culture of cost awareness in
cloud computing. This function can be an existing individual, a team within your organization, or a
new team of key finance, technology, and organization stakeholders from across the organization.
The function (individual or team) prioritizes and spends the required percentage of their time on
cost management and cost optimization activities. For a small organization, the function might
spend a smaller percentage of time compared to a full-time function for a larger enterprise.
The function requires a multi-disciplinary approach, with capabilities in project management, data
science, financial analysis, and software or infrastructure development. They can improve workload
efficiency by running cost optimizations under three different ownership models:
• Centralized: Through designated teams such as FinOps team, Cloud Financial Management
(CFM) team, Cloud Business Office (CBO), or Cloud Center of Excellence (CCoE), customers can
design and implement governance mechanisms and drive best practices company-wide.
• Decentralized: Influencing technology teams to run cost optimizations.
• Hybrid: Combination of both centralized and decentralized teams can work together to run cost
optimizations.
The function may be measured against their ability to run and deliver against cost optimization
goals (for example, workload efficiency metrics).
You must secure executive sponsorship for this function, which is a key success factor. The sponsor
is regarded as a champion for cost efficient cloud consumption, and provides escalation support
for the team to ensure that cost optimization activities are treated with the level of priority defined
by the organization. Otherwise, guidance can be ignored and cost saving opportunities will not be
prioritized.
During these regular reviews, you can review workload efficiency (cost) and business outcome.
For example, a 20% cost increase for a workload may align with increased customer usage. In
this case, this 20% cost increase can be interpreted as an investment. These regular cadence calls
can help teams to identify value KPIs that provide meaning to the entire organization.
Resources
Related documents:
Related videos:
Related examples:
Involve finance and technology teams in cost and usage discussions at all stages of your cloud
journey. Teams regularly meet and discuss topics such as organizational goals and targets, current
state of cost and usage, and financial and accounting practices.
Implementation guidance
Technology teams innovate faster in the cloud due to shortened approval, procurement, and
infrastructure deployment cycles. This can be an adjustment for finance organizations previously
used to running time-consuming and resource-intensive processes for procuring and deploying
capital in data center and on-premises environments, and cost allocation only at project approval.
teams to innovate faster – the agility and ability to spin up and then tear down experiments. While
the variable nature of cloud consumption may impact predictability from a capital budgeting
and forecasting perspective, cloud provides organizations with the ability to reduce the cost of
over-provisioning, as well as reduce the opportunity cost associated with conservative under-
provisioning.
Establish a partnership between key finance and technology stakeholders to create a shared
understanding of organizational goals and develop mechanisms to succeed financially in the
variable spend model of cloud computing. Relevant teams within your organization must be
involved in cost and usage discussions at all stages of your cloud journey, including:
• Financial leads: CFOs, financial controllers, financial planners, business analysts, procurement,
sourcing, and accounts payable must understand the cloud model of consumption, purchasing
options, and the monthly invoicing process. Finance needs to partner with technology teams
to create and socialize an IT value story, helping business teams understand how technology
spend is linked to business outcomes. This way, technology expenditures are viewed not as
costs, but rather as investments. Due to the fundamental differences between the cloud (such
as the rate of change in usage, pay as you go pricing, tiered pricing, pricing models, and detailed
billing and usage information) compared to on-premises operation, it is essential that the finance
engagement models and a return on investment (ROI). Typically, third parties will contribute to
reporting and analysis of any workloads that they manage, and they will provide cost analysis of
any workloads that they design.
Implementing CFM and achieving success requires collaboration across finance, technology,
and business teams, and a shift in how cloud spend is communicated and evaluated across
the organization. Include engineering teams so that they can be part of these cost and usage
discussions at all stages, and encourage them to follow best practices and take agreed-upon
actions accordingly.
Implementation steps
• Define key members: Verify that all relevant members of your finance and technology teams
participate in the partnership. Relevant finance members will be those having interaction with
the cloud bill. This will typically be CFOs, financial controllers, financial planners, business
analysts, procurement, and sourcing. Technology members will typically be product and
application owners, technical managers and representatives from all teams that build on the
cloud. Other members may include business unit owners, such as marketing, that will influence
usage of products, and third parties such as consultants, to achieve alignment to your goals and
mechanisms, and to assist with reporting.
• Define topics for discussion: Define the topics that are common across the teams, or will need
a shared understanding. Follow cost from the time it is created until the bill is paid. Note any
members involved, and organizational processes that are required to be applied. Understand
each step or process it goes through and the associated information, such as pricing models
available, tiered pricing, discount models, budgeting, and financial requirements.
• Establish regular cadence: To create a finance and technology partnership, establish a regular
communication cadence to create and maintain alignment. The group needs to come together
regularly to review progress against their goals and metrics. A typical cadence involves reviewing the state of the
organization, reviewing any programs currently running, and reviewing overall financial and
optimization metrics. Then key workloads are reported on in greater detail.
Resources
Related documents:
Identify the business drivers that can impact your usage cost, and forecast for each of them
separately to calculate expected usage in advance. Some of the drivers might be linked to IT
and product teams within the organization. Other business drivers, such as marketing events,
promotions, geographic expansions, mergers, and acquisitions, are known by your sales, marketing,
and business leaders, and it's important to collaborate and account for all those demand drivers as
well.
You can use AWS Cost Explorer for trend-based forecasting in a defined future time range based
on your past spend. AWS Cost Explorer's forecasting engine segments your historical data based on
charge types (for example, Reserved Instances) and uses a combination of machine learning and
rule-based models to predict spend across all charge types individually.
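As a minimal sketch of trend-based forecasting with the Cost Explorer API (using Boto3), the following requests a monthly cost forecast for roughly the next quarter; the date range and metric are example values.

```python
# Minimal sketch: request a trend-based monthly cost forecast from the Cost Explorer
# API with Boto3. The date range is an example; amounts are returned in the account's
# billing currency.
import datetime
import boto3

ce = boto3.client("ce")

today = datetime.date.today()
forecast = ce.get_cost_forecast(
    TimePeriod={
        "Start": today.isoformat(),
        "End": (today + datetime.timedelta(days=90)).isoformat(),
    },
    Metric="UNBLENDED_COST",
    Granularity="MONTHLY",
)
print(forecast["Total"]["Amount"], forecast["Total"]["Unit"])
```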
Once you've established your forecast process and built models, you can use AWS Budgets to set
custom budgets at a granular level by specifying the time period, recurrence, or amount (fixed or
variable) and add filters such as service, AWS Region, and tags. The budget is usually prepared for a
single year and remains fixed, which requires strict adherence from everyone involved. In contrast,
forecasts are more flexible, which allows for readjustments throughout the year and provides
dynamic projections over a period of one, two, or three years. Both budgets and forecasts play
a crucial role when you establish financial expectations among various technology and business
stakeholders. Accurate forecasts and implementation also provide accountability to stakeholders
who are directly responsible for provisioning cost in the first place, and it can also raise their overall
cost awareness.
To stay informed on the performance of your existing budgets, you can create and schedule AWS
Budgets reports to email you and your stakeholders on a regular cadence. You can also create AWS
• Update existing forecast and budget processes: Based on adopted forecast methods such
as trend-based, business driver-based, or a combination of both forecasting methods, define
your forecast and budget processes. Budgets should be calculated, realistic, and based on your
forecasts.
• Configure alerts and notifications: Use AWS Budgets alerts and cost anomaly detection to get
alerts and notifications.
• Perform regular reviews with key stakeholders: For example, align on changes in business
direction and usage with stakeholders in IT, finance, platform teams, and other areas of the
business.
Resources
Related documents:
• Amazon Forecast
• AWS Budgets
Related videos:
Related examples:
Implementation steps
• Identify relevant organizational processes: Each organizational unit reviews their processes
and identifies processes that impact cost and usage. Any processes that result in the creation or
termination of a resource need to be included for review. Look for processes that can support
cost awareness in your business, such as incident management and training.
• Establish a self-sustaining cost-aware culture: Make sure all relevant stakeholders understand
the cause of changes and their cost impact so that they understand cloud cost. This will allow your
organization to establish a self-sustaining cost-aware culture of innovation.
• Update processes with cost awareness: Each process is modified to be made cost aware. The
process may require additional pre-checks, such as assessing the impact of cost, or post-checks
validating that the expected changes in cost and usage occurred. Supporting processes such as
training and incident management can be extended to include items for cost and usage.
To get help, reach out to CFM experts through your Account team, or explore the resources and
related documents below.
Resources
Related documents:
Related examples:
Set up cloud budgets and configure mechanisms to detect anomalies in usage. Configure related
tools for cost and usage alerts against pre-defined targets and receive notifications when any
usage exceeds those targets. Have regular meetings to analyze the cost-effectiveness of your
workloads and promote cost awareness.
commitment, providing insight into estimated savings, Savings Plans coverage, and Savings Plans
utilization. This helps organizations to understand how their Savings Plans apply to each hour of
spend without having to invest time and resources into building models to analyze their spend.
Periodically create reports containing a highlight of Savings Plans, Reserved Instances, and
Amazon EC2 rightsizing recommendations from AWS Cost Explorer to start reducing the cost
associated with steady-state workloads, idle, and underutilized resources. Identify and recoup
spend associated with cloud waste for resources that are deployed. Cloud waste occurs when
incorrectly-sized resources are created or different usage patterns are observed instead of what is
expected. Follow AWS best practices to reduce your waste or ask your account team and partner to
help you to optimize and save your cloud costs.
Generate reports regularly for better purchasing options for your resources to drive down unit
costs for your workloads. Purchasing options such as Savings Plans, Reserved Instances, or
Amazon EC2 Spot Instances offer the deepest cost savings for fault-tolerant workloads and
allow stakeholders (business owners, finance, and tech teams) to be part of these commitment
discussions.
Share the reports that contain opportunities or new release announcements that may help you to
reduce total cost of ownership (TCO) of the cloud. Adopt new services, Regions, features, solutions,
or new ways to achieve further cost reductions.
Implementation steps
• Configure AWS Budgets: Configure AWS Budgets on all accounts for your workload. Set a
budget for the overall account spend, and a budget for the workload by using tags (a minimal
sketch follows this list).
• Well-Architected Labs: Cost and Governance Usage
• Report on cost optimization: Set up a regular cycle to discuss and analyze the efficiency of
the workload. Using the metrics established, report on the metrics achieved and the cost of
achieving them. Identify and fix any negative trends, as well as positive trends that you can
promote across your organization. Reporting should involve representatives from the application
teams and owners, finance, and key decision makers with respect to cloud expenditure.
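As a minimal sketch of the AWS Budgets step above (using Boto3), the following creates a monthly cost budget scoped by a tag, with an alert at 80% of actual spend; the account ID, budget amount, tag filter, and email address are hypothetical examples.

```python
# Minimal sketch: create a monthly cost budget with an 80% actual-spend alert using
# Boto3. The account ID, budget amount, tag filter, and email address are
# hypothetical examples.
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="111122223333",                       # placeholder account ID
    Budget={
        "BudgetName": "workload-monthly-cost",
        "BudgetLimit": {"Amount": "1000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
        # Hypothetical tag-based scope using the Budgets cost filter convention.
        "CostFilters": {"TagKeyValue": ["user:workload$my-workload"]},
    },
    NotificationsWithSubscribers=[{
        "Notification": {
            "NotificationType": "ACTUAL",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": 80.0,
            "ThresholdType": "PERCENTAGE",
        },
        "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "finops@example.com"}],
    }],
)
```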
Resources
Related documents:
• Report on cost optimization: Set up a regular cycle to discuss and analyze the efficiency of
the workload. Using the metrics established, report on the metrics achieved and the cost of
achieving them. Identify and fix any negative trends, and identify positive trends to promote
across your organization. Reporting should involve representatives from the application teams
and owners, finance, and management.
• Create and activate daily granularity AWS Budgets for the cost and usage to take timely
actions to prevent any potential cost overruns: AWS Budgets allow you to configure alert
notifications, so you stay informed if any of your budget types fall out of your pre-configured
thresholds. The best way to leverage AWS Budgets is to set your expected cost and usage as your
limits, so that anything above your budgets can be considered overspend.
• Create AWS Cost Anomaly Detection for cost monitor: AWS Cost Anomaly Detection uses
advanced Machine Learning technology to identify anomalous spend and root causes, so you
can quickly take action. It allows you to configure cost monitors that define spend segments you
want to evaluate (for example, individual AWS services, member accounts, cost allocation tags,
and cost categories), and lets you set when, where, and how you receive your alert notifications.
For each monitor, attach multiple alert subscriptions for business owners and technology
teams, including a name, a cost impact threshold, and alerting frequency (individual alerts, daily
summary, weekly summary) for each subscription.
• Use AWS Cost Explorer or integrate your AWS Cost and Usage Report (CUR) data with Amazon
QuickSight dashboards to visualize your organization’s costs: AWS Cost Explorer has an easy-
to-use interface that lets you visualize, understand, and manage your AWS costs and usage over
time. The Cost Intelligence Dashboard is a customizable and accessible dashboard to help create
the foundation of your own cost management and optimization tool.
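As an illustration of the Cost Anomaly Detection step above, the following Boto3 sketch creates a service-level cost monitor and a daily summary subscription; the monitor name, email addresses, and $1,000 impact threshold are placeholder values, not recommendations.

    import boto3

    ce = boto3.client("ce")  # the Cost Explorer API also hosts Cost Anomaly Detection

    # Monitor AWS service spend for anomalies across the account.
    monitor = ce.create_anomaly_monitor(
        AnomalyMonitor={
            "MonitorName": "service-spend-monitor",
            "MonitorType": "DIMENSIONAL",
            "MonitorDimension": "SERVICE",
        }
    )

    # Alert business and technology owners when anomaly impact exceeds $1,000,
    # summarized daily. (Newer API versions also accept a ThresholdExpression.)
    ce.create_anomaly_subscription(
        AnomalySubscription={
            "SubscriptionName": "daily-anomaly-summary",
            "MonitorArnList": [monitor["MonitorArn"]],
            "Subscribers": [
                {"Address": "finops-team@example.com", "Type": "EMAIL"},
                {"Address": "workload-owners@example.com", "Type": "EMAIL"},
            ],
            "Frequency": "DAILY",
            "Threshold": 1000.0,
        }
    )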
Resources
Related documents:
• AWS Budgets
• AWS Cost Explorer
• Daily Cost and Usage Budgets
• AWS Cost Anomaly Detection
Related examples:
• Meet with your account team: Schedule a regular cadence to meet with your account team and
discuss industry trends and AWS services. Speak with your account manager, Solutions Architect,
and support team.
Resources
Related documents:
Related examples:
Implementation guidance
A cost-aware culture allows you to scale cost optimization and Cloud Financial Management
(financial operations, cloud center of excellence, cloud operations teams, and so on) through best
practices that are performed in an organic and decentralized manner across your organization.
Cost awareness allows you to create high levels of capability across your organization with minimal
effort, compared to a strict top-down, centralized approach.
Creating cost awareness, especially around the primary cost drivers in cloud computing, allows
teams to understand the expected cost impact of any change they make. Teams that access cloud
environments should be aware of pricing models and the differences between traditional
on-premises data centers and cloud computing.
• AWS events and meetups: Attend local AWS summits, and any local meetups with other
organizations from your local area.
• Subscribe to blogs: Go to the AWS blogs pages and subscribe to the What's New Blog and other
relevant blogs to follow new releases, implementations, examples, and changes shared by AWS.
Resources
Related documents:
• AWS Blog
• AWS Cost Management
• AWS News Blog
Related examples:
Quantifying business value from cost optimization allows you to understand the entire set of
benefits to your organization. Because cost optimization is a necessary investment, quantifying
business value allows you to explain the return on investment to stakeholders. Quantifying
business value can help you gain more buy-in from stakeholders on future cost optimization
investments, and provides a framework to measure the outcomes for your organization’s cost
optimization activities.
Implementation guidance
Quantifying the business value means measuring the benefits that businesses gain from the
actions and decisions they take. Business value can be tangible (like reduced expenses or increased
profits) or intangible (like improved brand reputation or increased customer satisfaction).
Quantifying business value from cost optimization means determining how much value or benefit
you’re getting from your efforts to spend more efficiently. For example, if a company spends
$100,000 to deploy a workload on AWS and later optimizes it so that the new cost is only $80,000
for the same output, the quantified business value is the $20,000 (20 percent) reduction in spend.
Related videos:
Related examples:
Establish policies and mechanisms to verify that appropriate costs are incurred while objectives are
achieved. By employing a checks-and-balances approach, you can innovate without overspending.
Best practices
• COST02-BP01 Develop policies based on your organization requirements
• COST02-BP02 Implement goals and targets
• COST02-BP03 Implement an account structure
• COST02-BP04 Implement groups and roles
• COST02-BP05 Implement cost controls
• COST02-BP06 Track project lifecycle
lower performance storage in test and development environments), which types of resources can
be used by different groups (for example, the largest size of resource in a development account
is medium) and how long these resources will be in use (whether temporary, short term, or for a
specific period of time).
Policy example
The following is a sample policy you can review to create your own cloud governance policies,
which focus on cost optimization. Make sure you adjust policy based on your organization’s
requirements and your stakeholders’ requests.
• Policy name: Define a clear policy name, such as Resource Optimization and Cost Reduction
Policy.
• Purpose: Explain why this policy should be used and what the expected outcome is. For example,
the objective of this policy is to verify that workloads are deployed and run at the minimum cost
required to meet business requirements.
• Scope: Clearly define who should use this policy and when it should be used, for example: the
DevOps X team uses this policy for X environments (production or non-production) in the us-east Regions.
Policy statement
1. Select us-east-1 or multiple us-east Regions based on your workload’s environment and business
requirements (development, user acceptance testing, pre-production, or production).
2. Schedule Amazon EC2 and Amazon RDS instances to run between six in the morning and eight
at night (Eastern Standard Time (EST)).
3. Stop all unused Amazon EC2 instances after eight hours and unused Amazon RDS instances
after 24 hours of inactivity.
4. Terminate all unused Amazon EC2 instances after 24 hours of inactivity in non-production
environments. Remind Amazon EC2 instance owner (based on tags) to review their stopped
Amazon EC2 instances in production and inform them that their Amazon EC2 instances will be
terminated within 72 hours if they are not in use.
5. Use a general purpose instance family and size, such as m5.large, and then resize the instance
based on CPU and memory utilization using AWS Compute Optimizer.
6. Prioritize using auto scaling to dynamically adjust the number of running instances based on
traffic.
7. Use Spot Instances for non-critical workloads.
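To illustrate how policy statements such as 2 through 4 could be automated, the following Boto3 sketch stops running Amazon EC2 and Amazon RDS resources that carry an assumed Environment=development tag; in practice it would be invoked on a schedule (for example, by Amazon EventBridge Scheduler at eight at night), and the tag key and value are placeholders.

    import boto3

    ec2 = boto3.client("ec2")
    rds = boto3.client("rds")

    # Find running EC2 instances tagged as development and stop them.
    paginator = ec2.get_paginator("describe_instances")
    instance_ids = []
    for page in paginator.paginate(
        Filters=[
            {"Name": "tag:Environment", "Values": ["development"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    ):
        for reservation in page["Reservations"]:
            instance_ids.extend(i["InstanceId"] for i in reservation["Instances"])

    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)

    # Stop standalone development RDS instances the same way, matching on a tag.
    for db in rds.describe_db_instances()["DBInstances"]:
        tags = rds.list_tags_for_resource(ResourceName=db["DBInstanceArn"])["TagList"]
        if ({"Key": "Environment", "Value": "development"} in tags
                and db["DBInstanceStatus"] == "available"):
            rds.stop_db_instance(DBInstanceIdentifier=db["DBInstanceIdentifier"])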
• Define locations for your workload: Define where your workload operates, including the
country and the area within the country. This information is used for mapping to AWS Regions
and Availability Zones.
• Define and group services and resources: Define the services that the workloads require. For
each service, specify the types, the size, and the number of resources required. Define groups for
the resources by function, such as application servers or database storage. Resources can belong
to multiple groups.
• Define and group the users by function: Define the users that interact with the workload,
focusing on what they do and how they use the workload, not on who they are or their position
in the organization. Group similar users or functions together. You can use the AWS managed
policies as a guide.
• Define the actions: Using the locations, resources, and users identified previously, define
the actions that are required by each to achieve the workload outcomes over its life time
(development, operation, and decommission). Identify the actions based on the groups, not the
individual elements in the groups, in each location. Start broadly with read or write, then refine
down to specific actions to each service.
• Define the review period: Workloads and organizational requirements can change over time.
Define the workload review schedule to ensure it remains aligned with organizational priorities.
• Document the policies: Verify the policies that have been defined are accessible as required
by your organization. These policies are used to implement, maintain, and audit access of your
environments.
Resources
Related documents:
you can achieve this through the establishment of capability in cost optimization, as well as new
service and feature releases.
Targets are the quantifiable benchmarks you want to reach to meet your goals, and benchmarks
compare your actual results against a target. Establish benchmarks with KPIs for the cost per
unit of compute services (such as Spot adoption, Graviton adoption, latest instance types, and
On-Demand coverage), storage services (such as Amazon EBS gp3 adoption, obsolete EBS snapshots,
and Amazon S3 Standard storage), or database service usage (such as Amazon RDS open-source engines,
Graviton adoption, and On-Demand coverage). These benchmarks and KPIs can help you verify that
you use AWS services in the most cost-effective manner.
The following table provides a list of standard AWS metrics for reference. Each organization can
have different target values for these KPIs.
When tracking performance metrics within an organization, distinguish between different types of metrics that
serve distinct purposes. These metrics primarily measure the performance and efficiency of the
technical infrastructure rather than directly the overall business impact. For instance, they might
track server response times, network latency, or system uptime. These metrics are crucial to assess
how well the infrastructure supports the organization's technical operations. However, they don't
provide direct insight into broader business objectives like customer satisfaction, revenue growth,
or market share. To gain a comprehensive understanding of business performance, complement
these efficiency metrics with strategic business metrics that directly correlate with business
outcomes.
Establish near real-time visibility over your KPIs and related savings opportunities and track your
progress over time. To get started with the definition and tracking of KPI goals, we recommend the
KPI dashboard from Cloud Intelligence Dashboards (CID). Based on the data from Cost and Usage
Report (CUR), the KPI dashboard provides a series of recommended cost optimization KPIs, with
the ability to set custom goals and track progress over time.
If you have other solutions to set and track KPI goals, make sure these methods are adopted by all
cloud financial management stakeholders in your organization.
Implementation steps
• Define expected usage levels: To begin, focus on usage levels. Engage with the application
owners, marketing, and greater business teams to understand what the expected usage levels
are for the workload. How might customer demand change over time, and what can change due
to seasonal increases or marketing campaigns?
• Define workload resourcing and costs: With usage levels defined, quantify the changes in
workload resources required to meet those usage levels. You may need to increase the size or
number of resources for a workload component, increase data transfer, or change workload
components to a different service at a specific level. Specify the costs at each of these major
points, and predict the change in cost when there is a change in usage.
• Define business goals: Take the output from the expected changes in usage and cost, combine
this with expected changes in technology, or any programs that you are running, and develop
goals for the workload. Goals must address usage and cost, as well as the relationship between
the two. Goals must be simple, high-level, and help people understand what the business
expects in terms of outcomes (such as making sure unused resources are kept below certain cost
level). You don't need to define goals for each unused resource type or define costs that can
cause losses in goals and targets. Verify that there are organizational programs (for example,
Implement a structure of accounts that maps to your organization. This assists in allocating and
managing costs throughout your organization.
Implementation guidance
AWS Organizations allows you to create multiple AWS accounts which can help you centrally
govern your environment as you scale your workloads on AWS. You can model your organizational
hierarchy by grouping AWS accounts in an organizational unit (OU) structure and creating multiple
AWS accounts under each OU. To create an account structure, you need to decide which of your
AWS accounts will be the management account first. After that, you can create new AWS accounts
or select existing accounts as member accounts based on your designed account structure by
following management account best practices and member account best practices.
It is advised to always have at least one management account with one member account linked
to it, regardless of your organization's size or usage. All workload resources should reside only
within member accounts, and no resources should be created in the management account. There is
no one size fits all answer for how many AWS accounts you should have. Assess your current and
future operational and cost models to ensure that the structure of your AWS accounts reflects
your organization’s goals. Some companies create multiple AWS accounts for business reasons, for
example:
• Administrative or fiscal and billing isolation is required between organization units, cost centers,
or specific workloads.
• AWS service limits are set to be specific to particular workloads.
• There is a requirement for isolation and separation between workloads and resources.
Within AWS Organizations, consolidated billing creates the construct between one or more
member accounts and the management account. Member accounts allow you to isolate and
distinguish your cost and usage by groups. A common practice is to have separate member
accounts for each organization unit (such as finance, marketing, and sales), or for each environment
lifecycle (such as development, testing and production), or for each workload (workload a, b, and
c), and then aggregate these linked accounts using consolidated billing.
Consolidated billing allows you to consolidate payment for multiple member AWS accounts under
a single management account, while still providing visibility for each linked account’s activity.
AWS Control Tower can quickly set up and configure multiple AWS accounts, ensuring that
governance is aligned with your organization’s requirements.
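The following is a minimal Boto3 sketch of creating an organizational unit and requesting a new member account under it; the OU name and account email are placeholders, and the calls must be made from the management account.

    import time
    import boto3

    org = boto3.client("organizations")

    # Look up the organization root, then create an OU for a business unit.
    root_id = org.list_roots()["Roots"][0]["Id"]
    ou = org.create_organizational_unit(ParentId=root_id, Name="Finance")

    # Account creation is asynchronous; request it, then poll the status.
    status = org.create_account(
        Email="finance-prod@example.com",   # placeholder root email for the new account
        AccountName="finance-production",   # placeholder account name
    )["CreateAccountStatus"]

    while status["State"] == "IN_PROGRESS":
        time.sleep(10)
        status = org.describe_create_account_status(
            CreateAccountRequestId=status["Id"]
        )["CreateAccountStatus"]

    # Move the new member account under the business unit's OU.
    if status["State"] == "SUCCEEDED":
        org.move_account(
            AccountId=status["AccountId"],
            SourceParentId=root_id,
            DestinationParentId=ou["OrganizationalUnit"]["Id"],
        )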
Implementation steps
Resources
Related documents:
Implementation guidance
User roles and groups are fundamental building blocks in the design and implementation of secure
and efficient systems. Roles and groups help organizations balance the need for control with
the requirement for flexibility and productivity, ultimately supporting organizational objectives
and user needs. As recommended in Identity and access management section of AWS Well-
Architected Framework Security Pillar, you need robust identity management and permissions in
place to provide access to the right resources for the right people under the right conditions. Users
receive only the access necessary to complete their tasks. This minimizes the risk associated with
unauthorized access or misuse.
After you develop policies, you can create logical groups and user roles within your organization.
This allows you to assign permissions, control usage, and help implement robust access control
mechanisms, preventing unauthorized access to sensitive information. Begin with high-level
groupings of people. Typically, this aligns with organizational units and job roles (for example, a
systems administrator in the IT Department, financial controller, or business analysts). The groups
categorize people that do similar tasks and need similar access. Roles define what a group is allowed
to do. It is easier to manage permissions for groups and roles than for individual users. Roles and
groups assign permissions consistently and systematically across all users, preventing errors and
inconsistencies.
When a user’s role changes, administrators can adjust access at the role or group level, rather than
reconfiguring individual user accounts. For example, a systems administrator in IT requires access to
create all resources, but an analytics team member only needs to create analytics resources.
Implementation steps
• Implement groups: Using the groups of users defined in your organizational policies, implement
the corresponding groups, if necessary. For best practices on users, groups and authentication,
see the Security Pillar of the AWS Well-Architected Framework.
• Implement roles and policies: Using the actions defined in your organizational policies, create
the required roles and access policies. For best practices on roles and policies, see the Security
Pillar of the AWS Well-Architected Framework.
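As a minimal illustration of these steps, the following Boto3 sketch creates a group for an assumed analytics team and attaches a permissions policy to it; the group name and the actions in the policy document are placeholders that should come from your own organizational policies.

    import json
    import boto3

    iam = boto3.client("iam")

    # Group for the analytics team, per the groupings defined in your policies.
    iam.create_group(GroupName="analytics-team")

    # Permissions policy allowing the group to work with analytics services only;
    # the actions listed here are illustrative.
    policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["athena:*", "glue:Get*", "s3:GetObject", "s3:ListBucket"],
                "Resource": "*",
            }
        ],
    }
    policy = iam.create_policy(
        PolicyName="analytics-team-access",
        PolicyDocument=json.dumps(policy_document),
    )
    iam.attach_group_policy(
        GroupName="analytics-team",
        PolicyArn=policy["Policy"]["Arn"],
    )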
Resources
Related documents:
you identify anomalous spend and root causes, so you can quickly take action. First, create a cost
monitor in AWS Cost Anomaly Detection, then choose your alerting preference by setting up a
dollar threshold (such as an alert on anomalies with impact greater than $1,000). Once you receive
alerts, you can analyze the root cause behind the anomaly and impact on your costs. You can also
monitor and perform your own anomaly analysis in AWS Cost Explorer.
Enforce governance policies in AWS through AWS Identity and Access Management and AWS
Organizations Service Control Policies (SCP). IAM allows you to securely manage access to AWS
services and resources. Using IAM, you can control who can create or manage AWS resources,
the type of resources that can be created, and where they can be created. This minimizes the
possibility of resources being created outside of the defined policy. Use the roles and groups
created previously and assign IAM policies to enforce the correct usage. SCP offers central control
over the maximum available permissions for all accounts in your organization, helping your
accounts stay within your access control guidelines. SCPs are available only in an organization
that has all features turned on, and you can configure the SCPs to either deny or allow actions for
member accounts by default. For more details on implementing access management, see the Well-
Architected Security Pillar whitepaper.
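The following Boto3 sketch illustrates one way such a guardrail might be expressed as an SCP: it denies launching Amazon EC2 instances that are not tagged with a cost-center key and attaches the policy to a placeholder organizational unit. The policy content and OU ID are assumptions for illustration only.

    import json
    import boto3

    org = boto3.client("organizations")

    # Deny launching EC2 instances without a cost-center tag in member accounts.
    scp = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Action": "ec2:RunInstances",
                "Resource": "arn:aws:ec2:*:*:instance/*",
                "Condition": {"Null": {"aws:RequestTag/cost-center": "true"}},
            }
        ],
    }
    policy = org.create_policy(
        Name="require-cost-center-tag",
        Description="Deny EC2 launches without a cost-center tag",
        Type="SERVICE_CONTROL_POLICY",
        Content=json.dumps(scp),
    )

    # Attach the SCP to an organizational unit (placeholder OU ID).
    org.attach_policy(
        PolicyId=policy["Policy"]["PolicySummary"]["Id"],
        TargetId="ou-examp-12345678",
    )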
Implementation steps
• Implement notifications on spend: Using your defined organization policies, create AWS
Budgets to notify you when spending is outside of your policies. Configure multiple cost budgets,
one for each account, which notify you about overall account spending. Configure additional cost
budgets within each account for smaller units within the account. These units vary depending
on your account structure. Some common examples are AWS Regions, workloads (using tags),
or AWS services. Configure an email distribution list as the recipient for notifications, and not an
individual's email account. You can configure an actual budget for when an amount is exceeded,
or use a forecasted budget for notifying on forecasted usage. You can also preconfigure AWS
Budget Actions that can enforce specific IAM or SCP policies, or stop target Amazon EC2 or
Amazon RDS instances. Budget Actions can be started automatically or require workflow
approval.
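A minimal sketch of such a notification, using Boto3: a Region-scoped cost budget with a forecast-based alert sent to an email distribution list and an Amazon SNS topic. The account ID, Region filter, limit, and subscriber addresses are placeholders.

    import boto3

    budgets = boto3.client("budgets")

    # Per-Region cost budget with a forecast-based alert, so the team is warned
    # before the overspend actually happens.
    budgets.create_budget(
        AccountId="111122223333",  # placeholder member account ID
        Budget={
            "BudgetName": "us-east-1-monthly-cost",
            "BudgetType": "COST",
            "TimeUnit": "MONTHLY",
            "BudgetLimit": {"Amount": "5000", "Unit": "USD"},
            "CostFilters": {"Region": ["us-east-1"]},
        },
        NotificationsWithSubscribers=[
            {
                "Notification": {
                    "NotificationType": "FORECASTED",
                    "ComparisonOperator": "GREATER_THAN",
                    "Threshold": 100.0,
                    "ThresholdType": "PERCENTAGE",
                },
                "Subscribers": [
                    {"SubscriptionType": "EMAIL", "Address": "cloud-spend-alerts@example.com"},
                    {"SubscriptionType": "SNS",
                     "Address": "arn:aws:sns:us-east-1:111122223333:budget-alerts"},
                ],
            }
        ],
    )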
Track, measure, and audit the lifecycle of projects, teams, and environments to avoid using and
paying for unnecessary resources.
Implementation guidance
By effectively tracking the project lifecycle, organizations can achieve better cost control through
enhanced planning, management, and resource optimization. The insights gained through tracking
are invaluable for making informed decisions that contribute to the cost-effectiveness and overall
success of the project.
Tracking the entire lifecycle of the workload helps you understand when workloads or workload
components are no longer required. Existing workloads and components may appear to be in use,
but when AWS releases new services or features, those components can often be decommissioned and
the new options adopted.
Check the previous stages of workloads. After a workload is in production, previous environments
can be decommissioned or greatly reduced in capacity until they are required again.
You can tag resources with a timeframe or reminder to pin the time that the workload was
reviewed. For example, if the development environment was last reviewed months ago, it could be
a good time to review it again to explore if new services can be adopted or if the environment is
in use. You can group and tag your applications with myApplications on AWS to manage and track
metadata such as criticality, environment, last reviewed, and cost center. You can both track your
workload's lifecycle and monitor and manage the cost, health, security posture, and performance
of your applications.
AWS provides various management and governance services you can use for entity lifecycle
tracking. You can use AWS Config or AWS Systems Manager to provide a detailed inventory of
your AWS resources and configuration. It is recommended that you integrate with your existing
project or asset management systems to keep track of active projects and products within your
organization. Combining your current system with the rich set of events and metrics provided by
Related examples:
Related Tools
• AWS Config
• AWS Systems Manager
• AWS Budgets
• AWS Organizations
• AWS CloudFormation
Establish policies and procedures to monitor and appropriately allocate your costs. This permits
you to measure and improve the cost efficiency of this workload.
Best practices
• COST03-BP01 Configure detailed information sources
• COST03-BP02 Add organization information to cost and usage
• COST03-BP03 Identify cost attribution categories
• COST03-BP04 Establish organization metrics
• COST03-BP05 Configure billing and cost management tools
• COST03-BP06 Allocate costs based on workload metrics
Set up cost management and reporting tools for enhanced analysis and transparency of cost
and usage data. Configure your workload to create log entries that facilitate the tracking and
segregation of costs and usage.
Implementation guidance
Detailed billing information, such as hourly granularity in cost management tools, allows
organizations to track their consumption in more detail and helps them to identify some of
option if you want to quickly deploy a dashboard of your cost and usage data without the ability
for customization.
If desired, you can still export CUR in legacy mode, where you can integrate other processing
services such as AWS Glue to prepare the data for analysis and perform data analysis with Amazon
Athena using SQL to query the data.
Implementation steps
• Create data exports: Create customized exports with the data you want and control the schema
of your exports. Create billing and cost management data exports using basic SQL, and visualize
your billing and cost management data by integrating with Amazon QuickSight. You can also
export your data in standard mode to analyze your data with other processing tools like Amazon
Athena.
• Configure the cost and usage report: Using the billing console, configure at least one cost
and usage report. Configure a report with hourly granularity that includes all identifiers and
resource IDs. You can also create other reports with different granularities to provide higher-level
summary information.
• Configure hourly granularity in Cost Explorer: To access cost and usage data with hourly
granularity for the past 14 days, consider enabling hourly and resource level data in the billing
console (see the query sketch after this list).
• Configure application logging: Verify that your application logs each business outcome that
it delivers so it can be tracked and measured. Ensure that the granularity of this data is at least
hourly so it matches with the cost and usage data. For more details on logging and monitoring,
see Well-Architected Operational Excellence Pillar.
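Relating to the hourly-granularity step above, the following Boto3 sketch retrieves one day of hourly unblended cost grouped by service; the dates are placeholders, and hourly granularity must already be enabled in the Cost Explorer settings.

    import boto3

    ce = boto3.client("ce")

    # Hourly unblended cost for one day, grouped by service. Hourly data covers
    # only the trailing 14 days.
    response = ce.get_cost_and_usage(
        TimePeriod={"Start": "2024-06-01T00:00:00Z", "End": "2024-06-02T00:00:00Z"},
        Granularity="HOURLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )

    for result in response["ResultsByTime"]:
        for group in result["Groups"]:
            service = group["Keys"][0]
            amount = group["Metrics"]["UnblendedCost"]["Amount"]
            print(result["TimePeriod"]["Start"], service, amount)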
Resources
Related documents:
You can use tag policies in AWS Organizations to define rules for how tags can be used on AWS
resources in your accounts. Tag policies allow you to easily adopt a standardized approach for
tagging AWS resources.
AWS Tag Editor allows you to add, delete, and manage tags of multiple resources. With Tag Editor,
you search for the resources that you want to tag, and then manage tags for the resources in your
search results.
AWS Cost Categories allows you to assign organization meaning to your costs, without requiring
tags on resources. You can map your cost and usage information to unique internal organization
structures. You define category rules to map and categorize costs using billing dimensions, such as
accounts and tags. This provides another level of management capability in addition to tagging.
You can also map specific accounts and tags to multiple projects.
Implementation steps
• Define a tagging schema: Gather all stakeholders from across your business to define a schema.
This typically includes people in technical, financial, and management roles. Define a list of tags
that all resources must have, as well as a list of tags that resources should have. Verify that the
tag names and values are consistent across your organization.
• Tag resources: Using your defined cost attribution categories, place tags on all resources in your
workloads according to the categories. Use tools such as the CLI, Tag Editor, or AWS Systems
Manager to increase efficiency.
• Implement AWS Cost Categories: You can create Cost Categories without implementing
tagging. Cost categories use the existing cost and usage dimensions. Create category rules from
your schema and implement them into cost categories.
• Automate tagging: To verify that you maintain high levels of tagging across all resources,
automate tagging so that resources are automatically tagged when they are created. Use services
such as AWS CloudFormation to verify that resources are tagged when created. You can also
create a custom solution to tag automatically using Lambda functions or use a microservice that
scans the workload periodically and removes any resources that are not tagged, which is ideal for
test and development environments.
• Monitor and report on tagging: To verify that you maintain high levels of tagging across your
organization, report and monitor the tags across your workloads. You can use AWS Cost Explorer
to view the cost of tagged and untagged resources, or use services such as Tag Editor. Regularly
review the number of untagged resources and take action to add tags until you reach the desired
level of tagging.
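As an illustration of monitoring tag coverage, the following Boto3 sketch uses the Resource Groups Tagging API to report resources (among those the API returns, which are tagged or previously tagged resources) that are missing an assumed mandatory cost-center tag.

    import boto3

    tagging = boto3.client("resourcegroupstaggingapi")

    # Report resources missing the mandatory cost-center tag so owners can be
    # asked to tag them, or automation can tag or remove them.
    untagged = []
    paginator = tagging.get_paginator("get_resources")
    for page in paginator.paginate():
        for resource in page["ResourceTagMappingList"]:
            tag_keys = {tag["Key"] for tag in resource.get("Tags", [])}
            if "cost-center" not in tag_keys:
                untagged.append(resource["ResourceARN"])

    print(f"{len(untagged)} resources are missing the cost-center tag")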
Work with your finance team and other relevant stakeholders to understand the requirements
of how costs must be allocated within your organization during your regular cadence calls.
Workload costs must be allocated throughout the entire lifecycle, including development,
testing, production, and decommissioning. Understand how the costs incurred for learning, staff
development, and idea creation are attributed in the organization. This can be helpful to correctly
allocate accounts used for this purpose to training and development budgets instead of generic IT
cost budgets.
After defining your cost attribution categories with stakeholders in your organization, use AWS
Cost Categories to group your cost and usage information into meaningful categories in the AWS
Cloud, such as cost for a specific project, or AWS accounts for departments or business units. You
can create custom categories and map your cost and usage information into these categories based
on rules you define using various dimensions such as account, tag, service, or charge type. Once
cost categories are set up, you can view your cost and usage information by these categories, which
allows your organization to make better strategic and purchasing decisions. These categories are
visible in AWS Cost Explorer, AWS Budgets, and AWS Cost and Usage Report as well.
For example, create cost categories for your business units (DevOps team), and under each
category create multiple rules (rules for each sub category) with multiple dimensions (AWS
accounts, cost allocation tags, services or charge type) based on your defined groupings. With
cost categories, you can organize your costs using a rule-based engine. The rules that you
configure organize your costs into categories. Within these rules, you can filter using multiple
dimensions for each category, such as specific AWS accounts, AWS services, or charge types. You
can then use these categories across multiple products in the AWS Billing and Cost Management
and Cost Management console. This includes AWS Cost Explorer, AWS Budgets, AWS Cost and
Usage Report, and AWS Cost Anomaly Detection.
As an example, the following diagram displays how to group your costs and usage information in
your organization by having multiple teams (cost category), multiple environments (rules), and
each environment having multiple resources or assets (dimensions).
• Define AWS Cost Categories: Create cost categories with AWS Cost Categories to organize your
cost and usage information and map it into meaningful categories. A resource can be assigned to
multiple categories, so define as many categories as needed to manage your costs within the
categorized structure.
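A minimal Boto3 sketch of defining such a cost category; the category name, account IDs, and tag values are placeholders for your own groupings.

    import boto3

    ce = boto3.client("ce")

    # Group linked accounts and tagged usage into business-unit category values.
    ce.create_cost_category_definition(
        Name="BusinessUnit",
        RuleVersion="CostCategoryExpression.v1",
        Rules=[
            {
                "Value": "DevOps",
                "Rule": {
                    "Or": [
                        {"Dimensions": {"Key": "LINKED_ACCOUNT", "Values": ["111122223333"]}},
                        {"Tags": {"Key": "team", "Values": ["devops"]}},
                    ]
                },
            },
            {
                "Value": "Finance",
                "Rule": {"Dimensions": {"Key": "LINKED_ACCOUNT", "Values": ["444455556666"]}},
            },
        ],
    )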
Resources
Related documents:
Related examples:
• Organize your cost and usage data with AWS Cost Categories
• Managing your costs with AWS Cost Categories
• Well-Architected Labs: Cost and Usage Visualization
• Well-Architected Labs: Cost Categories
Establish the organization metrics that are required for this workload. Example metrics of a
workload are customer reports produced, or web pages served to customers.
Implementation guidance
To establish strong accountability, consider your account strategy first as part of your cost
allocation strategy. Get this right, and you may not need to go any further. Otherwise, cost
awareness suffers and further pain points can emerge.
To encourage accountability of cloud spend, grant users access to tools that provide visibility
into their costs and usage. AWS recommends that you configure all workloads and teams for the
following purposes:
• Organize: Establish your cost allocation and governance baseline with your own tagging strategy
and taxonomy. Create multiple AWS accounts with tools such as AWS Control Tower or AWS
Organizations. Tag the supported AWS resources and categorize them meaningfully based on
your organization structure (business units, departments, or projects). Tag account names with
their cost centers and map them with AWS Cost Categories to group each business unit's accounts
to its cost centers, so that the business unit owner can see the consumption of multiple accounts in
one place.
• Access: Track organization-wide billing information in consolidated billing. Verify the right
stakeholders and business owners have access.
• Control: Build effective governance mechanisms with the right guardrails to prevent unexpected
scenarios when using Service Control Policies (SCP), tag policies, IAM policies and budget alerts.
For example, you can use effective control mechanisms to allow teams to create specific resources
only in preferred Regions, and to prevent resource creation without a required tag (such as
cost-center).
• Current state: Configure a dashboard that shows current levels of cost and usage. The dashboard
should be available in a highly visible place within the work environment like an operations
dashboard. You can export data and use the Cost and Usage Dashboard from the AWS Cost
Optimization Hub or any supported product to create this visibility. You may need to create
different dashboards for different personas. For example, a manager's dashboard may differ from an
engineering dashboard.
• Notifications: Provide notifications when cost or usage exceeds defined limits and anomalies
occur with AWS Budgets or AWS Cost Anomaly Detection.
• Reports: Summarize all cost and usage information. Raise awareness and accountability of your
cloud spend with detailed, attributable cost data. Create reports that are relevant to the team
consuming them and contain recommendations.
• Configure AWS Cost Anomaly Detection: Use AWS Cost Anomaly Detection for your accounts,
core services or cost categories you created to monitor your cost and usage and detect unusual
spends. You can receive alerts individually or in aggregated reports, by email or through an
Amazon SNS topic, which allows you to analyze and determine the root cause of the anomaly
and identify the factor that is driving the cost increase.
• Use cost analysis tools: Configure AWS Cost Explorer for your workload and accounts to
visualize your cost data for further analysis. Create a dashboard for the workload that tracks
overall spend, key usage metrics for the workload, and forecast of future costs based on your
historical cost data.
• Use cost-saving analysis tools: Use AWS Cost Optimization Hub to identify savings
opportunities with tailored recommendations, including deleting unused resources, rightsizing,
Savings Plans, reservations, and AWS Compute Optimizer recommendations.
• Configure advanced tools: You can optionally create visuals to facilitate interactive analysis
and sharing of cost insights. With Data Exports on AWS Cost Optimization Hub, you can create a
cost and usage dashboard powered by Amazon QuickSight for your organization that provides
additional detail and granularity. You can also implement advanced analysis capability by using
data exports with Amazon Athena for advanced queries, and create dashboards on Amazon
QuickSight. Work with AWS Partners to adopt cloud management solutions for consolidated
cloud bill monitoring and optimization.
Resources
Related documents:
Related videos:
Implementation steps
• Allocate costs to workload metrics: Using the defined metrics and configured tags, create
a metric that combines the workload output and workload cost. Use analytics services such
as Amazon Athena and Amazon QuickSight to create an efficiency dashboard for the overall
workload and any components.
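As one possible illustration of this step, the following Boto3 sketch submits an Amazon Athena query that divides hourly cost by a business metric; the database, table, and column names and the S3 results location are hypothetical and would map to your own CUR and metrics data.

    import boto3

    athena = boto3.client("athena")

    # Join hourly cost (grouped by the workload tag) against a business metric
    # table to compute cost per business outcome.
    query = """
        SELECT c.usage_hour,
               c.cost_usd / NULLIF(m.reports_produced, 0) AS cost_per_report
        FROM workload_hourly_cost c
        JOIN workload_hourly_metrics m ON c.usage_hour = m.usage_hour
        ORDER BY c.usage_hour
    """
    athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "cur_database"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )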
Resources
Related documents:
Related examples:
• Improve cost visibility of Amazon ECS and AWS Batch with AWS Split Cost Allocation Data
Implement change control and resource management from project inception to end-of-life. This
ensures that you shut down or terminate unused resources to reduce waste.
Best practices
• COST04-BP01 Track resources over their lifetime
• COST04-BP02 Implement a decommissioning process
• COST04-BP03 Decommission resources
• COST04-BP04 Decommission resources automatically
• COST04-BP05 Enforce data retention policies
Define and implement a method to track resources and their associations with systems over their
lifetime. You can use tagging to identify the workload or function of the resource.
Related videos:
Related examples:
Implementation guidance
Implement a standardized process across your organization to identify and remove unused
resources. The process should define how frequently searches are performed and the process used
to remove resources, to verify that all organization requirements are met.
Implementation steps
• Create and implement a decommissioning process: Work with the workload developers and
owners to build a decommissioning process for the workload and its resources. The process
should cover the method to verify if the workload is in use, and also if each of the workload
resources are in use. Detail the steps necessary to decommission the resource, removing them
from service while ensuring compliance with any regulatory requirements. Any associated
resources should be included, such as licenses or attached storage. Notify the workload owners
that the decommissioning process has been started.
If the resource is an object in Amazon S3 Glacier storage and if you delete an archive before
meeting the minimum storage duration, you will be charged a prorated early deletion fee.
Amazon S3 Glacier minimum storage duration depends on the storage class used. For a summary
of minimum storage duration for each storage class, see Performance across the Amazon S3
storage classes. For detail on how early deletion fees are calculated, see Amazon S3 pricing.
The following simple decommissioning process flowchart outlines the decommissioning steps.
Before decommissioning resources, verify that resources you have identified for decommissioning
are not being used by the organization.
Resources
Related documents:
• AWS CloudTrail
Related videos:
Related examples:
Design your workload to gracefully handle resource termination as you identify and decommission
non-critical resources, resources that are not required, or resources with low utilization.
Implementation guidance
Use automation to reduce or remove the associated costs of the decommissioning process.
Designing your workload to perform automated decommissioning will reduce the overall workload
costs during its lifetime. You can use Amazon EC2 Auto Scaling or Application Auto Scaling to
perform the decommissioning process. You can also implement custom code using the API or SDK
to decommission workload resources automatically.
Modern applications are built serverless-first, a strategy that prioritizes the adoption of serverless
services. AWS developed serverless services for all three layers of your stack: compute, integration,
and data stores. Using serverless architecture will allow you to save costs during low-traffic periods
by scaling up and down automatically.
Implementation steps
• Implement Amazon EC2 Auto Scaling or Application Auto Scaling: For resources that are
supported, configure them with Amazon EC2 Auto Scaling or Application Auto Scaling. These
services can help you optimize your utilization and cost efficiencies when consuming AWS
services. When demand drops, these services will automatically remove any excess resource
capacity so you avoid overspending.
• Configure CloudWatch to terminate instances: Instances can be configured to terminate
using CloudWatch alarms. Using the metrics from the decommissioning process, implement an
alarm with an Amazon Elastic Compute Cloud action. Verify the operation in a non-production
environment before rolling out.
• Implement code within the workload: You can use the AWS SDK or AWS CLI to decommission
workload resources. Implement code within the application that integrates with AWS and
terminates or removes resources that are no longer used.
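A minimal Boto3 sketch of the CloudWatch alarm approach described above, using the built-in Amazon EC2 terminate action when an instance has been effectively idle for 24 hours; the instance ID, Region, and thresholds are placeholders, and the alarm should be validated in a non-production environment first.

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Terminate an instance whose maximum CPU stays below 2% for 24 consecutive
    # one-hour periods, using the built-in EC2 alarm action.
    cloudwatch.put_metric_alarm(
        AlarmName="terminate-idle-dev-instance",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
        Statistic="Maximum",
        Period=3600,
        EvaluationPeriods=24,
        Threshold=2.0,
        ComparisonOperator="LessThanThreshold",
        AlarmActions=["arn:aws:automate:us-east-1:ec2:terminate"],
    )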
You can use Amazon Data Lifecycle Manager to automate the creation and deletion of Amazon Elastic Block Store snapshots and
Amazon EBS-backed Amazon Machine Images (AMIs), and use Amazon S3 Intelligent-Tiering or an
Amazon S3 lifecycle configuration to manage the lifecycle of your Amazon S3 objects. You can also
implement custom code using the API or SDK to create lifecycle policies and policy rules for objects
to be deleted automatically.
Implementation steps
• Use Amazon Data Lifecycle Manager: Use lifecycle policies on Amazon Data Lifecycle Manager
to automate deletion of Amazon EBS snapshots and Amazon EBS-backed AMIs.
• Set up lifecycle configuration on a bucket: Use Amazon S3 lifecycle configuration on a bucket
to define actions for Amazon S3 to take during an object's lifecycle, as well as deletion at the end
of the object's lifecycle, based on your business requirements.
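A minimal Boto3 sketch of the Amazon S3 lifecycle configuration step; the bucket name, prefix, and retention periods are placeholders to adjust to your data retention policies.

    import boto3

    s3 = boto3.client("s3")

    # Transition logs to infrequent access after 30 days, then expire them after
    # 365 days.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-log-bucket",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "expire-old-logs",
                    "Status": "Enabled",
                    "Filter": {"Prefix": "logs/"},
                    "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
                    "Expiration": {"Days": 365},
                }
            ]
        },
    )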
Resources
Related documents:
Related videos:
Related examples:
Cost-effective resources
Questions
• COST 5. How do you evaluate cost when you select services?
When selecting services for your workload, it is key that you understand your organization’s
priorities. Create a balance between cost optimization and other AWS Well-Architected Framework
pillars, such as performance and reliability. This process should be conducted systematically
and regularly to reflect changes in the organization's objectives, market conditions, and
operational dynamics. A fully cost-optimized workload is the solution that is most aligned to
your organization’s requirements, not necessarily the lowest cost. Meet with all teams in your
organization, such as product, business, technical, and finance to collect information. Evaluate the
impact of tradeoffs between competing interests or alternative approaches to help make informed
decisions when determining where to focus efforts or choosing a course of action.
For example, accelerating speed to market for new features may be emphasized over cost
optimization, or you may choose a relational database for non-relational data to simplify the
effort to migrate a system, rather than migrating to a database optimized for your data type and
updating your application.
Implementation steps
• Identify organization requirements for cost: Meet with team members from your organization,
including those in product management, application owners, development and operational
teams, management, and financial roles. Prioritize the Well-Architected pillars for this workload
and its components. The output should be a list of the pillars in order. You can also add a weight
to each pillar to indicate how much additional focus it has, or how similar the focus is between
two pillars.
• Address the technical debt and document it: During the workload review, address the technical
debt. Document a backlog item to revisit the workload in the future, with the goal of refactoring
or re-architecting to optimize it further. It's essential to clearly communicate the trade-offs that
were made to other stakeholders.
Resources
• REL11-BP07 Architect your product to meet availability targets and uptime service level
agreements (SLAs)
• OPS01-BP06 Evaluate tradeoffs
AWS Cost Explorer and the AWS Cost and Usage Reports (CUR) can analyze the cost of a proof
of concept (PoC) or running environment. You can also use AWS Pricing Calculator to estimate
workload costs.
Write a workflow to be followed by technical teams to review their workloads. Keep this workflow
simple, but also cover all the necessary steps to make sure the teams understand each component
of the workload and its pricing. Your organization can then follow and customize this workflow
based on the specific needs of each team.
1. List each service in use for your workload: This is a good starting point. Identify all of the
services currently in use and where costs originate from.
2. Understand how pricing works for those services: Understand the pricing model of each
service. Different AWS services have different pricing models based on factors like usage volume,
data transfer, and feature-specific pricing.
3. Focus on the services that have unexpected workload costs and that do not align with
your expected usage and business outcome: Identify outliers or services where the cost is
not proportional to the value or usage by using AWS Cost Explorer or AWS Cost and Usage
Reports. It's important to correlate costs with business outcomes to prioritize optimization
efforts.
4. Use AWS Cost Explorer, CloudWatch Logs, VPC Flow Logs, and Amazon S3 Storage Lens to
understand the root cause of those high costs: These tools are instrumental in the diagnosis of
high costs. Each service offers a different lens to view and analyze usage and costs. For instance,
Cost Explorer helps determine overall cost trends, CloudWatch Logs provides operational
insights, VPC Flow Logs displays IP traffic, and Amazon S3 Storage Lens is useful for storage
analytics.
5. Use AWS Budgets to set budgets for certain amounts for services or accounts: Setting budgets
is a proactive way to manage costs. Use AWS Budgets to set custom budget thresholds and
receive alerts when costs exceed those thresholds.
6. Configure Amazon CloudWatch alarms to send billing and usage alerts: Set up monitoring
and alerts for cost and usage metrics. CloudWatch alarms can notify you when certain
thresholds are breached, which improves intervention response time.
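As an illustration of step 6, the following Boto3 sketch creates a CloudWatch billing alarm; the threshold and SNS topic are placeholders, billing metrics are published only in us-east-1, and billing alerts must be enabled in the account's billing preferences.

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="monthly-estimated-charges",
        Namespace="AWS/Billing",
        MetricName="EstimatedCharges",
        Dimensions=[{"Name": "Currency", "Value": "USD"}],
        Statistic="Maximum",
        Period=21600,               # evaluated over six-hour periods
        EvaluationPeriods=1,
        Threshold=5000.0,           # placeholder monthly threshold in USD
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:111122223333:billing-alerts"],  # placeholder topic
    )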
Facilitate notable enhancements and financial savings over time through strategic review of all
workload components, irrespective of their present attributes. The effort invested in this review
process should be deliberate, with careful consideration of the potential advantages that might be
realized.
to lift and shift (also known as rehost) your databases from your on-premises environment to the
cloud as rapidly as possible and optimize later. It is worth exploring the possible savings attained
by using managed services on AWS that may remove or reduce license costs. Managed services on
AWS remove the operational and administrative burden of maintaining a service, such as patching
or upgrading the OS, and allow you to focus on innovation and business.
Since managed services operate at cloud scale, they can offer a lower cost per transaction or
service. You can make potential optimizations in order to achieve some tangible benefit, without
changing the core architecture of the application. For example, you may be looking to reduce the
amount of time you spend managing database instances by migrating to a database-as-a-service
platform like Amazon Relational Database Service (Amazon RDS) or migrating your application to a
fully managed platform like AWS Elastic Beanstalk.
Usually, managed services have attributes that you can set to ensure sufficient capacity. You
must set and monitor these attributes so that your excess capacity is kept to a minimum and
performance is maximized. You can modify the attributes of AWS Managed Services using the
AWS Management Console or AWS APIs and SDKs to align resource needs with changing demand.
For example, you can increase or decrease the number of nodes on an Amazon EMR cluster (or an
Amazon Redshift cluster) to scale out or in.
You can also pack multiple instances on an AWS resource to activate higher density usage. For
example, you can provision multiple small databases on a single Amazon Relational Database
Service (Amazon RDS) database instance. As usage grows, you can migrate one of the databases to
a dedicated Amazon RDS database instance using a snapshot and restore process.
When provisioning workloads on managed services, you must understand the requirements of
adjusting the service capacity. These requirements are typically time, effort, and any impact to
normal workload operation. The provisioned resource must allow time for any changes to occur,
so provision the required overhead to allow for this. The ongoing effort required to modify services
can be reduced to virtually zero by using APIs and SDKs that are integrated with system and
monitoring tools, such as Amazon CloudWatch.
Amazon RDS, Amazon Redshift, and Amazon ElastiCache provide a managed database service.
Amazon Athena, Amazon EMR, and Amazon OpenSearch Service provide a managed analytics
service.
AMS is a service that operates AWS infrastructure on behalf of enterprise customers and partners.
It provides a secure and compliant environment that you can deploy your workloads onto. AMS
• Consolidate data from identical SQL Server databases into a single Amazon RDS for SQL Server
database using AWS DMS
• Deliver data at scale to Amazon Managed Streaming for Apache Kafka (Amazon MSK)
• Migrate an ASP.NET web application to AWS Elastic Beanstalk
Open-source software eliminates software licensing costs, which can contribute significant costs to
workloads. Where licensed software is required, avoid licenses bound to arbitrary attributes such
as CPUs; instead, look for licenses that are bound to output or outcomes. The cost of these licenses scales
more closely to the benefit they provide.
Implementation guidance
Open source originated in the context of software development to indicate that the software
complies with certain free distribution criteria. Open source software is composed of source code
that anyone can inspect, modify, and enhance. Based on business requirements, skill of engineers,
forecasted usage, or other technology dependencies, organizations can consider using open source
software on AWS to minimize their license costs. In other words, the cost of software licenses can
be reduced through the use of open source software. This can have significant impact on workload
costs as the size of the workload scales.
Measure the benefits of licensed software against the total cost to optimize your workload. Model
any changes in licensing and how they would impact your workload costs. If a vendor changes the
cost of your database license, investigate how that impacts the overall efficiency of your workload.
Consider historical pricing announcements from your vendors for trends of licensing changes
across their products. Licensing costs may also scale independently of throughput or usage, such
as licenses that scale by hardware (CPU bound licenses). These licenses should be avoided because
costs can rapidly increase without corresponding outcomes.
For instance, operating an Amazon EC2 instance in us-east-1 with a Linux operating system allows
you to cut costs by approximately 45%, compared to running another Amazon EC2 instance that
runs on Windows.
The AWS Pricing Calculator offers a comprehensive way to compare the costs of various resources
with different license options, such as Amazon RDS instances and different database engines.
Implementation guidance
Consider the cost of services and options when selecting all components. This includes using
application level and managed services, such as Amazon Relational Database Service (Amazon
RDS), Amazon DynamoDB, Amazon Simple Notification Service (Amazon SNS), and Amazon Simple
Email Service (Amazon SES) to reduce overall organization cost.
Use serverless and containers for compute, such as AWS Lambda and Amazon Simple Storage
Service (Amazon S3) for static websites. Containerize your application if possible and use AWS
Managed Container Services such as Amazon Elastic Container Service (Amazon ECS) or Amazon
Elastic Kubernetes Service (Amazon EKS).
Minimize license costs by using open-source software, or software that does not have license fees
(for example, Amazon Linux for compute workloads, or migrating databases to Amazon Aurora).
You can use serverless or application-level services such as Lambda, Amazon Simple Queue Service
(Amazon SQS), Amazon SNS, and Amazon SES. These services remove the need for you to manage
a resource and provide the function of code execution, queuing services, and message delivery. The
other benefit is that they scale in performance and cost in line with usage, allowing efficient cost
allocation and attribution.
Using event-driven architecture is also possible with serverless services. Event-driven architectures
are push-based, so everything happens on demand as the event presents itself in the router.
This way, you’re not paying for continuous polling to check for an event. This means less
network bandwidth consumption, less CPU utilization, less idle fleet capacity, and fewer SSL/TLS
handshakes.
For more information on serverless, see Well-Architected Serverless Application lens whitepaper.
Implementation steps
• Select each service to optimize cost: Using your prioritized list and analysis, select each option
that provides the best match with your organizational priorities. Instead of increasing the
capacity to meet the demand, consider other options which may give you better performance
with lower cost. For example, if your databases on AWS need to serve higher expected traffic,
consider either increasing the instance size or using Amazon ElastiCache (Redis or Memcached) to
provide a caching layer for your databases.
A common trigger for review is a change in usage patterns. Significant changes in usage can indicate that alternate services
would be more optimal.
If you need to move data into the AWS Cloud, you can select from a wide variety of services AWS offers
and partner tools to help you migrate your data sets, whether they are files, databases, machine
images, block volumes, or even tape backups. For example, to move a large amount of data to
and from AWS or process data at the edge, you can use one of the AWS purpose-built devices
to cost effectively move petabytes of data offline. Another example: for higher data transfer
rates, AWS Direct Connect may be cheaper than a VPN while providing the required consistent
connectivity for your business.
Based on the cost analysis for different usage over time, review your scaling activity. Analyze
the result to see if the scaling policy can be tuned to add instances with multiple instance types
and purchase options. Review your settings to see if the minimum can be reduced to serve user
requests but with a smaller fleet size, and add more resources to meet the expected high demand.
Perform cost analysis for different usage over time by discussing with stakeholders in your
organization and use AWS Cost Explorer’s forecast feature to predict the potential impact of
service changes. Monitor usage levels using AWS Budgets, CloudWatch billing alarms, and
AWS Cost Anomaly Detection to identify and implement the most cost-effective services sooner.
Implementation steps
• Define predicted usage patterns: Working with your organization, such as marketing and
product owners, document what the expected and predicted usage patterns will be for the
workload. Discuss with business stakeholders about both historical and forecasted cost and
usage increases and make sure increases align with business requirements. Identify calendar
days, weeks, or months where you expect more users to use your AWS resources, which indicate
that you should increase the capacity of the existing resources or adopt additional services to
reduce the cost and increase performance.
• Perform cost analysis at predicted usage: Using the usage patterns defined, perform analysis
at each of these points. The analysis effort should reflect the potential outcome. For example,
if the change in usage is large, a thorough analysis should be performed to verify any costs and
changes. In other words, when cost increases, usage should increase for business as well.
Resources
Related documents:
Implementation guidance
Perform cost modelling for your workload and each of its components to understand the balance
between resources, and find the correct size for each resource in the workload, given a specific level
of performance. Understanding cost considerations can inform your organizational business case
and decision-making process when evaluating the value realization outcomes for planned workload
deployment.
Perform benchmark activities for the workload under different predicted loads and compare the
costs. The modelling effort should reflect potential benefit; for example, time spent is proportional
to component cost or predicted saving. For best practices, refer to the Review section of the
Performance Efficiency Pillar of the AWS Well-Architected Framework.
As an example, to create cost modeling for a workload consisting of compute resources, AWS
Compute Optimizer can assist with cost modelling for running workloads. It provides right-
sizing recommendations for compute resources based on historical usage. Make sure CloudWatch
Agents are deployed to the Amazon EC2 instances to collect memory metrics which help you with
more accurate recommendations within AWS Compute Optimizer. This is the ideal data source
for compute resources because it is a free service that uses machine learning to make multiple
recommendations depending on levels of risk.
There are multiple services you can use with custom logs as data sources for rightsizing operations
for other services and workload components, such as AWS Trusted Advisor, Amazon CloudWatch
and Amazon CloudWatch Logs. AWS Trusted Advisor checks resources and flags resources with low
utilization which can help you right size your resources and create cost modelling.
The following are recommendations for cost modelling data and metrics:
• The monitoring must accurately reflect the user experience. Select the correct granularity for the
time period and thoughtfully choose the maximum or 99th percentile instead of the average.
• Select the correct granularity for the time period of analysis that is required to cover any
workload cycles. For example, if a two-week analysis is performed, you might be overlooking a
monthly cycle of high utilization, which could lead to under-provisioning.
• Choose the right AWS services for your planned workload by considering your existing
commitments, selected pricing models for other workloads, and ability to innovate faster and
focus on your core business value.
Implementation steps
Implementation guidance
Amazon EC2 provides a wide selection of instance types with different levels of CPU, memory,
storage, and networking capacity to fit different use cases. These instance types feature different
blends of CPU, memory, storage, and networking capabilities, giving you versatility when selecting
the right resource combination for your projects. Every instance type comes in multiple sizes,
so that you can adjust your resources based on your workload’s demands. To determine which
instance type you need, gather details about the system requirements of the application or
software that you plan to run on your instance. These details should include the following:
• Operating system
• Number of CPU cores
• GPU cores
• Amount of system memory (RAM)
• Storage type and space
• Network bandwidth requirement
Identify the purpose of compute requirements and which instance is needed, and then explore the
various Amazon EC2 instance families. Amazon offers the following instance type families:
• General Purpose
• Compute Optimized
• Memory Optimized
• Storage Optimized
• Accelerated Computing
• HPC Optimized
For a deeper understanding of the specific purposes and use cases that a particular Amazon EC2
instance family can fulfill, see AWS Instance types.
Gathering system requirements is critical for selecting the specific instance family and instance
type that best serves your needs. Instance type names combine the family name and the instance
size. For example, the t2.micro instance is from the T2 family and is micro-sized.
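To gather these details programmatically, you can query instance type specifications with the AWS SDK. The following is a minimal sketch using the AWS SDK for Python (Boto3); the instance types listed are illustrative choices, not a recommendation.

```python
import boto3

ec2 = boto3.client("ec2")

# Compare the vCPU, memory, and network profile of a few candidate instance types.
response = ec2.describe_instance_types(InstanceTypes=["t3.micro", "m5.large", "r5.large"])

for it in response["InstanceTypes"]:
    print(
        it["InstanceType"],
        it["VCpuInfo"]["DefaultVCpus"], "vCPUs,",
        it["MemoryInfo"]["SizeInMiB"], "MiB memory,",
        it["NetworkInfo"]["NetworkPerformance"],
    )
```

Comparing these attributes against the gathered system requirements narrows the search to one or two instance families before any cost modeling begins.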
Select resource size or type based on workload and resource characteristics (for example, compute,
memory, throughput, or write intensive). This selection is typically made using cost modeling and
metrics from the running workload across compute, storage, data, and networking services. It can
be implemented with a feedback loop such as automatic scaling or by custom code in the workload.
Implementation guidance
Create a feedback loop within the workload that uses active metrics from the running workload to
make changes to that workload. You can use a managed service, such as AWS Auto Scaling, which
you configure to perform the right sizing operations for you. AWS also provides APIs, SDKs, and
features that allow resources to be modified with minimal effort. You can program a workload to
stop-and-start an Amazon EC2 instance to allow a change of instance size or instance type. This
provides the benefits of right-sizing while removing almost all the operational cost required to
make the change.
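As a minimal sketch of this stop-and-start pattern using the AWS SDK for Python (Boto3), assuming a placeholder instance ID and that the new instance type has already been selected through cost modeling:

```python
import boto3

ec2 = boto3.client("ec2")
INSTANCE_ID = "i-0123456789abcdef0"  # placeholder instance ID

# Right size an existing instance: stop it, change the instance type, then start it again.
ec2.stop_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])

ec2.modify_instance_attribute(
    InstanceId=INSTANCE_ID,
    InstanceType={"Value": "m5.large"},  # the new, right-sized instance type
)

ec2.start_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])
```

Run this kind of change during a maintenance window, because the instance is unavailable while it is stopped.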
Some AWS services have built-in automatic type or size selection, such as Amazon Simple Storage
Service Intelligent-Tiering. Amazon S3 Intelligent-Tiering automatically moves your data between
two access tiers, frequent access and infrequent access, based on your usage patterns.
Implementation steps
• Increase your observability by configuring workload metrics: Capture key metrics for the
workload. These metrics provide an indication of the customer experience, such as workload
output, and align to the differences between resource types and sizes, such as CPU and memory
usage. For compute resources, analyze performance data to right size your Amazon EC2 instances.
Identify idle instances and ones that are underutilized. Key metrics to look for are CPU usage
and memory utilization (for example, 40% CPU utilization 90% of the time, as explained in
Rightsizing with AWS Compute Optimizer and Memory Utilization Enabled). Identify instances
with a maximum CPU usage and memory utilization of less than 40% over a four-week period.
These are the instances to right size to reduce costs. For storage resources such as Amazon
S3, you can use Amazon S3 Storage Lens, which allows you to see 28 metrics across various
categories at the bucket level, and 14 days of historical data in the dashboard by default. You can
filter your Amazon S3 Storage Lens dashboard by summary and cost optimization or events to
analyze specific metrics.
• View rightsizing recommendations: Use the rightsizing recommendations in AWS Compute
Optimizer and the Amazon EC2 rightsizing tool in the Cost Management console, or review
AWS Trusted Advisor right-sizing checks, to make adjustments to your workload. It is important
to use the right tools when right-sizing different resources and to follow right-sizing guidelines,
whether the resource is an Amazon EC2 instance, an AWS storage class, or an Amazon RDS instance.
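The recommendations can also be retrieved programmatically. The following is a minimal, unofficial sketch using the AWS SDK for Python (Boto3); Compute Optimizer must already be opted in, and the response fields shown may need adjustment for your account.

```python
import boto3

compute_optimizer = boto3.client("compute-optimizer")

next_token = None
while True:
    kwargs = {"maxResults": 100}
    if next_token:
        kwargs["nextToken"] = next_token
    response = compute_optimizer.get_ec2_instance_recommendations(**kwargs)
    for rec in response.get("instanceRecommendations", []):
        best_option = rec["recommendationOptions"][0]  # options are ranked, best first
        print(rec["currentInstanceType"], rec["finding"], "->", best_option["instanceType"])
    next_token = response.get("nextToken")
    if not next_token:
        break
```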
Related videos:
Related examples:
• Attribute based Instance Type Selection for Auto Scaling for Amazon EC2 Fleet
• Optimizing Amazon Elastic Container Service for cost using scheduled scaling
• Predictive scaling with Amazon EC2 Auto Scaling
• Optimize Costs and Gain Visibility into Usage with Amazon S3 Storage Lens
• Well-Architected Labs: Rightsizing Recommendations (Level 100)
For already-deployed services at the organization level for multiple business units, consider using
shared resources to increase utilization and reduce total cost of ownership (TCO). Using shared
resources can be a cost-effective option to centralize the management and costs by using existing
solutions, sharing components, or both. Manage common functions like monitoring, backups, and
connectivity either within an account boundary or in a dedicated account. You can also reduce cost
by implementing standardization, reducing duplication, and reducing complexity.
Implementation guidance
Where multiple workloads perform the same function, use existing solutions and shared components
to improve management and optimize costs. Consider using existing resources (especially shared
ones), such as non-production database servers or directory services, to reduce cloud costs while
following security best practices and organizational regulations. For optimal value realization and
efficiency, it is crucial to allocate costs back (using showback and chargeback) to the pertinent
areas of the business driving consumption.
You should know where you have incurred costs at the resource, workload, team, or organization
level, as this knowledge enhances your understanding of the value delivered at the applicable level
when compared to the business outcomes achieved. Ultimately, organizations benefit from cost
savings as a result of sharing cloud infrastructure. Encourage cost allocation on shared cloud
resources to optimize cloud spend.
Implementation steps
• Evaluate existing resources: Review existing workloads that use services similar to those your
workload needs. Depending on the workload's components, consider existing platforms if business
logic or technical requirements allow.
• Use resource sharing in AWS RAM and restrict accordingly: Use AWS RAM to share resources
with other AWS accounts within your organization. When you share resources, you don’t need
to duplicate resources in multiple accounts, which minimizes the operational burden of resource
maintenance. This process also helps you securely share the resources that you have created with
roles and users in your account, as well as with other AWS accounts (a combined sketch follows
these steps).
• Tag resources: Tag resources that are candidates for cost reporting and categorize them within
cost categories. Activate these cost related resource tags for cost allocation to provide visibility
of AWS resources usage. Focus on creating an appropriate level of granularity with respect to
cost and usage visibility, and influence cloud consumption behaviors through cost allocation
reporting and KPI tracking.
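As a combined sketch of the sharing and tagging steps above, using the AWS SDK for Python (Boto3); the resource ARN, account ID, and tag key are placeholders for illustration.

```python
import boto3

ram = boto3.client("ram")
ce = boto3.client("ce")  # Cost Explorer

# Share a resource (here a placeholder subnet ARN) with another account in your organization.
ram.create_resource_share(
    name="shared-network",
    resourceArns=["arn:aws:ec2:us-east-1:123456789012:subnet/subnet-0abc1234"],
    principals=["111122223333"],    # an account ID, or an organization or OU ARN
    allowExternalPrincipals=False,  # keep sharing inside the organization
)

# Activate a user-defined tag key as a cost allocation tag so shared usage can be charged back.
ce.update_cost_allocation_tags_status(
    CostAllocationTagsStatus=[{"TagKey": "CostCenter", "Status": "Active"}]
)
```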
Resources
Related documents:
Related videos:
determine the most appropriate pricing model. Often your pricing model consists of a combination
of multiple options, as determined by your availability requirements and usage patterns.
On-Demand Instances allow you to pay for compute or database capacity by the hour or by
the second (60 seconds minimum), depending on which instances you run, without long-term
commitments or upfront payments.
Savings Plans are a flexible pricing model that offers low prices on Amazon EC2, AWS Lambda, and
AWS Fargate (Fargate) usage, in exchange for a commitment to a consistent amount of usage
(measured in dollars per hour) over a one-year or three-year term.
Spot Instances are an Amazon EC2 pricing mechanism that allows you to request spare compute
capacity at a discounted hourly rate (up to 90% off the On-Demand price) without an upfront
commitment.
Reserved Instances provide a discount of up to 75 percent in exchange for prepaying for capacity.
For more details, see Optimizing costs with reservations.
You might choose to include a Savings Plan for the resources associated with the production,
quality, and development environments. Alternatively, because sandbox resources are only
powered on when needed, you might choose an On-Demand model for the resources in that
environment. Use Amazon EC2 Spot Instances to reduce Amazon EC2 costs, or use Compute Savings
Plans to reduce Amazon EC2, Fargate, and Lambda costs. The AWS Cost Explorer recommendations
tool identifies opportunities for commitment discounts with Savings Plans.
If you have been purchasing Reserved Instances for Amazon EC2 in the past or have established
cost allocation practices inside your organization, you can continue using Amazon EC2 Reserved
Instances for the time being. However, we recommend working on a strategy to use Savings
Plans in the future as a more flexible cost savings mechanism. You can refresh Savings Plans (SP)
Recommendations in AWS Cost Management to generate new Savings Plans Recommendations
at any time. Use Reserved Instances (RI) to reduce Amazon RDS, Amazon Redshift, Amazon
ElastiCache, and Amazon OpenSearch Service costs. Savings Plans and Reserved Instances
are available in three payment options: all upfront, partial upfront, and no upfront. Use the
recommendations provided in AWS Cost Explorer RI and SP purchase recommendations.
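These recommendations are also available through the Cost Explorer API. The following is a minimal sketch using the AWS SDK for Python (Boto3); the summary fields shown are examples and should be checked against the response your account returns.

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

response = ce.get_savings_plans_purchase_recommendation(
    SavingsPlansType="COMPUTE_SP",      # Compute Savings Plans cover EC2, Fargate, and Lambda
    TermInYears="ONE_YEAR",
    PaymentOption="NO_UPFRONT",
    LookbackPeriodInDays="THIRTY_DAYS",
)

summary = response["SavingsPlansPurchaseRecommendation"].get(
    "SavingsPlansPurchaseRecommendationSummary", {}
)
print("Recommended hourly commitment:", summary.get("HourlyCommitmentToPurchase"))
print("Estimated monthly savings:", summary.get("EstimatedMonthlySavingsAmount"))
```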
To find opportunities for Spot workloads, use an hourly view of your overall usage, and look
for regular periods of changing usage or elasticity. You can use Spot Instances for various fault-
tolerant and flexible applications. Examples include stateless web servers, API endpoints, big data
and analytics applications, containerized workloads, CI/CD, and other flexible workloads.
Related videos:
Related examples:
Resource pricing may be different in each Region. Identify Regional cost differences, and deploy in
higher-cost Regions only when needed to meet latency, data residency, and data sovereignty
requirements. Factoring in Region cost helps you pay the lowest overall price for this workload.
Implementation guidance
The AWS Cloud infrastructure is global, hosted in multiple locations worldwide, and built around
AWS Regions, Availability Zones, Local Zones, AWS Outposts, and Wavelength Zones. A Region
is a physical location in the world, and each Region is a separate geographic area where AWS has
multiple Availability Zones. Availability Zones, which are multiple isolated locations within each
Region, consist of one or more discrete data centers, each with redundant power, networking, and
connectivity.
Each AWS Region operates within local market conditions, and resource pricing differs between
Regions due to differences in the cost of land, fiber, electricity, and taxes, for example. Choose
a specific Region to operate a component of your solution, or your entire solution, so that you can
run it at the lowest possible price globally. Use the AWS Pricing Calculator to estimate the costs of
your workload in various Regions by searching for services by location type (Region, Wavelength
Zone, and Local Zone) and Region.
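You can also compare Regional prices programmatically with the AWS Price List API. The following is a minimal sketch using the AWS SDK for Python (Boto3); the filters and JSON navigation reflect the typical price list structure and may need adjustment for other services or attributes.

```python
import json
import boto3

# The Price List API endpoint is only available in a few Regions; us-east-1 is used here.
pricing = boto3.client("pricing", region_name="us-east-1")

def on_demand_hourly_price(instance_type: str, location: str) -> float:
    """Return the On-Demand hourly price for a Linux, shared-tenancy instance in a Region."""
    response = pricing.get_products(
        ServiceCode="AmazonEC2",
        Filters=[
            {"Type": "TERM_MATCH", "Field": "instanceType", "Value": instance_type},
            {"Type": "TERM_MATCH", "Field": "location", "Value": location},
            {"Type": "TERM_MATCH", "Field": "operatingSystem", "Value": "Linux"},
            {"Type": "TERM_MATCH", "Field": "tenancy", "Value": "Shared"},
            {"Type": "TERM_MATCH", "Field": "preInstalledSw", "Value": "NA"},
            {"Type": "TERM_MATCH", "Field": "capacitystatus", "Value": "Used"},
        ],
        MaxResults=1,
    )
    product = json.loads(response["PriceList"][0])            # each entry is a JSON string
    term = next(iter(product["terms"]["OnDemand"].values()))
    dimension = next(iter(term["priceDimensions"].values()))
    return float(dimension["pricePerUnit"]["USD"])

for region in ["US East (N. Virginia)", "EU (Ireland)", "South America (Sao Paulo)"]:
    print(region, on_demand_hourly_price("m5.large", region), "USD per hour")
```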
to not use multiple Regions. If there are no obligations restricting you to a single Region, then
consider using multiple Regions.
• Analyze required data transfer: Consider data transfer costs when selecting Regions. Keep your
data close to your customers and close to the resources that consume it. Select less costly AWS
Regions where data flows and where there is minimal data transfer. Depending on your business requirements
for data transfer, you can use Amazon CloudFront, AWS PrivateLink, AWS Direct Connect, and
AWS Virtual Private Network to reduce your networking costs, improve performance, and
enhance security.
Resources
Related documents:
Related videos:
Related examples:
Cost-efficient agreements and terms ensure that the cost of these services scales with the benefits
they provide. Select agreements and pricing that scale when they provide additional benefits to
your organization.
Related videos:
Permanently running resources should use reserved capacity such as Savings Plans or Reserved
Instances. Configure short-term capacity to use Spot Instances or Spot Fleet. Use On-Demand
Instances only for short-term workloads that cannot be interrupted and do not run long enough
to justify reserved capacity (between 25% and 75% of the period, depending on the resource type).
Implementation guidance
To improve cost efficiency, AWS provides multiple commitment recommendations based on your
past usage. You can use these recommendations to understand what you can save, and how the
commitment will be used. You can use these services On-Demand, use Spot Instances, or make a
commitment for a certain period of time and reduce your On-Demand costs with Reserved Instances
(RIs) and Savings Plans (SPs). To optimize your workload, you need to understand not only each
workload component and the multiple AWS services involved, but also the commitment discounts,
purchase options, and Spot Instances available for those services.
Consider the requirements of your workload’s components, and understand the different pricing
models for these services. Define the availability requirement of these components. Determine
if there are multiple independent resources that perform the function in the workload, and what
the workload requirements are over time. Compare the cost of the resources using the default On-
Demand pricing model and other applicable models. Factor in any potential changes in resources or
workload components.
For example, consider this Web Application Architecture on AWS. This sample workload consists
of multiple AWS services, such as Amazon Route 53, AWS WAF, Amazon CloudFront, Amazon EC2
instances, Amazon RDS instances, load balancers, Amazon S3 storage, and Amazon Elastic File
System (Amazon EFS). Review each of these services and identify potential cost saving
opportunities with different pricing models: some of them may be eligible for RIs or SPs, while
others may only be available On-Demand.
Related videos:
Related examples:
Regularly check billing and cost management tools, and review the recommended commitment and
reservation discounts, performing the analysis at the management account level.
Implementation guidance
Performing regular cost modeling helps you implement opportunities to optimize across multiple
workloads. For example, if multiple workloads use On-Demand Instances, then at an aggregate
level the risk of change is lower, and implementing a commitment-based discount can achieve a
lower overall cost. It is recommended to perform this analysis in regular cycles of two weeks to one
month.
This allows you to make small adjustment purchases, so the coverage of your pricing models
continues to evolve with your changing workloads and their components.
Use the AWS Cost Explorer recommendations tool to find opportunities for commitment discounts
in your management account. Recommendations at the management account level are calculated
considering usage across all of the accounts in your AWS organization that have Reserved Instances
(RIs) or Savings Plans (SPs). They're also calculated when discount sharing is activated to recommend
a commitment that maximizes savings across accounts.
While purchasing at the management account level optimizes for maximum savings in many cases,
there may be situations where you might consider purchasing SPs at the linked account level, such
as when you want the discounts to apply first to usage in that particular linked account. Member
account recommendations are calculated at the individual account level to maximize savings for
each isolated account. If your account owns both RI and SP commitments, Reserved Instances are
applied first, followed by Savings Plans.
You can find the correct recommendations, with the required discounts and risk, by following the
Well-Architected labs.
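As a minimal sketch of retrieving reservation recommendations at the management account scope with the AWS SDK for Python (Boto3); the service name, term, and payment option are illustrative choices.

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

# AccountScope="PAYER" requests recommendations calculated across the whole organization;
# use "LINKED" instead to scope the analysis to an individual member account.
response = ce.get_reservation_purchase_recommendation(
    Service="Amazon Relational Database Service",
    AccountScope="PAYER",
    LookbackPeriodInDays="SIXTY_DAYS",
    TermInYears="ONE_YEAR",
    PaymentOption="PARTIAL_UPFRONT",
)

for recommendation in response.get("Recommendations", []):
    summary = recommendation.get("RecommendationSummary", {})
    print("Estimated monthly savings:", summary.get("TotalEstimatedMonthlySavingsAmount"))
```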
Resources
Related documents:
Related videos:
Related examples:
Verify that you plan and monitor data transfer charges so that you can make architectural decisions
to minimize costs. A small yet effective architectural change can drastically reduce your operational
costs over time.
Best practices
Implementation steps
• Identify requirements: What is the primary goal, and what are the business requirements, for the
planned data transfer between source and destination? What is the expected business outcome?
Gather the business requirements and define the expected outcome.
• Identify source and destination: What is the data source and destination for the data transfer,
such as within AWS Regions, to AWS services, or out to the internet?
• Identify data classifications: What is the data classification for this data transfer? What kind of
data is it? How big is the data? How frequently must data be transferred? Is data sensitive?
• Identify AWS services or tools to use: Which AWS services are used for this data transfer? Is it
possible to use an already-provisioned service for another workload?
• Calculate data transfer costs: Use AWS Pricing and the data transfer modeling you created
previously to calculate the data transfer costs for the workload. Calculate the data transfer costs
at different usage levels, for both increases and reductions in workload usage. Where there are
multiple options for the workload architecture, calculate the cost for each option for comparison.
• Link costs to outcomes: For each data transfer cost incurred, specify the outcome that it
achieves for the workload. If it is a transfer between components, it may be for decoupling; if it is
between Availability Zones, it may be for redundancy.
• Create data transfer modeling: After gathering all of this information, create a conceptual
baseline data transfer model for multiple use cases and different workloads.
Resources
Related documents:
• AWS Pricing
are not, create new NAT gateways in the same Availability Zone as the resource to reduce cross-
AZ data transfer charges.
• Use AWS Direct Connect: AWS Direct Connect bypasses the public internet and establishes a
direct, private connection between your on-premises network and AWS. This can be more cost-
effective and consistent than transferring large volumes of data over the internet.
• Avoid transferring data across Regional boundaries: Data transfers between AWS Regions
(from one Region to another) typically incur charges. It should be a very thoughtful decision to
pursue a multi-Region path. For more detail, see Multi-Region scenarios.
• Monitor data transfer: Use Amazon CloudWatch and VPC Flow Logs to capture details about your
data transfer and network usage. Analyze the captured network traffic information in your VPCs,
such as the IP addresses or ranges going to and from network interfaces.
• Analyze your network usage: Use metering and reporting tools such as AWS Cost Explorer,
CUDOS Dashboards, or CloudWatch to understand the data transfer costs of your workload.
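A simple way to see where data transfer costs accrue is to group Cost Explorer results by usage type and filter for data transfer entries. The following is a minimal sketch using the AWS SDK for Python (Boto3); the time period is a placeholder, and the string match is a convenience rather than an official classification.

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-05-01", "End": "2024-06-01"},  # placeholder period
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
)

# Usage types that bill for data transfer typically contain "DataTransfer" in their name,
# so filtering client-side avoids having to know exact dimension values up front.
for result in response["ResultsByTime"]:
    for group in result["Groups"]:
        usage_type = group["Keys"][0]
        if "DataTransfer" in usage_type:
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            print(f"{usage_type}: {amount:.2f} USD")
```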
Implementation steps
• Select components for data transfer: Using the data transfer modeling explained in COST08-
BP01 Perform data transfer modeling, focus on where the largest data transfer costs are or
where they would be if the workload usage changes. Look for alternative architectures or
additional components that remove or reduce the need for data transfer (or lower its cost).
Resources
Related documents:
Related examples:
• NAT gateways provide built-in scaling and management, reducing costs compared to a
standalone NAT instance. Place NAT gateways in the same Availability Zones as high-traffic
instances, and consider using VPC endpoints for instances that need to access Amazon
DynamoDB or Amazon S3 to reduce data transfer and processing costs.
• Use AWS Snow Family devices, which have computing resources to collect and process data at
the edge. AWS Snow Family devices (Snowcone, Snowball, and Snowmobile) allow you to move
petabytes of data to the AWS Cloud cost-effectively and offline.
Implementation steps
• Implement services: Select applicable AWS networking services based on your workload type,
using the data transfer modeling and a review of VPC Flow Logs. Look at where the largest
costs and highest-volume flows are. Review the AWS services and assess whether there is a
service that reduces or removes the transfer, particularly for networking and content delivery. Also
look for caching services where there is repeated access to data or large amounts of data.
Resources
Related documents:
• Amazon CloudFront
• AWS Snow Family
Related videos:
Related examples:
Implementation guidance
Analyzing workload demand for cloud computing involves understanding the patterns and
characteristics of computing tasks that are initiated in the cloud environment. This analysis helps
users optimize resource allocation, manage costs, and verify that performance meets required
levels.
Know the requirements of the workload. Your organization's requirements should indicate the
workload response times for requests. The response time can be used to determine if the demand
is managed, or if the supply of resources should change to meet the demand.
The analysis should include the predictability and repeatability of the demand, the rate of change
in demand, and the amount of change in demand. Perform the analysis over a long enough period
to incorporate any seasonal variance, such as end-of-month processing or holiday peaks.
Analysis effort should reflect the potential benefits of implementing scaling. Look at the expected
total cost of the component and any increases or decreases in usage and cost over the workload's
lifetime.
The following are some key aspects to consider when performing workload demand analysis for
cloud computing:
1. Resource utilization and performance metrics: Analyze how AWS resources are being used over
time. Determine peak and off-peak usage patterns to optimize resource allocation and scaling
strategies. Monitor performance metrics such as response times, latency, throughput, and error
rates. These metrics help assess the overall health and efficiency of the cloud infrastructure (a
monitoring sketch follows this list).
2. User and application scaling behavior: Understand user behavior and how it affects workload
demand. Examining the patterns of user traffic assists in enhancing the delivery of content
and the responsiveness of applications. Analyze how workloads scale with increasing demand.
Determine whether auto-scaling parameters are configured correctly and effectively for
handling load fluctuations.
3. Workload types: Identify the different types of workloads running in the cloud, such as batch
processing, real-time data processing, web applications, databases, or machine learning. Each
type of workload may have different resource requirements and performance profiles.
4. Service-level agreements (SLAs): Compare actual performance with SLAs to ensure compliance
and identify areas that need improvement.
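As a minimal monitoring sketch for the first aspect, using the AWS SDK for Python (Boto3) and Amazon CloudWatch; the Auto Scaling group name is a placeholder, and the p99 statistic is chosen to reflect peaks rather than averages.

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(days=14)  # cover at least one full workload cycle

response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            "Id": "cpu_p99",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": "CPUUtilization",
                    "Dimensions": [{"Name": "AutoScalingGroupName", "Value": "web-asg"}],
                },
                "Period": 3600,
                "Stat": "p99",  # prefer p99 or Maximum over Average for sizing decisions
            },
        }
    ],
    StartTime=start,
    EndTime=end,
)

values = response["MetricDataResults"][0]["Values"]
print("Peak hourly p99 CPU utilization:", max(values) if values else "no data")
```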
• AWS X-Ray
• AWS Auto Scaling
• Amazon QuickSight
Related videos:
Related examples:
Buffering and throttling modify the demand on your workload, smoothing out any peaks.
Implement throttling when your clients perform retries. Implement buffering to store the request
and defer processing until a later time. Verify that your throttles and buffers are designed so clients
receive a response in the required time.
Implementation guidance
Implementing a buffer or throttle is crucial in cloud computing in order to manage demand and
reduce the provisioned capacity required for your workload. For optimal performance, it's essential
to gauge the total demand, including peaks, the pace of change in requests, and the necessary
response time. When clients have the ability to resend their requests, it becomes practical to apply
throttling. Conversely, for clients lacking retry functionalities, the ideal approach is implementing
a buffer solution. Such buffers streamline the influx of requests and optimize the interaction of
applications with varied operational speeds.
Buffering and throttling can smooth out peaks by modifying the demand on your workload.
Use throttling when clients retry actions, and use buffering to hold requests and process them later.
When working with a buffer-based approach, architect your workload to service the request in the
required time, and verify that you are able to handle duplicate requests for work. Analyze the overall
demand, rate of change, and required response time to right size the throttle or buffer required.
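As a minimal buffering sketch using Amazon SQS with the AWS SDK for Python (Boto3); the queue URL is a placeholder, and process() stands in for your workload's actual processing logic, which must be idempotent because duplicate deliveries are possible.

```python
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/request-buffer"  # placeholder

def process(request: dict) -> None:
    # Placeholder for the real work; must tolerate being called twice for the same request.
    print("processing", request)

def enqueue(request: dict) -> None:
    """Producer: buffer the request instead of processing it synchronously."""
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(request))

def worker() -> None:
    """Consumer: drain the buffer at a steady rate sized for average, not peak, demand."""
    while True:
        response = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for message in response.get("Messages", []):
            process(json.loads(message["Body"]))
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```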
Implementation steps
• Analyze the client requirements: Analyze the client requests to determine if they are capable
of performing retries. For clients that cannot perform retries, buffers need to be implemented.
Analyze the overall demand, rate of change, and required response time to determine the size of
throttle or buffer required.
Resources
Related documents:
• Amazon Kinesis
Related videos:
You can also easily configure schedules for your Amazon EC2 instances across your accounts
and Regions with a simple user interface (UI) using AWS Systems Manager Quick Setup. You can
schedule Amazon EC2 or Amazon RDS instances with AWS Instance Scheduler, and you can stop
and start existing instances. However, you cannot stop and start instances that are part of an
Auto Scaling group (ASG) or that are managed by services such as Amazon Redshift or Amazon
OpenSearch Service. Auto Scaling groups have their own scheduling for the instances in the group,
and those instances are created and removed as part of the group's scaling actions.
AWS Auto Scaling helps you adjust your capacity to maintain steady, predictable performance
at the lowest possible cost to meet changing demand. It is a fully managed and free service to
scale the capacity of your application that integrates with Amazon EC2 instances and Spot Fleets,
Amazon ECS, Amazon DynamoDB, and Amazon Aurora. Auto Scaling provides automatic resource
discovery to help find resources in your workload that can be configured, it has built-in scaling
strategies to optimize performance, costs, or a balance between the two, and provides predictive
scaling to assist with regularly occurring spikes.
There are multiple scaling options available to scale your Auto Scaling group:
• Dynamic scaling (such as target tracking): Automatically increases capacity during demand spikes
to maintain performance and decreases capacity when demand subsides to reduce costs.
• Simple/step scaling: Monitors metrics and adds or removes instances according to steps that you
define.
When architecting with a demand-based approach, keep in mind two key considerations. First,
understand how quickly you must provision new resources. Second, understand that the size of
margin between supply and demand will shift. You must be ready to cope with the rate of change
in demand and also be ready for resource failures.
Time-based supply: A time-based approach aligns resource capacity to demand that is predictable
or well-defined by time. This approach is typically not dependent upon utilization levels of the
resources. A time-based approach ensures that resources are available at the specific time they
are required and can be provided without any delays due to start-up procedures and system or
consistency checks. Using a time-based approach, you can provide additional resources or increase
capacity during busy periods.
When architecting with a time-based approach, keep in mind two key considerations. First,
how consistent is the usage pattern? Second, what is the impact if the pattern changes? You
can increase the accuracy of predictions by monitoring your workloads and by using business
intelligence. If you see significant changes in the usage pattern, you can adjust the times to ensure
that coverage is provided.
Implementation steps
• Configure scheduled scaling: For predictable changes in demand, time-based scaling can
provide the correct number of resources in a timely manner. It is also useful if resource
creation and configuration is not fast enough to respond to changes in demand. Using your
workload analysis, configure scheduled scaling using AWS Auto Scaling. To configure time-
based scheduling, you can use predictive scaling or scheduled scaling to increase the number
of Amazon EC2 instances in your Auto Scaling groups in advance of expected or predictable
load changes (a scheduled scaling sketch follows these steps).
• Configure predictive scaling: Predictive scaling allows you to increase the number of Amazon
EC2 instances in your Auto Scaling group in advance of daily and weekly patterns in traffic flows.
If you have regular traffic spikes and applications that take a long time to start, you should
consider using predictive scaling. Predictive scaling can help you scale faster by initializing
capacity before projected load compared to dynamic scaling alone, which is reactive in nature.
For example, if users start using your workload at the start of business hours and don't use it
after hours, then predictive scaling can add capacity before business hours, which eliminates the
delay of dynamic scaling reacting to changing traffic.
• Configure dynamic automatic scaling: To configure scaling based on active workload metrics,
use Auto Scaling. Use the analysis and configure Auto Scaling to launch on the correct resource
levels, and verify that the workload scales in the required time. You can launch and automatically
scale a fleet of On-Demand Instances and Spot Instances within a single Auto Scaling group.
In addition to receiving discounts for using Spot Instances, you can use Reserved Instances or a
Savings Plan to receive discounted rates of the regular On-Demand Instance pricing. All of these
factors combined help you to optimize your cost savings for Amazon EC2 instances and help you
get the desired scale and performance for your application.
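As a minimal scheduled scaling sketch using the AWS SDK for Python (Boto3); the Auto Scaling group name, times, and capacities are placeholders to adapt to your own workload analysis.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Scale out before business hours (cron syntax, Monday to Friday, given time zone).
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="web-asg",
    ScheduledActionName="business-hours-scale-out",
    Recurrence="0 8 * * 1-5",
    TimeZone="Europe/London",
    MinSize=4,
    DesiredCapacity=6,
    MaxSize=12,
)

# Scale back in after hours to reduce cost.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="web-asg",
    ScheduledActionName="after-hours-scale-in",
    Recurrence="0 20 * * 1-5",
    TimeZone="Europe/London",
    MinSize=1,
    DesiredCapacity=1,
    MaxSize=12,
)
```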
Resources
Related documents:
Develop a process that defines the criteria and process for workload review. The review effort
should reflect potential benefit. For example, core workloads or workloads with a value of over ten
percent of the bill are reviewed quarterly or every six months, while workloads below ten percent
are reviewed annually.
Implementation guidance
To have the most cost-efficient workload, you must regularly review the workload to know if there
are opportunities to implement new services, features, and components. To achieve overall lower
costs, the process must be proportional to the potential amount of savings. For example, workloads
that are 50% of your overall spend should be reviewed more regularly, and more thoroughly, than
workloads that are five percent of your overall spend. Factor in any external factors or volatility.
If the workload services a specific geography or market segment, and change in that area is
predicted, more frequent reviews could lead to cost savings. Another factor in review is the effort
to implement changes. If there are significant costs in testing and validating changes, reviews
should be less frequent.
Factor in the long-term cost of maintaining outdated and legacy components and resources, and
the inability to implement new features into them. The current cost of testing and validation may
exceed the proposed benefit. However, over time, the cost of making the change may significantly
increase as the gap between the workload and the current technologies increases, resulting in even
larger costs. For example, the cost of moving to a new programming language may not currently
be cost effective. However, in five years' time, the cost of people skilled in that language may
increase, and due to workload growth, you would be moving an even larger system to the new
language, requiring even more effort than previously.
Break down your workload into components, assign the cost of the component (an estimate
is sufficient), and then list the factors (for example, effort and external markets) next to each
component. Use these indicators to determine a review frequency for each workload. For example,
you may have web servers as a high cost, low change effort, and high external factors, resulting
in high frequency of review. A central database may be medium cost, high change effort, and low
external factors, resulting in a medium frequency of review.
Define a process to evaluate new services, design patterns, resource types, and configurations to
optimize your workload cost as they become available. Similar to performance pillar review and
Existing workloads are regularly reviewed based on each defined process to find out if new services
can be adopted, existing services can be replaced, or workloads can be re-architected.
Implementation guidance
AWS is constantly adding new features so you can experiment and innovate faster with the latest
technology. AWS What's New details how AWS is doing this and provides a quick overview of AWS
services, features, and Regional expansion announcements as they are released. You can dive
deeper into the launches that have been announced and use them for your review and analysis
of your existing workloads. To realize the benefits of new AWS services and features, review
your workloads and implement new services and features as required. This means you may
need to replace existing services you use for your workload, or modernize your workload to adopt
these new AWS services. For example, you might review your workloads and replace the messaging
component with Amazon Simple Email Service. This removes the cost of operating and maintaining
a fleet of instances, while providing all the functionality at a reduced cost.
To analyze your workload and highlight potential opportunities, you should consider not only
new services but also new ways of building solutions. Review the This is My Architecture videos
on AWS to learn about other customers’ architecture designs, their challenges and their solutions.
Check the All-In series to see real-world applications of AWS services and customer stories.
You can also watch the Back to Basics video series that explains, examines, and breaks down basic
cloud architecture pattern best practices. Another source is How to Build This videos, which are
designed to assist people with big ideas on how to bring their minimum viable product (MVP) to
life using AWS services. It is a way for builders from all over the world who have a strong idea to
gain architectural guidance from experienced AWS Solutions Architects. Finally, you can review the
Getting Started resource materials, which include step-by-step tutorials.
Before starting your review process, gather your business requirements for the workload, the
security and data privacy requirements that determine which services or Regions you can use, and
your performance requirements, and then follow your agreed review process.
Implementation steps
operations through automation. Assess the time and associated costs required for operational
efforts and implement automation for administrative tasks to minimize manual effort wherever
feasible.
Implementation guidance
Automating operations reduces the frequency of manual tasks, improves efficiency, and benefits
customers by delivering a consistent and reliable experience when deploying, administering, or
operating workloads. You can free up infrastructure resources from manual operational tasks
and use them for higher value tasks and innovations, which improves business value. Enterprises
require a proven, tested way to manage their workloads in the cloud. That solution must be secure,
fast, and cost effective, with minimum risk and maximum reliability.
Start by prioritizing your operational activities based on required effort by looking at overall
operations cost. For example, how long does it take to deploy new resources in the cloud, make
optimization changes to existing ones, or implement necessary configurations? Look at the total
cost of human actions by factoring in cost of operations and management. Prioritize automations
for admin tasks to reduce the human effort.
Review effort should reflect the potential benefit. For example, examine the time spent performing
tasks manually as opposed to automatically. Prioritize automating repetitive, high-value, time-
consuming, and complex activities. Activities that have high value or a high risk of human error
are typically the best place to start automating, as that risk often creates unwanted additional
operational cost (like the operations team working extra hours).
Use automation tools like AWS Systems Manager or AWS Config to streamline operations,
compliance, monitoring, lifecycle, and termination processes. With AWS services, tools, and
third-party products, you can customize the automations you implement to meet your specific
requirements. The following list shows some of the core operation functions and capabilities you can
automate with AWS services for administration and operations:
• AWS Audit Manager: Continually audit your AWS usage to simplify risk and compliance
assessment
• AWS Backup: Centrally manage and automate data protection.
• AWS Config: Assess, audit, and evaluate the configurations and resource inventory of your AWS
resources.
• AWS CloudFormation: Launch highly available resources with Infrastructure as Code.
with the capabilities of AWS Config and AWS CloudFormation, you can efficiently manage and
automate configuration compliance at scale for hundreds of member accounts. You can review
changes in configurations and relationships between AWS resources and dive into the history of
a resource configuration.
• Automate monitoring tasks: AWS provides various tools that you can use to monitor services.
You can configure these tools to automate monitoring tasks. Create and implement a monitoring
plan that collects monitoring data from all the parts of your workload so that you can more
easily debug a multi-point failure if one occurs. For example, you can use automated
monitoring tools to observe Amazon EC2 and report back to you when something is wrong
through system status checks, instance status checks, and Amazon CloudWatch alarms.
• Create a continual lifecycle with automations: It is important to establish and preserve
mature lifecycle policies, not only for regulations or redundancy but also for cost optimization.
You can use AWS Backup to centrally manage and automate data protection of data stores, such
as your buckets, volumes, databases, and file systems. You can also use Amazon Data Lifecycle
Manager to automate the creation, retention, and deletion of EBS snapshots and EBS-backed
AMIs (a minimal sketch follows this list).
• Delete unnecessary resources: It's quite common to accumulate unused resources in sandbox
or development AWS accounts. Developers create and experiment with various services and
resources as part of the normal development cycle, and then they don't delete those resources
when they're no longer needed. Unused resources can incur unnecessary and sometimes high
costs for the organization. Deleting these resources can reduce the costs of operating these
environments. If you are not sure, confirm that the data is no longer needed or has been backed up before deleting resources. You can use
AWS CloudFormation to clean up deployed stacks, which automatically deletes most resources
defined in the template. Alternatively, you can create an automation for the deletion of AWS
resources using tools like aws-nuke.
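As a minimal lifecycle automation sketch with Amazon Data Lifecycle Manager using the AWS SDK for Python (Boto3); the role ARN, tag, and retention values are placeholders, and the role must already have the Data Lifecycle Manager permissions.

```python
import boto3

dlm = boto3.client("dlm")

dlm.create_lifecycle_policy(
    ExecutionRoleArn="arn:aws:iam::123456789012:role/AWSDataLifecycleManagerDefaultRole",
    Description="Daily EBS snapshots of tagged volumes, retained for 7 days",
    State="ENABLED",
    PolicyDetails={
        "PolicyType": "EBS_SNAPSHOT_MANAGEMENT",
        "ResourceTypes": ["VOLUME"],
        "TargetTags": [{"Key": "Backup", "Value": "true"}],
        "Schedules": [
            {
                "Name": "DailySnapshots",
                "CreateRule": {"Interval": 24, "IntervalUnit": "HOURS", "Times": ["03:00"]},
                "RetainRule": {"Count": 7},  # older snapshots are deleted automatically
                "CopyTags": True,
            }
        ],
    },
)
```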
Region selection
Question
• SUS 1 How do you select Regions for your workload?
The choice of Region for your workload significantly affects its KPIs, including performance, cost,
and carbon footprint. To effectively improve these KPIs, you should choose Regions for your
workloads based on both business requirements and sustainability goals.
Best practices
• SUS01-BP01 Choose Region based on both business requirements and sustainability goals
SUS01-BP01 Choose Region based on both business requirements and sustainability goals
Choose a Region for your workload based on both your business requirements and sustainability
goals to optimize its KPIs, including performance, cost, and carbon footprint.
Common anti-patterns:
Benefits of establishing this best practice: Placing a workload close to Amazon renewable energy
projects or Regions with low published carbon intensity can help to lower the carbon footprint of a
cloud workload.
Related videos:
Alignment to demand
Question
• SUS 2 How do you align cloud resources to your demand?
The way users and applications consume your workloads and other resources can help you identify
improvements to meet sustainability goals. Scale infrastructure to continually match demand and
verify that you use only the minimum resources required to support your users. Align service levels
to customer needs. Position resources to limit the network required for users and applications to
consume them. Remove unused assets. Provide your team members with devices that support their
needs and minimize their sustainability impact.
Best practices
• SUS02-BP01 Scale workload infrastructure dynamically
• SUS02-BP02 Align SLAs with sustainability goals
• SUS02-BP03 Stop the creation and maintenance of unused assets
• SUS02-BP04 Optimize geographic placement of workloads based on their networking
requirements
• SUS02-BP05 Optimize team member resources for activities performed
• SUS02-BP06 Implement buffering or throttling to flatten the demand curve
Use the elasticity of the cloud and scale your infrastructure dynamically to match the supply of
cloud resources to demand and avoid overprovisioned capacity in your workload.
Implementation steps
• Elasticity matches the supply of resources you have against the demand for those resources.
Instances, containers, and functions provide mechanisms for elasticity, either in combination
with automatic scaling or as a feature of the service. AWS provides a range of auto scaling
mechanisms to ensure that workloads can scale down quickly and easily during periods of low
user load. Here are some examples of auto scaling mechanisms:
• Amazon EC2 Auto Scaling: Use to verify that you have the correct number of Amazon EC2
instances available to handle the user load for your application.
• Scaling is often discussed related to compute services like Amazon EC2 instances or AWS Lambda
functions. Consider the configuration of non-compute services like Amazon DynamoDB read and
write capacity units or Amazon Kinesis Data Streams shards to match the demand.
• Verify that the metrics for scaling up or down are validated against the type of workload being
deployed. If you are deploying a video transcoding application, 100% CPU utilization is expected
and should not be your primary metric. You can use a customized metric (such as memory
utilization) for your scaling policy if required. To choose the right metrics, consider the following
guidance for Amazon EC2:
• The metric should be a valid utilization metric and describe how busy an instance is.
• The metric value must increase or decrease proportionally to the number of instances in the
Auto Scaling group.
• Use dynamic scaling instead of manual scaling for your Auto Scaling group. We also recommend
that you use target tracking scaling policies in your dynamic scaling (a minimal sketch follows this
list).
• Verify that workload deployments can handle both scale-out and scale-in events. Create test
scenarios for scale-in events to verify that the workload behaves as expected and does not affect
the user experience.
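As a minimal target tracking sketch using the AWS SDK for Python (Boto3); the group name and target value are placeholders and should come from validated workload metrics.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Keep the group's average CPU utilization near the target; the service scales out above it
# and scales in below it, with built-in cooldown behavior.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
        "TargetValue": 50.0,
    },
)
```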
Review and optimize workload service-level agreements (SLA) based on your sustainability goals
to minimize the resources required to support your workload while continuing to meet business
needs.
Common anti-patterns:
Benefits of establishing this best practice: Aligning SLAs with sustainability goals leads to optimal
resource usage while meeting business needs.
Implementation guidance
SLAs define the level of service expected from a cloud workload, such as response time, availability,
and data retention. They influence the architecture, resource usage, and environmental impact of
a cloud workload. At a regular cadence, review SLAs and make trade-offs that significantly reduce
resource usage in exchange for acceptable decreases in service levels.
Implementation steps
Common anti-patterns:
• You do not analyze your application for assets that are redundant or no longer required.
Benefits of establishing this best practice: Removing unused assets frees resources and improves
the overall efficiency of the workload.
Implementation guidance
Unused assets consume cloud resources like storage space and compute power. By identifying
and eliminating these assets, you can free up these resources, resulting in a more efficient cloud
architecture. Perform regular analysis on application assets such as pre-compiled reports, datasets,
static images, and asset access patterns to identify redundancy, underutilization, and potential
decommission targets. Remove those redundant assets to reduce the resource waste in your
workload.
Implementation steps
• Conduct an inventory: Conduct a comprehensive inventory to identify all assets within your
workload.
• Analyze usage: Use continuous monitoring to identify static assets that are no longer required.
• Remove unused assets: Develop a plan to remove assets that are no longer required.
• Before removing any asset, evaluate the impact of removing it on the architecture.
• Update your applications to no longer produce and store assets that are not required.
• Communicate with third parties: Instruct third parties to stop producing and storing assets
managed on your behalf that are no longer required. Ask to consolidate redundant assets.
• Use lifecycle policies: Use lifecycle policies to automatically delete unused assets.
• You can use Amazon S3 Lifecycle to manage your objects throughout their lifecycle (a minimal sketch follows these steps).
• You can use Amazon Data Lifecycle Manager to automate the creation, retention, and deletion
of Amazon EBS snapshots and Amazon EBS-backed AMIs.
• Review and optimize: Regularly review your workload to identify and remove any unused assets.
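As a minimal Amazon S3 Lifecycle sketch using the AWS SDK for Python (Boto3); the bucket name, prefix, and retention periods are placeholders for illustration.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-reports-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "retire-stale-reports",
                "Filter": {"Prefix": "reports/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
                "Expiration": {"Days": 365},  # delete assets that are no longer required
            }
        ]
    },
)
```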
Analyze the network access patterns in your workload to identify how to use these cloud location
options and reduce the distance network traffic must travel.
Implementation steps
• Analyze network access patterns in your workload to identify how users use your application.
• Use monitoring tools, such as Amazon CloudWatch and AWS CloudTrail, to gather data on
network activities.
• Select the Regions for your workload deployment based on the following key elements:
• Where your data is located: For data-heavy applications (such as big data and machine
learning), application code should run as close to the data as possible.
• Where your users are located: For user-facing applications, choose a Region (or Regions) close
to your workload’s users.
• Other constraints: Consider constraints such as cost and compliance as explained in What to
Consider when Selecting a Region for your Workloads.
• Use local caching or AWS Caching Solutions for frequently used assets to improve performance,
reduce data movement, and lower environmental impact.
• Use services that can help you run code closer to users of your workload:
• AWS re:Invent 2023 - A migration strategy for edge and on-premises workloads
• AWS re:Invent 2021 - AWS Outposts: Bringing the AWS experience on premises
• AWS re:Invent 2020 - AWS Wavelength: Run apps with ultra-low latency at 5G edge
• AWS re:Invent 2022 - AWS Local Zones: Building applications for a distributed edge
• AWS re:Invent 2022 - Improve performance and availability with AWS Global Accelerator
• AWS re:Invent 2022 - Build your global wide area network using AWS
Related examples:
Common anti-patterns:
• You ignore the impact of devices used by your team members on the overall efficiency of your
cloud application.
Benefits of establishing this best practice: Optimizing team member resources improves the
overall efficiency of cloud-enabled applications.
Related videos:
Buffering and throttling flatten the demand curve and reduce the provisioned capacity required for
your workload.
Common anti-patterns:
Benefits of establishing this best practice: Flattening the demand curve reduces the required
provisioned capacity for the workload. Reducing the provisioned capacity means less energy
consumption and less environmental impact.
Implementation guidance
Flattening the workload demand curve can help you reduce the provisioned capacity for a
workload and reduce its environmental impact. Consider a workload with the demand curve shown
in the following figure. This workload has two peaks, and to handle those peaks, resource capacity
is provisioned above both of them. The resources and energy used for this workload are indicated
not by the area under the demand curve, but by the area under the provisioned capacity line,
because provisioned capacity is needed to handle those two peaks.
Demand curve with two distinct peaks that require high provisioned capacity.
Resources
Related documents:
Related videos:
Implement patterns for performing load smoothing and maintaining consistent high utilization
of deployed resources to minimize the resources consumed. Components might become idle from
lack of use because of changes in user behavior over time. Revise patterns and architecture to
consolidate under-utilized components to increase overall utilization. Retire components that are
no longer required. Understand the performance of your workload components, and optimize the
components that consume the most resources. Be aware of the devices that your customers use to
access your services, and implement patterns to minimize the need for device upgrades.
Best practices
• SUS03-BP01 Optimize software and architecture for asynchronous and scheduled jobs
Implementation steps
• Analyze the demand for your workload to determine how to respond to it.
• For requests or jobs that don’t require synchronous responses, use queue-driven architectures
and auto scaling workers to maximize utilization. Here are some examples of when you might
consider queue-driven architecture:
• AWS Batch job queues: AWS Batch jobs are submitted to a job queue, where they reside until
they can be scheduled to run in a compute environment.
• Amazon Simple Queue Service and Amazon EC2 Spot Instances: Pair Amazon SQS and Spot
Instances to build fault-tolerant and efficient architectures.
• For requests or jobs that can be processed anytime, use scheduling mechanisms to process jobs
in batches for more efficiency (a scheduling sketch follows these steps). Here are some examples of
scheduling mechanisms on AWS:
• Amazon Elastic Container Service (Amazon ECS) scheduled tasks: Amazon ECS supports creating
scheduled tasks. Scheduled tasks use Amazon EventBridge rules to run tasks either on a schedule
or in response to an EventBridge event.
• If you use polling and webhook mechanisms in your architecture, replace them with events. Use
event-driven architectures to build highly efficient workloads.
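As a minimal scheduling sketch using an Amazon EventBridge rule with the AWS SDK for Python (Boto3); the Lambda function ARN is a placeholder, and the function must separately grant EventBridge permission to invoke it.

```python
import boto3

events = boto3.client("events")

# Run a batch job in a typical off-peak window (02:00 UTC daily).
events.put_rule(
    Name="nightly-batch-window",
    ScheduleExpression="cron(0 2 * * ? *)",
    State="ENABLED",
)

events.put_targets(
    Rule="nightly-batch-window",
    Targets=[
        {
            "Id": "batch-processor",
            "Arn": "arn:aws:lambda:us-east-1:123456789012:function:process-batch",
        }
    ],
)
```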
Remove components that are unused and no longer required, and refactor components with little
utilization to minimize waste in your workload.
Common anti-patterns:
• You do not regularly check the utilization level of individual components of your workload.
• You do not check and analyze recommendations from AWS rightsizing tools such as AWS
Compute Optimizer.
Benefits of establishing this best practice: Removing unused components minimizes waste and
improves the overall efficiency of your cloud workload.
Implementation guidance
Review your workload to identify idle or unused components. This is an iterative improvement
process which can be initiated by changes in demand or the release of a new cloud service. For
example, a significant drop in AWS Lambda function run time can be an indicator of a need to
lower the memory size. Also, as AWS releases new services and features, the optimal services and
architecture for your workload may change.
Continually monitor workload activity and look for opportunities to improve the utilization level
of individual components. By removing idle components and performing rightsizing activities, you
meet your business requirements with the fewest cloud resources.
Implementation steps
• Have an inventory of your AWS resources. In AWS, you can turn on AWS Resource Explorer to
explore and organize your AWS resources. For more details, see AWS re:Invent 2022 - How to
manage resources and applications at scale on AWS.
• Monitor and capture the utilization metrics for critical components of your workload (like CPU
utilization, memory utilization, or network throughput in Amazon CloudWatch metrics).
• Identify unused or under-utilized components in your architecture.
• For stable workloads, check AWS rightsizing tools such as AWS Compute Optimizer at regular
intervals to identify idle, unused, or underutilized components.
Benefits of establishing this best practice: Using efficient code minimizes resource usage and
improves performance.
Implementation guidance
It is crucial to examine every functional area, including the code for a cloud architected application,
to optimize its resource usage and performance. Continually monitor your workload’s performance
in build environments and production and identify opportunities to improve code snippets that
have particularly high resource usage. Adopt a regular review process to identify bugs or anti-
patterns within your code that use resources inefficiently. Leverage simple and efficient algorithms
that produce the same results for your use case.
Implementation steps
• Use efficient programming language: Use an efficient operating system and programming
language for the workload. For details on energy efficient programming languages (including
Rust), see Sustainability with Rust.
• Use an AI coding companion: Consider using an AI coding companion such as Amazon
CodeWhisperer to efficiently write code.
• Automate code reviews: While developing your workloads, adopt an automated code review
process to improve quality and identify bugs and anti-patterns.
• Automate code reviews with Amazon CodeGuru Reviewer
• Detecting concurrency bugs with Amazon CodeGuru
• Raising code quality for Python applications using Amazon CodeGuru
• Use a code profiler: Use a code profiler to identify the areas of code that use the most time or
resources as targets for optimization (a generic profiling sketch follows this list).
• Reducing your organization's carbon footprint with Amazon CodeGuru Profiler
• Understanding memory usage in your Java application with Amazon CodeGuru Profiler
• Improving customer experience and reducing cost with Amazon CodeGuru Profiler
• Monitor and optimize: Use continuous monitoring resources to identify components with high
resource requirements or suboptimal configuration.
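As a generic profiling sketch using Python's built-in cProfile module (not Amazon CodeGuru Profiler); hot_path() is a placeholder for the code you want to examine.

```python
import cProfile
import pstats

def hot_path():
    # Placeholder for the workload code under investigation.
    return sum(i * i for i in range(1_000_000))

profiler = cProfile.Profile()
profiler.enable()
hot_path()
profiler.disable()

# Print the ten functions with the highest cumulative time as optimization targets.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```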
Benefits of establishing this best practice: Implementing software patterns and features that are
optimized for customer devices can reduce the overall environmental impact of your cloud workload.
Implementation guidance
Implementing software patterns and features that are optimized for customer devices can reduce
the environmental impact in several ways:
• Implementing new features that are backward compatible can reduce the number of hardware
replacements.
• Optimizing an application to run efficiently on devices can help to reduce their energy
consumption and extend their battery life (if they are powered by battery).
• Optimizing an application for devices can also reduce the data transfer over the network.
Understand the devices and equipment used in your architecture, their expected lifecycle, and the
impact of replacing those components. Implement software patterns and features that can help
to minimize device energy consumption and reduce the need for customers to replace devices or
upgrade them manually.
Implementation steps
• Conduct an inventory: Inventory the devices used in your architecture. Devices can be mobile
phones, tablets, IoT devices, smart lights, or even smart devices in a factory.
• Use energy-efficient devices: Consider using energy-efficient devices in your architecture. Use
power management configurations on devices to enter low power mode when not in use.
• Run efficient applications: Optimize the application running on the devices:
• Use strategies such as running tasks in the background to reduce their energy consumption.
• Account for network bandwidth and latency when building payloads, and implement
capabilities that help your applications work well on low bandwidth, high latency links.
• Convert payloads and files into optimized formats required by devices. For example, you
can use Amazon Elastic Transcoder or AWS Elemental MediaConvert to convert large, high
quality digital media files into formats that users can play back on mobile devices, tablets, web
browsers, and connected televisions.
SUS03-BP05 Use software patterns and architectures that best support data access and storage
patterns
Understand how data is used within your workload, consumed by your users, transferred, and
stored. Use software patterns and architectures that best support data access and storage to
minimize the compute, networking, and storage resources required to support the workload.
Common anti-patterns:
• You assume that all workloads have similar data storage and access patterns.
• You only use one tier of storage, assuming all workloads fit within that tier.
• You assume that data access patterns will stay consistent over time.
• Your architecture supports a potential high data access burst, which results in the resources
remaining idle most of the time.
Benefits of establishing this best practice: Selecting and optimizing your architecture based on
data access and storage patterns will help decrease development complexity and increase overall
utilization. Understanding when to use global tables, data partitioning, and caching will help you
decrease operational overhead and scale based on your workload needs.
Implementation guidance
Use software and architecture patterns that align best with your data characteristics and access
patterns. For example, use a modern data architecture on AWS, which allows you to use purpose-
built services optimized for your unique analytics use cases. These architecture patterns allow for
efficient data processing and reduce resource usage.
Implementation steps
• Analyze your data characteristics and access patterns to identify the correct configuration for
your cloud resources. Key characteristics to consider include:
• Data type: structured, semi-structured, unstructured
• Data growth: bounded, unbounded
• Data durability: persistent, ephemeral, transient
• Access patterns: reads or writes, update frequency, spiky or consistent
• Use architecture patterns that best support data access and storage patterns.
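As one hedged example of matching storage behavior to an unpredictable access pattern, the sketch below enables S3 Intelligent-Tiering archive tiers on a bucket so objects that stop being read move to colder tiers automatically. The bucket name, configuration ID, and day thresholds are placeholder assumptions.

import boto3

s3 = boto3.client("s3")

# For datasets with unknown or changing access patterns, Intelligent-Tiering
# moves objects between access tiers so rarely read data is not kept on the
# most resource-intensive tier. The bucket name is a placeholder.
s3.put_bucket_intelligent_tiering_configuration(
    Bucket="example-analytics-bucket",
    Id="archive-cold-data",
    IntelligentTieringConfiguration={
        "Id": "archive-cold-data",
        "Status": "Enabled",
        "Tierings": [
            {"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},
            {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"},
        ],
    },
)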
Resources
Related videos:
• AWS re:Invent 2023 - Optimizing storage price and performance with Amazon S3
• AWS re:Invent 2023 - Building and optimizing a data lake on Amazon S3
• AWS re:Invent 2023 - Advanced event-driven patterns with Amazon EventBridge
Related examples:
Data
Question
• SUS 4 How do you take advantage of data management policies and patterns to support your
sustainability goals?
SUS 4 How do you take advantage of data management policies and patterns to
support your sustainability goals?
Implement data management practices to reduce the provisioned storage required to support your
workload, and the resources required to use it. Understand your data, and use storage technologies
and configurations that more effectively support the business value of the data and how it’s used.
Lifecycle data to more efficient, less performant storage when requirements decrease, and delete
data that’s no longer required.
Best practices
• SUS04-BP01 Implement a data classification policy
• SUS04-BP02 Use technologies that support data access and storage patterns
• SUS04-BP03 Use policies to manage the lifecycle of your datasets
• SUS04-BP04 Use elasticity and automation to expand block storage or file system
• SUS04-BP05 Remove unneeded or redundant data
• SUS04-BP06 Use shared file systems or storage to access common data
• SUS04-BP07 Minimize data movement across networks
• Periodically review: Periodically review and audit your environment for untagged and
unclassified data. Use automation to identify this data, and classify and tag the data
appropriately. As an example, see Data Catalog and crawlers in AWS Glue; a minimal sketch follows this list.
• Establish a data catalog: Establish a data catalog that provides audit and governance
capabilities.
• Documentation: Document data classification policies and handling procedures for each data
class.
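The following is a minimal sketch of the AWS Glue crawler approach referenced above: a scheduled crawler discovers new datasets under an S3 prefix and registers them in the Data Catalog so they can be classified and tagged. The crawler name, IAM role, database, bucket path, and schedule are placeholder assumptions.

import boto3

glue = boto3.client("glue")

# Crawl an S3 prefix on a schedule so new or untagged datasets are discovered
# and registered in the Data Catalog automatically. All names are placeholders.
glue.create_crawler(
    Name="sustainability-data-classification",
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",
    DatabaseName="data_classification_catalog",
    Targets={"S3Targets": [{"Path": "s3://example-data-lake/raw/"}]},
    Schedule="cron(0 2 * * ? *)",   # daily at 02:00 UTC
)
glue.start_crawler(Name="sustainability-data-classification")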
Resources
Related documents:
Related videos:
SUS04-BP02 Use technologies that support data access and storage patterns
Use storage technologies that best support how your data is accessed and stored to minimize the
resources provisioned while supporting your workload.
Common anti-patterns:
• You assume that all workloads have similar data storage and access patterns.
• You only use one tier of storage, assuming all workloads fit within that tier.
• You assume that data access patterns will stay consistent over time.
Benefits of establishing this best practice: Selecting and optimizing your storage technologies
based on data access and storage patterns will help you reduce the cloud resources required to
meet your business needs and improve the overall efficiency of your cloud workload.
Resources
Related documents:
Related videos:
• AWS re:Invent 2023 - Improve Amazon EBS efficiency and be more cost-efficient
• AWS re:Invent 2023 - Optimizing storage price and performance with Amazon S3
• AWS re:Invent 2023 - Building and optimizing a data lake on Amazon S3
• AWS re:Invent 2022 - Building modern data architectures on AWS
• AWS re:Invent 2022 - Modernize apps with purpose-built databases
• AWS re:Invent 2022 - Building data mesh architectures on AWS
• AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations
• AWS re:Invent 2023 - Advanced data modeling with Amazon DynamoDB
Related examples:
• Amazon S3 Examples
• AWS Purpose Built Databases Workshop
• Databases for Developers
• AWS Modern Data Architecture Immersion Day
• Build a Data Mesh on AWS
SUS04-BP03 Use policies to manage the lifecycle of your datasets
Manage the lifecycle of all of your data and automatically enforce deletion to minimize the total
storage required for your workload.
The following services provide built-in lifecycle capabilities:
• Amazon Elastic Block Store: You can use Amazon Data Lifecycle Manager to automate the
creation, retention, and deletion of Amazon EBS snapshots and Amazon EBS-backed AMIs.
• Amazon Elastic Container Registry: Amazon ECR lifecycle policies automate the cleanup of your
container images by expiring images based on age or count.
• AWS Elemental MediaStore: You can use an object lifecycle policy that governs how long objects
should be stored in the MediaStore container.
• Delete unused volumes, snapshots, and data that is out of its retention period. Leverage native
service features like Amazon DynamoDB Time To Live or Amazon CloudWatch log retention for
deletion.
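A minimal sketch of the native deletion features named above, assuming a hypothetical table, TTL attribute, and log group; once configured, the services expire data on their own instead of relying on cleanup jobs.

import boto3

# Expire DynamoDB items automatically once their TTL attribute has passed.
# The table and attribute names are placeholders.
dynamodb = boto3.client("dynamodb")
dynamodb.update_time_to_live(
    TableName="session-state",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)

# Cap CloudWatch Logs retention so log data is deleted after 30 days.
# The log group name is a placeholder.
logs = boto3.client("logs")
logs.put_retention_policy(logGroupName="/app/example-service", retentionInDays=30)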
Resources
Related documents:
• Optimize your Amazon S3 Lifecycle rules with Amazon S3 Storage Class Analysis
Related videos:
Resources
Related documents:
Related videos:
SUS04-BP05 Remove unneeded or redundant data
Remove unneeded or redundant data to minimize the storage resources required to store your
datasets.
Common anti-patterns:
• Use data virtualization capabilities on AWS to maintain data at its source and avoid data
duplication.
• Cloud Native Data Virtualization on AWS
• Optimize Data Pattern Using Amazon Redshift Data Sharing
• Use backup technology that can make incremental backups.
• Leverage the durability of Amazon S3 and replication of Amazon EBS to meet your durability
goals instead of self-managed technologies (such as a redundant array of independent disks
(RAID)).
• Centralize log and trace data, deduplicate identical log entries, and establish mechanisms to tune
verbosity when needed.
• Pre-populate caches only where justified.
• Establish cache monitoring and automation to resize the cache accordingly.
• Remove out-of-date deployments and assets from object stores and edge caches when pushing
new versions of your workload.
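Building on the last step, the following is a minimal sketch that removes a superseded release from an origin bucket and invalidates the matching paths in an Amazon CloudFront distribution so stale copies are not retained at the edge. The bucket, prefix, and distribution ID are placeholder assumptions.

import boto3
import time

s3 = boto3.client("s3")
cloudfront = boto3.client("cloudfront")

BUCKET = "example-static-assets"        # placeholder bucket
OLD_PREFIX = "assets/v1/"               # superseded release prefix
DISTRIBUTION_ID = "EDFDVBD6EXAMPLE"     # placeholder distribution ID

# Delete the superseded release from the origin bucket.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=OLD_PREFIX):
    objects = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
    if objects:
        s3.delete_objects(Bucket=BUCKET, Delete={"Objects": objects})

# Evict the same paths from edge caches so they are not served or re-fetched.
cloudfront.create_invalidation(
    DistributionId=DISTRIBUTION_ID,
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/assets/v1/*"]},
        "CallerReference": str(time.time()),
    },
)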
Resources
Related documents:
Related videos:
Related examples:
• Copy data to or fetch data from shared file systems only as needed. As an example, you can
create an Amazon FSx for Lustre file system backed by Amazon S3 and only load the subset of
data required for processing jobs to Amazon FSx (a minimal sketch follows this list).
• Delete data as appropriate for your usage patterns as outlined in SUS04-BP03 Use policies to
manage the lifecycle of your datasets.
• Detach volumes from clients that are not actively using them.
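The sketch referenced above creates a scratch FSx for Lustre file system linked to an S3 prefix so processing jobs lazy-load only the objects they actually read, instead of copying the whole dataset. The subnet, security group, bucket path, and capacity values are placeholder assumptions.

import boto3

fsx = boto3.client("fsx")

# Create a scratch Lustre file system linked to an S3 prefix; data is loaded
# on first access rather than copied in full. All IDs and paths are placeholders.
fsx.create_file_system(
    FileSystemType="LUSTRE",
    StorageCapacity=1200,                      # GiB, smallest SCRATCH_2 size
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    LustreConfiguration={
        "DeploymentType": "SCRATCH_2",
        "ImportPath": "s3://example-dataset-bucket/training-subset/",
    },
)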
Resources
Related documents:
Related videos:
• Use services that can help you run code closer to users of your workload.
Resources
Related documents:
• Amazon CloudFront Key Features including the CloudFront Global Edge Network
Related videos:
Related examples:
Resources
Related best practices:
• REL09-BP01 Identify and back up all data that needs to be backed up, or reproduce the data
from sources
• REL09-BP03 Perform data backup automatically
• REL13-BP02 Use defined recovery strategies to meet the recovery objectives
Related documents:
• Using AWS Backup to back up and restore Amazon EFS file systems
• Amazon EBS snapshots
• Working with backups on Amazon Relational Database Service
• APN Partner: partners that can help with backup
• AWS Marketplace: products that can be used for backup
• Backing Up Amazon EFS
• Backing Up Amazon FSx for Windows File Server
• Backup and Restore for Amazon ElastiCache (Redis OSS)
Related videos:
• AWS re:Invent 2023 - Backup and disaster recovery strategies for increased resilience
• AWS re:Invent 2023 - What's new with AWS Backup
• AWS re:Invent 2021 - Backup, disaster recovery, and ransomware protection with AWS
Related examples:
use rightsizing guidelines from AWS tools to efficiently operate your cloud resources and meet your
business needs.
Implementation steps
• Choose the instance type: Choose the right instance type to best fit your needs. To learn about
how to choose Amazon Elastic Compute Cloud instances and use mechanisms such as attribute-
based instance selection, see the following:
• How do I choose the appropriate Amazon EC2 instance type for my workload?
• Attribute-based instance type selection for Amazon EC2 Fleet.
• Create an Auto Scaling group using attribute-based instance type selection.
• Scale: Use small increments to scale variable workloads.
• Use multiple compute purchase options: Balance instance flexibility, scalability, and cost
savings with multiple compute purchase options.
• Amazon EC2 On-Demand Instances are best suited for new, stateful, and spiky workloads that
cannot be flexible in instance type, location, or time.
• Amazon EC2 Spot Instances are a great way to supplement the other options for applications
that are fault tolerant and flexible.
• Leverage Compute Savings Plans for steady-state workloads; Savings Plans give you flexibility if
your needs (like AZ, Region, instance families, or instance types) change.
• Use instance and Availability Zone diversity: Maximize application availability and take
advantage of excess capacity by diversifying your instances and Availability Zones.
• Rightsize instances: Use the rightsizing recommendations from AWS tools to make
adjustments on your workload. For more information, see Optimizing your cost with Rightsizing
Recommendations and Right Sizing: Provisioning Instances to Match Workloads.
• Use rightsizing recommendations in AWS Cost Explorer or AWS Compute Optimizer to identify
rightsizing opportunities (see the sketch after this list).
• Negotiate service-level agreements (SLAs): Negotiate SLAs that permit temporarily reducing
capacity while automation deploys replacement resources.
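The sketch referenced above is one hedged way to pull rightsizing opportunities programmatically from AWS Compute Optimizer. It assumes the account has already opted in to Compute Optimizer, and it prints only a subset of the response fields.

import boto3

compute_optimizer = boto3.client("compute-optimizer")

# List each instance with a recommendation and the top suggested instance type.
# Compute Optimizer must already be enabled for the account.
response = compute_optimizer.get_ec2_instance_recommendations()
for rec in response["instanceRecommendations"]:
    options = rec.get("recommendationOptions", [])
    if options:
        print(f"{rec['instanceArn']}: {rec['currentInstanceType']} "
              f"({rec['finding']}) -> suggested {options[0]['instanceType']}")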
Resources
Related documents:
Implementation guidance
Using efficient instances in your cloud workload is crucial for lowering resource usage and
improving cost-effectiveness. Continually monitor the release of new instance types and take
advantage of energy efficiency improvements, including those instance types designed to support
specific workloads such as machine learning training and inference, and video transcoding.
Implementation steps
• Learn and explore instance types: Find instance types that can lower your workload's
environmental impact.
• Subscribe to What's New with AWS to stay up-to-date with the latest AWS technologies and
instances.
• Learn about AWS Graviton-based instances, which offer the best performance per watt of
energy use in Amazon EC2, by watching re:Invent 2020 - Deep dive on AWS Graviton2
processor-powered Amazon EC2 instances and Deep dive into AWS Graviton3 and Amazon EC2
C7g instances.
• Use instance types with the least impact: Plan and transition your workload to instance types
with the least impact.
• Define a process to evaluate new features or instances for your workload. Take advantage
of agility in the cloud to quickly test how new instance types can improve your workload
environmental sustainability. Use proxy metrics to measure how many resources it takes you to
complete a unit of work.
• If possible, modify your workload to work with different numbers of vCPUs and different
amounts of memory to maximize your choice of instance type.
• Consider selecting the AWS Graviton option in your usage of AWS managed services.
• Migrate your workload to Regions that offer instances with the least sustainability impact and
still meet your business requirements.
• For machine learning workloads, take advantage of purpose-built hardware that is specific to
your workload such as AWS Trainium, AWS Inferentia, and Amazon EC2 DL1. AWS Inferentia
Related videos:
• AWS re:Invent 2023 - New Amazon Elastic Compute Cloud generative AI capabilities in AWS
Management Console
• AWS re:Invent 2023 - What's new with Amazon Elastic Compute Cloud
• AWS re:Invent 2023 - Smart savings: Amazon Elastic Compute Cloud cost-optimization strategies
• AWS re:Invent 2021 - Deep dive into AWS Graviton3 and Amazon EC2 C7g instances
• AWS re:Invent 2022 - Build a cost-, energy-, and resource-efficient compute environment
Related examples:
• Solution: Guidance for Optimizing Deep Learning Workloads for Sustainability on AWS
Common anti-patterns:
• You use Amazon EC2 instances with low utilization to run your applications.
• Your in-house team only manages the workload, without time to focus on innovation or
simplifications.
• You deploy and maintain technologies for tasks that can run more efficiently on managed
services.
• Using managed services shifts the responsibility to AWS, which has insights across millions of
customers that can help drive new innovations and efficiencies.
• Managed services distribute the environmental impact of the service across many users because
of their multi-tenant control planes.
5. Replace self-hosted services: Use your migration plan to replace self-hosted services with
managed service.
6. Monitor and adjust: Continually monitor the service after the migration is complete to make
adjustments as required and optimize the service.
Resources
Related documents:
Related videos:
• AWS re:Invent 2021 - Cloud operations at scale with AWS Managed Services
• AWS re:Invent 2023 - Best practices for operating on AWS
Optimize your use of accelerated computing instances to reduce the physical infrastructure
demands of your workload.
Common anti-patterns:
Benefits of establishing this best practice: By optimizing the use of hardware-based accelerators,
you can reduce the physical-infrastructure demands of your workload.
Resources
Related documents:
• Choose the best AI accelerator and model compilation for computer vision inference with
Amazon SageMaker
Related videos:
• AWS re:Invent 2021 - How to select Amazon EC2 GPU instances for deep learning
• AWS re:Invent 2022 - [NEW LAUNCH!] Introducing AWS Inferentia2-based Amazon EC2 Inf2
instances
• AWS re:Invent 2022 - Accelerate deep learning and innovate faster with AWS Trainium
• AWS re:Invent 2022 - Deep learning on AWS with NVIDIA: From training to deployment
Look for opportunities to reduce your sustainability impact by making changes to your
development, test, and deployment practices.
Best practices
• Streamline the process: Continually improve and streamline your development processes. As an
example, automate your software delivery process using continuous integration and delivery (CI/
CD) pipelines to test and deploy potential improvements, reducing the level of effort and limiting
errors caused by manual processes.
• Training and awareness: Run training programs for your team members to educate them about
sustainability and how their activities impact your organizational sustainability goals.
• Assess and adjust: Continually assess the impact of improvements and make adjustments as
needed.
Resources
Related documents:
Related videos:
Related examples:
• Well-Architected Lab - Turning cost & usage reports into efficiency reports
Keep your workload up-to-date to adopt efficient features, remove issues, and improve the overall
efficiency of your workload.
Common anti-patterns:
• You assume your current architecture is static and will not be updated over time.
• Use automation: Automate updates to reduce the level of effort to deploy new features and
limit errors caused by manual processes.
• You can use CI/CD to automatically update AMIs, container images, and other artifacts related
to your cloud application.
• You can use tools such as AWS Systems Manager Patch Manager to automate the process of
system updates, and schedule the activity using AWS Systems Manager Maintenance Windows.
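A minimal sketch of the Maintenance Windows step above, assuming a hypothetical weekly schedule and a tag-based target group; the actual patch task (for example, AWS-RunPatchBaseline) would be registered against the window separately.

import boto3

ssm = boto3.client("ssm")

# Create a weekly window during which patching tasks are allowed to run.
# The schedule, duration, and tag values are placeholders.
window = ssm.create_maintenance_window(
    Name="weekly-patching",
    Schedule="cron(0 4 ? * SUN *)",   # Sundays at 04:00 UTC
    Duration=3,                        # window length in hours
    Cutoff=1,                          # stop starting new tasks 1 hour before close
    AllowUnassociatedTargets=False,
)

# Register the instances to patch, selected by tag.
ssm.register_target_with_maintenance_window(
    WindowId=window["WindowId"],
    ResourceType="INSTANCE",
    Targets=[{"Key": "tag:PatchGroup", "Values": ["web-fleet"]}],
)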
Resources
Related documents:
Related videos:
• AWS re:Invent 2022 - Optimize your AWS workloads with best-practice guidance
Related examples:
• Maximize utilization: Use strategies to maximize the utilization of development and test
environments.
• Use minimum viable representative environments to develop and test potential improvements.
• Use instance types with burst capacity, Spot Instances, and other technologies to align build
capacity with use (a minimal sketch follows this list).
• Adopt native cloud services for secure instance shell access rather than deploying fleets of
bastion hosts.
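The sketch referenced above launches a short-lived Spot Instance as a build agent only while work is queued, rather than keeping an always-on build fleet. The AMI, instance type, and subnet are placeholder assumptions, and the caller would terminate the instance when the build completes.

import boto3

ec2 = boto3.client("ec2")

# Launch a one-time Spot build agent; terminate it when the build finishes.
# The AMI, instance type, and subnet are placeholders.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="c7g.large",
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-0123456789abcdef0",
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            "SpotInstanceType": "one-time",
            "InstanceInterruptionBehavior": "terminate",
        },
    },
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Purpose", "Value": "ci-build-agent"}],
    }],
)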
Resources
Related documents:
Related videos:
Use managed device farms to efficiently test a new feature on a representative set of hardware.
Common anti-patterns:
• You manually test and deploy your application on individual physical devices.
• You do not use an app testing service to test and interact with your apps (for example, Android,
iOS, and web apps) on real, physical devices.
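In contrast with the anti-patterns above, the following is a minimal sketch that schedules a test run on AWS Device Farm against a curated device pool instead of a manually maintained set of physical devices. The project, app upload, and device pool ARNs are placeholders, and the built-in fuzz test is used only to keep the example self-contained.

import boto3

# Device Farm is available in us-west-2.
devicefarm = boto3.client("devicefarm", region_name="us-west-2")

# Run an already-uploaded app against a device pool. All ARNs are placeholders.
devicefarm.schedule_run(
    projectArn="arn:aws:devicefarm:us-west-2:111122223333:project:EXAMPLE",
    appArn="arn:aws:devicefarm:us-west-2:111122223333:upload:EXAMPLE-APP",
    devicePoolArn="arn:aws:devicefarm:us-west-2:111122223333:devicepool:EXAMPLE-POOL",
    name="release-candidate-smoke-test",
    test={"type": "BUILTIN_FUZZ"},
)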
Related videos:
• AWS re:Invent 2023 - Improve your mobile and web app quality using AWS Device Farm
• AWS re:Invent 2021 - Optimize applications through end user insights with Amazon CloudWatch
RUM
Related examples:
AWS Glossary
For the latest AWS terminology, see the AWS glossary in the AWS Glossary Reference.