Chapter 3
This chapter covers the following official AWS Certified SysOps Administrator -
Associate (SOA-C02) exam domains:
(For more information on the official AWS Certified SysOps Administrator - Associate [SOA-
C02] exam topics, see the Introduction.)
As a general rule, you should consider the services, instances, and objects you deploy to
AWS to “just work.” Of course, you need to take this rule with a grain of salt because back-
end issues might cause impairment of services in an availability zone, thus impairing the
application’s functionality. However, if there is impairment on the AWS side, you can just
examine the API response and determine whether the issue is at your end or at the
provider’s end.
In this chapter, we discuss the three aspects of troubleshooting and remediation and
introduce AWS services that can help you detect and remediate issues.
ExamAlert
Remember that proper troubleshooting and remediation can be done only if you have
already set up your monitoring and log collection in advance. The tools and services
discussed in this section rely heavily on the services discussed in Chapter 2, “Monitoring
Services in AWS.” For this reason, the exam usually ties troubleshooting, monitoring, and
remediation into one question, so services from both this and the preceding chapter could be
included in a specific question in the exam.
Responding to Alarms
This section covers the following official AWS Certified SysOps Administrator - Associate
(SOA-C02) exam domains:
Domain 1: Monitoring, Logging, and Remediation
CramSaver
If you can correctly answer these questions before going through this section, save time by
skimming the Exam Alerts in this section and then completing the Cram Quiz at the end of
the section.
1. You have issued a request to download an object on an S3 bucket. Your request receives
a 403 HTTP response. What could be the cause of the bad response?
2. True or False: You need to enable the EC2 instance health monitoring first before you can
create a CloudWatch Alarm based on the state of the instance check.
Answers
1. Answer: There is an issue in the user, group, role, or bucket policy. All policies in AWS
combine with equal weight, and an explicit deny in any one policy overrides every allow and
denies the request.
2. Answer: False. EC2 instances have the automatic health check configured; health
monitoring can be used directly in CloudWatch Alarms to trigger an alert based on the health
check.
Infrastructure issues
Application issues
Security issues
Infrastructure Issues
Generally, you should follow AWS best practices and deploy any unmanaged system across
two availability zones, as discussed in Chapter 1, “Introduction to AWS.” Anytime you deploy,
you expect the infrastructure to just work. If you have a deployment issue, you can easily
resolve the issue by trying a redeployment. We call this approach the “rinse and repeat”
approach. The approach is useful both for the initial deployment as well as development
testing and upgrades.
You can also set up infrastructure triggers in CloudWatch Alarms to try to prevent any
possible issues with the infrastructure or detect any unusual metric and try to preemptively
remediate. For example, you can monitor the state of these instances, and if any health
issues are detected, you can perform automatic remediation. Many different factors can
constitute infrastructure health issues, including but not limited to
EC2 instance health check failure: All instances have an automatic health check
configured. This can be monitored with CloudWatch, and you can create an alarm that
informs you of any issues of this type.
Change in number of EC2 instances: An availability zone failure could cause the
number of reachable instances in an EC2 environment to drop suddenly. You can track
the number of active instances and perform remediation. This is usually done through
autoscaling. The number could also increase dramatically due to a runaway automation
script. That is also an important factor to monitor and alert on.
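As a minimal sketch of the first item in the list above, the following Python function builds the arguments for a CloudWatch alarm on the system status check of one instance. It assumes the boto3 SDK would be used to submit the alarm (the call is shown only as a comment), and the alarm name, the evaluation settings, and the optional SNS topic are illustrative choices rather than recommended values.

```python
def status_check_alarm(instance_id, region, topic_arn=None):
    """Build keyword arguments for the CloudWatch PutMetricAlarm call.

    The alarm fires when the system status check fails for two
    consecutive one-minute periods (illustrative settings).
    """
    # EC2 recover action; an SNS topic can be added to notify an operator.
    actions = [f"arn:aws:automate:{region}:ec2:recover"]
    if topic_arn:
        actions.append(topic_arn)
    return {
        "AlarmName": f"status-check-failed-{instance_id}",  # hypothetical name
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 2,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": actions,
    }


# Assumed usage with boto3 (not executed here):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(
#       **status_check_alarm("i-0123456789abcdef0", "us-east-1"))
```

Keeping the argument construction separate from the API call makes the alarm definition easy to review and to test without touching an AWS account.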
ExamAlert
Infrastructure performance is correlated with the performance of the application, and the
application security is coupled with the overall data security. An infrastructure performance
alarm often can be triggered by an application or security issue. The exam focuses on a
holistic view of troubleshooting, so expect to see questions that include all levels of
troubleshooting and remediation in one.
Application Issues
After you set up the infrastructure monitoring and alarming, you need to deal with the
application layer. Tracking internal application metrics and logs, and creating alarms that
respond to them, should be done in exactly the same manner as with the infrastructure. The
application often can trigger an infrastructure issue; for example, an infinite loop in code can
cause a CPU spike to 100 percent. This means that when troubleshooting your application,
you should not expect it to “just work,” and you need to compare the aforementioned
collection of monitoring and logs to the infrastructure data to determine if the issues are
infrastructure related or stem from the application itself.
The simplest practice for metric and log collection when running your application on EC2
instances or on-premises servers is to use the CloudWatch agent. The agent can
collect data from any source within the operating system and forward that data to
CloudWatch as metric or log data. An even better approach is coding API calls to the
CloudWatch API within the application code so that the application is able to self-report
metrics regardless of the environment where it runs.
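The self-reporting approach can be sketched as follows. This is a minimal example under stated assumptions: the client object is expected to behave like a boto3 CloudWatch client, and the namespace "MyApp", the "Environment" dimension, and the metric name in the usage comment are hypothetical.

```python
from datetime import datetime, timezone


def report_metric(client, name, value, unit="Count",
                  namespace="MyApp", environment="dev"):
    """Send one custom metric data point through the CloudWatch API.

    `client` is any object exposing put_metric_data, for example
    boto3.client("cloudwatch"). Namespace and dimension are hypothetical.
    """
    datum = {
        "MetricName": name,
        "Dimensions": [{"Name": "Environment", "Value": environment}],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }
    client.put_metric_data(Namespace=namespace, MetricData=[datum])
    return datum


# Assumed usage inside application code (not executed here):
#   import boto3
#   report_metric(boto3.client("cloudwatch"), "OrdersProcessed", 1)
```

Because the client is passed in, the same function works on EC2, in a container, or on an on-premises server, as long as credentials for the CloudWatch API are available.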
Security Issues
At the top layer of the monitoring and alerting stack are security issues. These issues also
encompass a wide range of aspects that need to be determined for each application
beforehand. A range of different alerts can be configured for security issues, including but not
limited to
Large numbers of failed login attempts: These could indicate brute-force break-in
attempts to the application.
Sudden spikes in data transfer out: These could indicate a breach or data leak.
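One possible way to turn failed logins into an alarmable metric is a CloudWatch Logs metric filter. The sketch below only builds the arguments for such a filter; it assumes the application writes the phrase "login failed" to a log group that is already shipped to CloudWatch Logs, and the filter name, metric name, and namespace are hypothetical.

```python
def failed_login_filter(log_group, namespace="MyApp/Security"):
    """Build keyword arguments for the CloudWatch Logs PutMetricFilter call.

    Each log event containing the phrase "login failed" (an assumed
    log format) adds 1 to the FailedLoginCount metric.
    """
    return {
        "logGroupName": log_group,
        "filterName": "failed-logins",      # hypothetical name
        "filterPattern": '"login failed"',  # quoted phrase match
        "metricTransformations": [
            {
                "metricName": "FailedLoginCount",
                "metricNamespace": namespace,
                "metricValue": "1",
                "defaultValue": 0.0,
            }
        ],
    }


# Assumed usage with boto3 (not executed here):
#   import boto3
#   boto3.client("logs").put_metric_filter(
#       **failed_login_filter("/myapp/auth"))
```

A CloudWatch alarm on the resulting metric can then notify an operator when the count exceeds a threshold chosen for the application.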
The AWS infrastructure exposes a public HTTP API, and all calls receive either a 200-type
HTTP response if the action is accepted and will be processed, or a 400-type or 500-type
HTTP response if the request fails. All 400-type responses indicate
there is an issue with the request. All 500-type responses indicate that there is an issue with
the AWS infrastructure. In case of infrastructure issues, always make sure to repeat the
request with an exponential back-off approach, meaning that you wait for an increasingly
longer period of time before reissuing the request.
400 - bad request: Any 400 error includes a message like InvalidAction,
MessageRejected, or RequestExpired. Specific responses by some services also
indicate throttling. In case of throttling, you should retry the requests with exponential
back-off.
403 - access denied: All IAM policies apply with equal weight, and an explicit deny in one
policy denies the action across all policies. Check all the policies attached to the user,
group, or role. Check any inline policies and resource policies attached to buckets,
queues, and so on.
404 - page not found: This error indicates the object, instance, or resource specified
in the query does not exist.
500 - internal failure: This error indicates an internal error on an operational service
on the AWS side. You can retry the request immediately, and it will probably succeed on
the second try. If not, retry with exponential back-off.
503 - service unavailable: These errors are rare because they indicate a major
failure in an AWS service. You can retry your request using exponential back-off. This
way you ensure the request will succeed at some point after the issue is resolved.
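The retry guidance in the list above can be sketched as a small helper. This is a minimal illustration, not the behavior of any particular SDK: the `send` callable, the delay values, and the set of throttling error codes are assumptions, and random jitter is added to the delay as a common refinement that the text does not require.

```python
import random
import time

# Error codes treated as throttling on a 400 response (illustrative set).
THROTTLE_CODES = {"Throttling", "ThrottlingException"}


def is_retryable(status, error_code=""):
    """500-type responses and throttled 400 responses are worth retrying."""
    if status >= 500:
        return True
    return status == 400 and error_code in THROTTLE_CODES


def call_with_backoff(send, max_attempts=5, base_delay=0.5,
                      max_delay=20.0, sleep=time.sleep):
    """Call send() until it succeeds, doubling the wait ceiling each time.

    `send` is a hypothetical callable returning (http_status, error_code).
    """
    for attempt in range(max_attempts):
        status, error_code = send()
        if 200 <= status < 300:
            return status, error_code
        last_attempt = attempt == max_attempts - 1
        if last_attempt or not is_retryable(status, error_code):
            raise RuntimeError(f"request failed: HTTP {status} {error_code}")
        # Ceiling grows as base_delay * 2^attempt, capped at max_delay.
        ceiling = min(max_delay, base_delay * 2 ** attempt)
        sleep(random.uniform(0, ceiling))
```

Errors such as 403 and 404 are raised immediately because repeating the same request cannot fix a policy problem or a missing resource. The AWS SDKs generally apply a similar retry policy on your behalf, so a helper like this matters mostly when you call the HTTP API directly.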
Amazon EventBridge
This section covers the following official AWS Certified SysOps Administrator - Associate
(SOA-C02) exam domain:
CramSaver
If you can correctly answer these questions before going through this section, save time by
skimming the Exam Alerts in this section and then completing the Cram Quiz at the end of
the section.
1. In what way do CloudWatch Events and EventBridge differ from each other?
2. True or false: In AWS you can build both serverless and traditional, instance-based
applications that can respond to infrastructure, application, and third-party events.
Answers
1. Answer: EventBridge offers integration of AWS events as well as any application and
third-party provider events on the event bus. CloudWatch Events supports only AWS events by
default; however, custom event patterns can be established.
2. Answer: True. With EventBridge and Systems Manager Automation, you can build
traditional, instance-based applications and create automation scenarios that are able to
respond to real-time events from EventBridge.
Amazon EventBridge is a new service built on the same API structure as the Amazon
CloudWatch Events service. CloudWatch Events enables you to collect events from your
AWS services, instances, and objects. The EventBridge service is an evolution of the
CloudWatch Events platform and is slated to replace it entirely; at this point, the
CloudWatch Events service is still available but deprecated.
EventBridge is more than just an internal event collection platform because it enables you to
build your own serverless event bus, helping you design a seamless platform where events
from your own application can be combined with events from AWS. These events can also
trigger actions on services within AWS and your application, enabling you to build event-
driven applications at any scale. Another benefit of EventBridge is that third-party Software
as a Service (SaaS) providers are able to publish their integration to EventBridge, thus
making EventBridge a unified platform for tracking and relaying events in a diverse
environment of multiple coordinated platforms.
ExamAlert
The new exam questions are typically written to reflect the change in monitoring and possibly
mention solutions with both CloudWatch Events and EventBridge. If you find yourself with
two answers, each describing a solution based on one of the two services, consider selecting
EventBridge as the correct answer because EventBridge fully replaces
CloudWatch Events, which is being deprecated.
With EventBridge, you can create applications that emit streams in real time and create
routing rules to send your data for consumption to another service, also in real time. The
EventBridge bus also completely decouples the publisher and consumer and complies with a
loosely coupled, cloud native, serverless, event-based approach to computing.
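To illustrate the publisher side of this decoupling, here is a minimal sketch of an application placing its own event on an event bus. It assumes a client that behaves like a boto3 EventBridge client; the source name "com.example.orders", the detail type, and the payload in the usage comment are hypothetical.

```python
import json


def build_entry(source, detail_type, detail, bus_name="default"):
    """Build one entry for the EventBridge PutEvents call."""
    return {
        "Source": source,
        "DetailType": detail_type,
        "Detail": json.dumps(detail),  # Detail must be a JSON string
        "EventBusName": bus_name,
    }


def publish(client, entries):
    """Send entries to the bus and fail loudly if any were rejected.

    `client` is any object exposing put_events, for example
    boto3.client("events").
    """
    response = client.put_events(Entries=entries)
    failed = response.get("FailedEntryCount", 0)
    if failed:
        raise RuntimeError(f"{failed} event(s) were not accepted")
    return response


# Assumed usage (not executed here):
#   import boto3
#   publish(boto3.client("events"),
#           [build_entry("com.example.orders", "OrderPlaced", {"id": 42})])
```

The publisher knows only the bus. Which consumers receive the event is decided entirely by the routing rules defined on that bus.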
Chapter 7, “Provisioning Resources.”
An important part of Systems Manager is Systems Manager Automation, which allows you to
perform the following common IT tasks:
Performing complex and disruptive operations such as replacing an image for your
instances in a scalable, secure, and orchestrated manner
You can select Systems Manager as a target type when creating an EventBridge rule by
simply specifying the automation document that will be targeted based on the event pattern.
Having the ability to tie the EventBridge service with Systems Manager is invaluable in any
systems operations environment because it enables you to treat the infrastructure as a
programmatically addressable resource that can respond to events in much the same way that
serverless applications do. You therefore can create much more flexible, resilient, and
reliable infrastructure even when your application is not ready to go entirely serverless.
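A rule that targets an automation document could be sketched as follows. The functions only build the arguments for the EventBridge PutRule and PutTargets calls. The rule name, the target Id, the choice of the AWS-StartEC2Instance document, and the IAM role are assumptions made for illustration, and passing the instance ID from the event to the document (through the target's input settings) is deliberately left out.

```python
import json


def stopped_instance_rule(name):
    """Build PutRule arguments matching EC2 instances that enter 'stopped'."""
    pattern = {
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Instance State-change Notification"],
        "detail": {"state": ["stopped"]},
    }
    return {
        "Name": name,
        "EventPattern": json.dumps(pattern),
        "State": "ENABLED",
    }


def automation_target(region, account_id, document, role_arn):
    """Build one PutTargets entry pointing at an SSM Automation document.

    The ARN layout is an assumption based on the usual
    automation-definition format; verify it for your account.
    """
    arn = (f"arn:aws:ssm:{region}:{account_id}:"
           f"automation-definition/{document}:$DEFAULT")
    return {"Id": "ssm-automation", "Arn": arn, "RoleArn": role_arn}


# Assumed usage with boto3 (not executed here):
#   events = boto3.client("events")
#   events.put_rule(**stopped_instance_rule("restart-stopped"))
#   events.put_targets(Rule="restart-stopped", Targets=[automation_target(
#       "us-east-1", "111122223333", "AWS-StartEC2Instance", role_arn)])
```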
AWS Config
This section covers the following official AWS Certified SysOps Administrator - Associate
(SOA-C02) exam domains:
CramSaver
If you can correctly answer these questions before going through this section, save time by
completing the Cram Quiz at the end of the section.
1. Your organization requires you to capture a comprehensive auditable log of the state of
your AWS account over time. What would be the simplest way to capture the state for
auditing purposes?
2. What would be the easiest way to perform remediation of an issue found in AWS Config?
Answers
2. Answer: You can enable remediation directly in AWS Config if the remediation is
supported as an action for the config rule. In case there is no remediation supported, you can
create a notification to another service that will perform remediation or notify an administrator
for human intervention.
So far, we have covered how to monitor, troubleshoot, and react to alarms and events at the
infrastructure and account level. However, there is a missing aspect to the troubleshooting
and reaction story that is needed in any IT environment: the state. Capturing the state of your
application is a crucial part of tracking how your application changes over time and for
ensuring you have a manageable audit trail. Recording the state of your application
environment is also a crucial factor in determining compliance and increasing the security of
your platform over time; this is where AWS Config comes in.
With AWS Config, you can create a configuration snapshot of your environment so you can
easily assess, audit, and evaluate the state of all the AWS resources within your account or
organization. Over time, configuration snapshots can be compared against a desired state,
thus allowing you to maintain an auditable record of compliance for your application
infrastructure in AWS.
AWS Config also can detect any resource changes by continuously performing checks
against the infrastructure through preconfigured or custom AWS Config rules. When a rule is
created, you can also define a remediation action for the rule, thus enabling you to alert or
autoremediate the state of the environment when remediation is supported by AWS Config.
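As a minimal sketch of a rule with automatic remediation, the functions below build the arguments for the AWS Config PutConfigRule and PutRemediationConfigurations calls. The example assumes the managed rule that flags publicly readable S3 buckets and the AWS-DisableS3BucketPublicReadWrite automation document as the remediation; the rule name, retry settings, and IAM role are illustrative.

```python
def public_read_rule(name="s3-bucket-public-read-prohibited"):
    """Build the ConfigRule argument for a managed S3 public-read check."""
    return {
        "ConfigRuleName": name,
        "Source": {
            "Owner": "AWS",
            "SourceIdentifier": "S3_BUCKET_PUBLIC_READ_PROHIBITED",
        },
        "Scope": {"ComplianceResourceTypes": ["AWS::S3::Bucket"]},
    }


def auto_remediation(rule_name, role_arn):
    """Build one remediation configuration for the rule above."""
    return {
        "ConfigRuleName": rule_name,
        "TargetType": "SSM_DOCUMENT",
        "TargetId": "AWS-DisableS3BucketPublicReadWrite",
        "Automatic": True,
        "MaximumAutomaticAttempts": 3,  # illustrative retry settings
        "RetryAttemptSeconds": 60,
        "Parameters": {
            "AutomationAssumeRole": {"StaticValue": {"Values": [role_arn]}},
            # RESOURCE_ID is replaced by the noncompliant bucket's name.
            "S3BucketName": {"ResourceValue": {"Value": "RESOURCE_ID"}},
        },
    }


# Assumed usage with boto3 (not executed here):
#   config = boto3.client("config")
#   rule = public_read_rule()
#   config.put_config_rule(ConfigRule=rule)
#   config.put_remediation_configurations(RemediationConfigurations=[
#       auto_remediation(rule["ConfigRuleName"], role_arn)])
```

Setting "Automatic" to False would leave the remediation available as a manual action, which suits environments where a human must approve changes.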
What Next?
If you want more practice on this chapter’s exam objectives before you move on, remember
that you can access all of the Cram Quiz questions on the Pearson Test Prep software
online. You can also create a custom exam by objective with the Online Practice Test. Note
any objective you struggle with and go to that objective’s material in this chapter.