Observability Fundamentals
Student Lab Guide
Table of Labs
1. Lab 0: Log in to the training environment
2. Lab 1: Finding the root cause of latency or outages using the Root
Cause Explorer (RCE)
3. Lab 2: Tracing in Sumo Coffee App
4. Lab 3: When are Logs the best bet in Observability?
5. Lab 4: Working with Logs
Lab 0: Log in to the training environment
The training lab environment is separate from your other accounts. So, you'll need to create another login. To access the training lab environment:
If you're reading this on Chrome, open a Firefox window. Using separate browsers will keep you
logged in to your regular Sumo Logic account and the training lab environment at the same
time. If you don't have two different browsers on your machine, you can open a Private or
Incognito window instead.
In one browser, log in to your own account. Use this browser to navigate the lab activities, and
take the exam. This way, you'll get credit for taking the course and passing the exam. Make sure
you're logged in as yourself by clicking the profile icon in the upper right corner.
In another browser, log in to a Sumo Logic training lab environment. You can explore this
environment freely, without impacting your dev, prod, or trial environments. Our lab activities are
designed for the training lab environment, and may not work in other Sumo Logic accounts.
Note: The training accounts are public. Be careful what kind of personal information you share,
like your real name or email address. These training accounts are wiped clean weekly, so make
sure you complete any lab activities and save any data you want to keep, as you may not be
able to recover it later.
Lab 1: Finding the root cause of latency or outages using the Root
Cause Explorer (RCE)
In this lab, you'll learn how metrics-based dashboards are instrumental in identifying the root cause of an outage. You'll use the Root Cause Explorer, a powerful Observability feature of Sumo Logic.
Note: All labs assume you’re using a training+analyst###@sumologic.com account. The data
you see may vary depending on your environment if you’re using your own credentials or a
Sumo Logic trial account instead of a training account. To access a training account, review Lab
0.
Here's the scenario you'll troubleshoot to isolate the root cause.
TravelLogic is an app that lets travellers book flights online and provides allied services, including hotel booking, sightseeing, and visits to well-known places at the destination.
You, a DevOps specialist, are supporting the TravelLogic app on AWS, which uses the following AWS services:
● ELB (Application Load Balancer)
● Host/EC2
The TravelLogic app is experiencing unusual latency when loading its web page, sometimes resulting in outages in us-east-1. You need to find the root cause of the latency and outages ASAP.
Alternatively, if you're comfortable applying filters to find your events of interest, you can click +New and select Root Cause to open the Root Cause Explorer directly, and then continue with Step 7.
2. In the top left corner of the screen that displays, click under Explore by, and select AWS
Observability.
3. In the left pane, click the us-east-1 region under the Prod account.
5. To open the Events of Interest (EOI) dashboard in the Root Cause Explorer (RCE), click the three vertical dots on the right of the EOI screen.
6. You can see the most recent outage, an ALB 5XX error event, plotted on the screen.
When you hover over the circles, you can see the EOI stats. One of the readings says y% for x min, which indicates that the value of the metric has drifted y percent from its expected value for x minutes (for example, 150% for 10 min means the metric deviated from its expected value by 150 percent for 10 minutes). The extent of drift from the expected value indicates how anomalous the event is.
7. However, if you hover your mouse slightly above the 5XX error, you can see that the EC2 CPU Load bottleneck event is plotted slightly earlier, and when you move further up, you'll notice the DynamoDB throttle event plotted earlier than the CPU Load bottleneck.
10. Now that you realize that the DynamoDB throttling caused the outage, you want to find
out how and why. Click the DynamoDB event plot. On the right side of the screen, in the
Entity Inspector panel, check the details that the Summary tab displays. You can see a
spike in the metrics chart plotted below in the Summary tab.
12. Click the Open In button in the Entity section, and select Entity Dashboard, to open the
relevant entity dashboard.
14. The dashboard displays a number of panels including No. of Errors, Top 5 Events, and
Events over Time. You can see the number of errors and the events that occurred on the
dashboard including Delete Table, Describe Table, and Update Table.
16. When you open the events in Search mode, you see the aggregate events, where you can see that someone has updated the table. Click Messages to drill down into the logs (a sample query sketch appears at the end of this lab).
18. When you go through the update table log events, you realize that someone named
Mike Man changed the read/write parameter (IOPS).
Cost optimization was most likely the developer's motivation for making the change, as AWS charges for DynamoDB based on provisioned Read/Write Capacity Units. Once you reconfigure the IOPS, the issue should be resolved.
Note: The developer 'Mike Man' who configured the DynamoDB table is just an example. The developer name and the IOPS changes may vary when you perform the lab exercise.
So here, you find that the real root cause of the DynamoDB throttling is a change to the Provisioned IOPS setting of a table. Lowering this setting reduces AWS costs, but it can also lead to throttling. Such a configuration change might be evident in the AWS CloudTrail logs associated with DynamoDB, so updating the IOPS value to the allowed range will resolve the issue.
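For reference, a log search along the lines of the following sketch would surface the CloudTrail table-update events you drilled into in steps 16-18. The source category is borrowed from the later labs in this guide, and the parsed field names are assumptions about the CloudTrail record layout rather than part of this lab's exercise:
// Hypothetical sketch: surfaces DynamoDB UpdateTable events in CloudTrail logs
_sourceCategory=Labs/AWS/CloudTrail
| json field=_raw "eventSource" as eventSource
| json field=_raw "eventName" as eventName
| json field=_raw "userIdentity.userName" as userName nodrop
| where eventSource matches "dynamodb.amazonaws.com" and eventName = "UpdateTable"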
Lab 2: Tracing in Sumo Coffee App
With more and more modern apps using microservices architecture, it's difficult to fully understand the actual cause of an outage due to the dynamic nature of the architecture.
Note: All labs assume you’re using a training+analyst###@sumologic.com account. The data
you see may vary depending on your environment if you’re using your own credentials or a
Sumo Logic trial account instead of a training account. To access a training account, review Lab
0.
Here's the scenario you'll troubleshoot to isolate the root cause.
Welcome to the Sumo Logic Coffee app, which delivers a hot, fast cup of coffee in thousands of locations globally. The app runs on a Kubernetes (K8s) platform with a distributed architecture. It uses several microservices, such as user login, order registration, water service, coffee bean service, and payment service.
Typically, a customer orders coffee from the Sumo Logic Coffee app and the barista turns on the coffee machine. The machine service connects with the water service to check whether an adequate water level is available. The coffee service is then initiated to prepare the coffee. At the same time, a payment request is sent to the payment system, where SQL commands are executed and a bill is generated after the bill amount is calculated for the coffee service.
The duration, in milliseconds or microseconds, is captured as each span of the transaction completes. With the Span ID and Trace ID, you can track a transaction across the multiple microservices the app uses.
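As a rough illustration of how those identifiers can be used in a log search (the source category and field names below are assumptions, not the app's actual schema), you could group span records by trace to see how many spans each transaction produced:
// Hypothetical sketch: field names and source category are assumptions
_sourceCategory=labs/the-coffee-bar-app
| json field=_raw "trace_id" as trace_id nodrop
| json field=_raw "span_id" as span_id nodrop
| where !isNull(trace_id)
| count by trace_id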
Abel, a coffee lover, orders coffee at 8:45 am using the Coffee app. However, his transaction is unsuccessful. You need to trace the transaction, troubleshoot, and address the issue ASAP.
1. Begin with the Service Map. To open a service map, in the Sumo UI, click +New, and
then Service Map.
Filter for Application Name= the-coffee-bar-app
3. Mouse over one of the services showing anomalies to see more details:
Note: Alternatively, near the top of the Sumo Logic UI, you can click +New > Traces.
The Traces page opens for the selected service. The window shows a list of traces related to that service, displayed in a table.
Select the first trace in the table; alternatively, you can filter the table by the Number of Errors column.
7. Click the /get_coffee operation to find out what happened to our coffee order.
Note: If you selected a different service, the operation will also have a different name. For example, the water service uses get_water.
8. Click the first Log search to open the logs related to the Span ID.
10. Inform the barista about the particular error so the problem can be rectified.
Optional: Copy the Span ID and open it in the Span Analytics tab to search and look further into the metadata related to the span (see the sketch below).
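As a reference for step 8 and the optional step above, a log search along these lines would pull the log messages carrying a given Span ID. The source category, the parsed field name, and the span ID value are all placeholders, not values from this lab:
// Hypothetical sketch: replace the source category and span ID with your own values
_sourceCategory=labs/the-coffee-bar-app
| json field=_raw "span_id" as span_id nodrop
| where span_id = "0123456789abcdef"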
Lab 3: When are Logs the best bet in Observability?
In this lab, you'll learn that when alerted to an incident, diving deep into the logs can help you troubleshoot and resolve an issue.
As an SRE, you've received an alert about an incident. End users have been experiencing an outage in the coffee app within the us-east-1 region over the last 24 hours. The alert indicates that the problem is with Lambda. You need to start by looking at when all of this began. What metrics are you seeing, and when did this happen?
2. In the top left corner of the screen that displays, click Explore by, and select
AWS Observability.
3. In the left pane, click the us-east-1 region under the Prod account.
If you scroll down, you can compare function durations to identify the functions that take the longest to run and those that consume the most memory. Using the Success and Failure panel, you can see that the get-billing_info function has recorded the highest number of errors.
6. In the Most Frequent Function Operations panel, you can see an UpdateFunctionConfiguration operation, so a user changed a function configuration, which is causing the errors in the get_billing_information function. Let's investigate: click Open in Search to open the related log events in Search mode.
8. Apply the filter using the field browser: click event_name and select UpdateFunctionConfiguration (an equivalent query sketch follows).
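For reference, the field-browser filter in step 8 is roughly equivalent to adding a where clause to the query itself. The sketch below assumes the CloudTrail source category used elsewhere in this guide and an event_name field parsed from the logs:
// Hypothetical sketch: equivalent of selecting event_name in the field browser
_sourceCategory=Labs/AWS/CloudTrail
| json field=_raw "eventName" as event_name
| where event_name = "UpdateFunctionConfiguration"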
Lab 4: Working with Logs
In this lab, you'll learn how to dive deep to filter and analyze logs and explore the LogReduce and LogCompare operators.
2. Select the time range Last 60 mins and click Start to execute the query.
Use the field browser to filter the log entries as per your requirement.
LogReduce Operator:
The LogReduce algorithm uses fuzzy logic to group messages together based on string and
pattern similarity. You can use the logreduce operator to quickly assess activity patterns for
things like a range of devices or traffic on a website. Focus the LogReduce algorithm on an area
of interest by defining that area in the keyword expression.
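For example, a minimal sketch of focusing LogReduce with a keyword expression might look like the following. The keyword "error" is illustrative, and the source category is the one used later in this lab:
// Hypothetical sketch: the keyword narrows the messages LogReduce clusters
_sourceCategory=Labs/AWS/CloudTrail error
| logreduce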
1. Modify our query to remove the 'count by' clause because, remember, the logreduce operator cannot be used with group-by operators such as "count by field".
_sourceCategory=Labs/AWS/CloudTrail
| json field=_raw "awsRegion" as awsregion
4. Next, in a new search, run LogReduce on your Snort security data to identify
unusual activity (i.e. intrusions).
_sourceCategory=labs/snort
| logreduce
5. Sort your results by count to identify those that happen only once.
7. Now click on the host to view surrounding messages to identify the context of
the intrusion.
LogCompare Operator:
LogCompare allows you to compare log activity from two different time periods, providing insight into how the current time period compares to a baseline. In this case, we will use LogCompare to identify when signature messages deviate by more than 25% from the baseline.
_sourceCategory=Labs/AWS/CloudTrail
| logcompare timeshift -24h
_sourceCategory=Labs/AWS/CloudTrail
| logcompare timeshift -24h
| where abs(_deltaPercentage) > 25
3. To view results where there is a new Signature in the current time period, add a
where clause for _isNew:
_sourceCategory=Labs/AWS/CloudTrail
| logcompare timeshift -24h
| where (_isNew)
The LogExplain operator allows you to compare sets of structured logs based on
events you're interested in. Structured logs can be in JSON, CSV, key-value, or any
structured format.
You'll need to specify an event of interest as a conditional statement; this is called the Event Condition. You can also specify a condition to compare against the event-of-interest condition; this is called the Against Condition. If no Against Condition is provided, LogExplain will generate the comparison data set based on the fields in your Event Condition.
LogExplain will process your data against the specified conditions and create separate
data sets to compare:
● A control data set from normal operations data.
● An event-of-interest data set.
LogExplain gathers frequent (at least 5%) joint-column entries, such as key-value pairs, that occur more often in the event-of-interest set than in the control set. The results indicate which entities correlate with the event you're interested in.
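Before the lab's full query below, here's a minimal sketch of the Against Condition form described above. The field names, conditions, and the exact placement of the against clause are assumptions for illustration, not part of this lab's exercise:
// Hypothetical sketch: an explicit Against Condition compares denied requests with successful ones
// (field names, conditions, and the against clause are assumptions for illustration)
_sourceCategory=*cloudtrail*
| json field=_raw "errorCode" as errorCode nodrop
| json field=_raw "awsRegion" as awsRegion
| logexplain (errorCode matches "*AccessDenied*") against (isNull(errorCode)) on awsRegion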
_sourceCategory=*cloudtrail*
| json field=_raw "userIdentity.userName" as userName nodrop
| json field=_raw "userIdentity.sessionContext.sessionIssuer.userName" as userName_role nodrop
| if (isNull(userName), if(!isNull(userName_role), userName_role, "Null_UserName"), userName) as userName
| json field=_raw "eventSource" as eventSource
| json field=_raw "eventName" as eventName
| json field=_raw "awsRegion" as awsRegion
| json field=_raw "errorCode" as errorCode nodrop
| json field=_raw "errorMessage" as errorMessage nodrop
| json field=_raw "sourceIPAddress" as sourceIp nodrop
| json field=_raw "requestParameters.bucketName" as bucketName nodrop
| json field=_raw "recipientAccountId" as accountId
| where eventSource matches "s3.amazonaws.com" and accountId matches "*"
| logexplain (errorCode matches "*AccessDenied*") on sourceIp, userName, awsRegion, eventName, bucketName
● _explanation
The fields and respective values from the comparison.
● _relevance
The probability that the explanation occurs in the event-of-interest data set.
Values are 0 to 1.
● _test_coverage
The percentage of data in the event-of-interest set that has the explanation. The link opens a new search that drills down to these logs based on the related explanation.
● _control_coverage
The percentage of data in the control set that has the explanation. The link opens a new search that drills down to these logs based on the related explanation.