
Sumo Logic

Observability Fundamentals
Student Lab Guide

©2021 Sumo Logic, All Rights Reserved.


Disclaimer

This course material is Sumo Logic, Inc. confidential information. Sumo
Logic, Inc. provides it as-is solely for your use and assessment as an
individual partaking in the training and certification program on our
Learning Portal and in specified other venues. The course material may
not be reproduced, sold, or otherwise processed or transferred in any
other way or for any other purpose without prior written permission from
Sumo Logic. Sumo Logic owns the copyright and other intellectual
property rights in the text, graphics, information, designs, data, and other
content on this website (including exam materials and certifications as
well as audio files and their scripts) with the exception of our partners’,
licensors’, and other third parties’ trademarks and other intellectual
property. United States export control laws and regulations apply to this
material, and you agree to comply with such laws and regulations.

Observability Fundamentals Lab Guide

Table of Labs
1. Lab 0: Log in to the training environment
2. Lab 1: Finding the root cause of latency or outages using the Root
Cause Explorer (RCE)
3. Lab 2: Tracing in Sumo Coffee App
4. Lab 3: When are Logs the best bet in Observability?
5. Lab 4: Working with Logs

Lab 0: Log in to the training environment


You'll need two different browser windows open for the lab activities. In one browser, you will
be logged in as yourself, in your own account. In another browser, you'll be logged into a
Sumo Logic training lab environment.

The training lab environment is separate from your other accounts. So, you'll need to create
another login. To access the training lab environment:

1. Open a new window in a different browser or incognito window.
2. Navigate to https://service.sumologic.com in the new browser window.
3. Choose a number between 001 and 999. Remember this number, since you'll use it in all
your labs.
4. Enter training+analyst###@sumologic.com in the Email field. Replace ### with the
number you chose in Step 3.
5. Enter the Password provided to you by your instructor.

Note: The password changes monthly. You can find the current password on the training
homepage.
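The email format in Step 4 can be sketched in a few lines of Python. This is only an illustration: the function name `training_login` is hypothetical, and the three-digit zero padding is an assumption based on the 001 to 999 range given in Step 3.

```python
def training_login(number: int) -> str:
    """Build the training email for a chosen lab number (1-999)."""
    if not 1 <= number <= 999:
        raise ValueError("choose a number between 001 and 999")
    # Assumption: the number is zero-padded to three digits, so 7 becomes 007.
    return f"training+analyst{number:03d}@sumologic.com"


print(training_login(7))
```

If you chose 7, you would enter training+analyst007@sumologic.com in the Email field.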

If you're reading this on Chrome, open a Firefox window. Using separate browsers will keep you
logged in to your regular Sumo Logic account and the training lab environment at the same
time. If you don't have two different browsers on your machine, you can open a Private or
Incognito window instead.

In one browser, log in to your own account. Use this browser to navigate the lab activities, and
take the exam. This way, you'll get credit for taking the course and passing the exam. Make sure
you're logged in as yourself by clicking the profile icon in the upper right corner.

In another browser, log in to a Sumo Logic training lab environment. You can explore this
environment freely, without impacting your dev, prod, or trial environments. Our lab activities are
designed for the training lab environment, and may not work in other Sumo Logic accounts.

Note: The training accounts are public. Be careful what kind of personal information you share,
like your real name or email address. These training accounts are wiped clean weekly, so make
sure you complete any lab activities and save any data you want to keep, as you may not be
able to recover it later.

Lab 1: Finding the root cause of latency or outages using the Root
Cause Explorer (RCE)
In this lab, you'll learn how metrics-based dashboards help you identify the root cause of
an outage. You’ll use the Root Cause Explorer, a powerful Observability feature of Sumo Logic.

Note: All labs assume you’re using a training+analyst###@sumologic.com account. The data
you see may vary depending on your environment if you’re using your own credentials or a
Sumo Logic trial account instead of a training account. To access a training account, review Lab
0.

Here’s the scenario you'll troubleshoot to isolate the root cause.

TravelLogic is an app that lets travellers book their flights online and provides allied
services, including hotel booking, sightseeing, and arranging visits to well-known places at
the destination.
You, a DevOps specialist, support the TravelLogic app on AWS, which uses the following AWS
services:
● ELB (Application Load Balancer)
● Host/EC2

● DynamoDB

The TravelLogic app shows unusual latency when loading its web page, which sometimes
results in an outage in the us-east-1 region. You need to find the root cause of the latency
and outages ASAP.

The lab begins:


1. Near the top of the Sumo Logic UI, click +New > Explore.

Alternatively, if you are comfortable applying filters to see your events of interest,
you can click +New and select Root Cause to open the Root Cause Explorer directly.
You can then continue with Step 7.

2. In the top left corner of the screen that displays, click under Explore by, and select AWS
Observability.

3. In the left pane, click the us-east-1 region under the Prod account.

4. In the right pane, in the Dashboards dropdown, select AWS Region Events of Interest.
For more information on the RCE and the linked dashboards, refer to our Root Cause
Explorer documentation.

5. To open the Events of Interest (EOI) dashboard in the Root Cause Explorer (RCE), click
the three vertical dots on the right of the EOI screen.

The EOI opens in RCE.

6. You can see the most recent outage in the EC2 ALB with 5XX error plotted on the
screen.

When you hover over the circles, you can see the EOI stats. One of the readings in the
stats says y% for x min. It indicates that the value of the metric drifted y percent from
the expected value for x minutes. The extent of drift from the expected value of a metric
is classified as High, Medium, or Low. High-intensity EOIs require more attention than
the others.
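To make the "y% for x min" reading concrete, here is a minimal Python sketch of a percent-drift calculation. It is not how the Root Cause Explorer computes its scores: the function names and the 50% and 100% thresholds are hypothetical, chosen only to illustrate the idea of drift from an expected value.

```python
def percent_drift(observed: float, expected: float) -> float:
    """Percentage by which an observed metric value deviates from the expected one."""
    if expected == 0:
        raise ValueError("expected value must be non-zero")
    return abs(observed - expected) / abs(expected) * 100.0


def classify(drift_pct: float, medium: float = 50.0, high: float = 100.0) -> str:
    """Bucket a drift percentage. The thresholds are hypothetical, not Sumo Logic's."""
    if drift_pct >= high:
        return "High"
    if drift_pct >= medium:
        return "Medium"
    return "Low"


# Example: a metric expected at 150 that is observed at 450.
drift = percent_drift(observed=450.0, expected=150.0)
print(f"{drift:.0f}% drift -> {classify(drift)}")
```

In this toy example, the observed value is three times the expected one, so the drift is 200 percent.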

7. However, if you hover your mouse slightly above the 5XX error, you can see that the EC2
CPU Load bottleneck event is plotted slightly earlier. When you move further up, you’ll
notice the DynamoDB throttle event plotted earlier than the CPU Load bottleneck.

8. You can select the time range and the scatter plot will magnify. Use the magnifying glass
to zoom out.

9. When you put these events on a timeline, you realize that DynamoDB was most likely
throttled first and the rest of the events occurred as cascading effects of
the database throttling.
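The timeline reasoning in this step amounts to ordering the events by start time and looking at the earliest one. The Python sketch below illustrates this with made-up timestamps. The event names follow the lab, but the times are hypothetical and will differ from what you see in the training environment.

```python
from datetime import datetime

# Hypothetical start times for the three events discussed in the lab.
events = [
    ("ALB 5XX errors", datetime(2021, 1, 1, 10, 20)),
    ("EC2 CPU Load bottleneck", datetime(2021, 1, 1, 10, 12)),
    ("DynamoDB throttling", datetime(2021, 1, 1, 10, 5)),
]

# Sort by start time; the earliest event is the first candidate for root cause.
timeline = sorted(events, key=lambda event: event[1])
first_event = timeline[0][0]

for name, started in timeline:
    print(started.strftime("%H:%M"), name)
print("Earliest event:", first_event)
```

Being earliest does not prove causation, which is why the next steps drill into the DynamoDB event itself.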

10. Now that you realize that the DynamoDB throttling caused the outage, you want to find
out how and why. Click the DynamoDB event plot. On the right side of the screen, in the
Entity Inspector panel, check the details that the Summary tab displays. You can see a
spike in the metrics chart plotted below in the Summary tab.

11. Click the Entities tab, where you can see the relevant entities and environment
parameters listed.

12. Click the Open In button in the Entity section, and select Entity Dashboard, to open the
relevant entity dashboard.

13. Open the relevant DynamoDB Events dashboard by clicking the AWS DynamoDB -
Events option in the dropdown. The data in the dashboard is fetched using the relevant
metrics.

14. The dashboard displays a number of panels including No. of Errors, Top 5 Events, and
Events over Time. You can see the number of errors and the events that occurred on the
dashboard including Delete Table, Describe Table, and Update Table.

15. After you review the relevant metrics in the events dashboard, drill down further
into the associated logs. Scroll down to see the Top Errors, and click Open in Search
to open the relevant logs.

16. When you open the events in Search mode, you see the aggregate events, which show
that someone has updated the table. Click Messages to drill down into the logs.

17. All the relevant log entries are displayed. If required, you can apply filters in the left
pane to focus on the relevant fields and find the event you need.
In this case, deselect the field ‘AttributeDefinition’.

18. When you go through the update table log events, you realize that someone named
Mike Man changed the read/write parameter (IOPS).

The developer has reconfigured DynamoDB to use lower-provisioned IOPS
(Input/Output Operations Per Second) which caused throttling and the subsequent
outage.

Cost optimization was most likely the developer's motivation for the change, as AWS
charges for DynamoDB based on provisioned Read/Write Capacity Units. Once you
reconfigure the IOPS to an adequate value, the issue should be resolved.

Note: The developer ‘Mike Man’ who configured the DynamoDB is just an example. The
developer name and the IOPS changes may vary when you perform the lab exercise.

So here, you find that the real root cause of the DynamoDB throttling is a change in the
Provisioned IOPS setting of a table. Lowering this setting reduces AWS costs, but it
can also lead to throttling. Such a configuration change might be evident in AWS
CloudTrail logs associated with DynamoDB. So, updating the IOPS value to the allowed
range will solve the issue.
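As an illustration of what you are looking for in the UpdateTable log events, the Python sketch below checks whether an event lowered a table's provisioned capacity. The sample event is fabricated for this example. Its field layout (requestParameters.provisionedThroughput with readCapacityUnits and writeCapacityUnits) is an assumption about how CloudTrail records this call, and the user name, table name, and previous capacity values are hypothetical.

```python
import json

# Hypothetical earlier setting for the table, used as the comparison point.
previous_capacity = {"read": 100, "write": 100}

# Fabricated sample event; the field layout is an assumption, not copied from the lab.
raw_event = json.dumps({
    "eventSource": "dynamodb.amazonaws.com",
    "eventName": "UpdateTable",
    "userIdentity": {"userName": "example-user"},
    "requestParameters": {
        "tableName": "example-table",
        "provisionedThroughput": {"readCapacityUnits": 5, "writeCapacityUnits": 5},
    },
})


def lowered_capacity(raw: str, previous: dict) -> dict:
    """Return {dimension: (old, new)} for each capacity value the event reduced."""
    event = json.loads(raw)
    if event.get("eventName") != "UpdateTable":
        return {}
    new = event.get("requestParameters", {}).get("provisionedThroughput", {})
    changes = {}
    for key, field in (("read", "readCapacityUnits"), ("write", "writeCapacityUnits")):
        value = new.get(field)
        if value is not None and value < previous[key]:
            changes[key] = (previous[key], value)
    return changes


print(lowered_capacity(raw_event, previous_capacity))
```

In the lab itself you do this inspection by eye in the Messages tab. The sketch only shows the comparison you are making.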

Lab 2: Tracing in Sumo Coffee App

With more and more modern apps using microservices architecture, it’s difficult to fully
understand the actual cause of an outage due to the dynamic nature of the architecture.

In this lab, you'll learn how to trace a transaction in Sumo Logic in case of a failure to find out
the root cause of the unsuccessful transaction.

Note: All labs assume you’re using a training+analyst###@sumologic.com account. The data
you see may vary depending on your environment if you’re using your own credentials or a
Sumo Logic trial account instead of a training account. To access a training account, review Lab
0.

Here’s the scenario you'll troubleshoot to isolate the root cause.

Welcome to the Sumo Logic Coffee app, which delivers a hot and fast cup of coffee in
thousands of locations globally. The app runs on the Kubernetes (K8s) platform and uses a
distributed architecture. It uses several microservices, such as user login, order
registration, water service, coffee bean service, and payment service.

Typically, a customer orders coffee from the Sumo Logic Coffee app. The barista turns on the
coffee machine. The machine service then connects with the water service to check whether
the water level is adequate. The coffee service is then initiated to prepare the coffee. At
the same time, a payment request is sent to the Payment System. The SQL commands are
executed and a bill is generated after the bill amount for the coffee service is calculated.

The duration of each span of the transaction is captured in milliseconds or microseconds as
the span completes. With the Span ID and Trace ID, you can track a transaction across the
multiple microservices the app uses.
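The relationship between Trace IDs and Span IDs can be sketched with a small Python example. The span records below are hypothetical (service names, IDs, and durations are invented for illustration), and the structure is a simplification of what a tracing backend stores.

```python
from collections import defaultdict

# Hypothetical spans: every span carries the ID of the trace it belongs to
# and the ID of its parent span (None for the root span).
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "service": "coffee-app", "duration_ms": 120, "error": False},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "service": "machine-service", "duration_ms": 80, "error": False},
    {"trace_id": "t1", "span_id": "c", "parent_id": "b", "service": "water-service", "duration_ms": 15, "error": True},
    {"trace_id": "t2", "span_id": "d", "parent_id": None, "service": "coffee-app", "duration_ms": 95, "error": False},
]


def error_spans_by_trace(all_spans):
    """Group the IDs of failing spans under the trace they belong to."""
    grouped = defaultdict(list)
    for span in all_spans:
        if span["error"]:
            grouped[span["trace_id"]].append(span["span_id"])
    return dict(grouped)


def path_to_root(all_spans, span_id):
    """Follow parent links from a span up to the root, returning services root-first."""
    by_id = {span["span_id"]: span for span in all_spans}
    path = []
    current = by_id.get(span_id)
    while current is not None:
        path.append(current["service"])
        current = by_id.get(current["parent_id"])
    return list(reversed(path))


print(error_spans_by_trace(spans))
print(" -> ".join(path_to_root(spans, "c")))
```

The Trace ID ties all spans of one transaction together, while the parent links show which service called which.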

Abel, the Coffee Lover, orders coffee at 8:45 am using the Coffee App. However, his
transaction is unsuccessful. You need to trace the transaction, troubleshoot it, and address
the issue ASAP.

The lab begins.

1. Begin with the Service Map. To open a service map, in the Sumo UI, click +New, and
then Service Map.
Filter for Application Name= the-coffee-bar-app

2. In the Service Map, you can see that the-coffee-bar-app shows anomalies in some of
its microservices.

3. Mouse over one of the services that are showing anomalies, to see more details:

4. Click the service with the anomaly. This opens the Entity Inspector, where the
latency and error metrics are listed.

5. Click Open In, and then select Traces.

Note: Alternatively, near the top of the Sumo Logic UI, click +New > Traces.
The trace page opens for the selected service. The window shows a list of traces
related to that service, displayed in a table.
Select the first trace in the table. Alternatively, you can filter the table by the
number of errors column.

Click the trace Duration Breakdown to open the trace view.

6. Filter by Error Span Only

7. Click the /get_coffee operation to find out what happened to our coffee order.
Note: If you have selected a different service, the operation will also have a different
name. For example, the water service uses get_water.

This will open the Detail Pane on the right of the screen, on the Summary Tab.
You can see the span details of the transaction and related logs below the details.

8. Click the first Log search to open the logs related to the Span ID.

9. When you go through the log entries, you will see the cause of the error.
Note: If you had chosen another service, for example the water service, the error would
have occurred due to insufficient water in the coffee machine.

10. Inform the barista of the particular error so the problem can be rectified.
Optional: Use the SpanId and open in the Span Analytics tab to search and look further
into the metadata related to the span.

Lab 3: When are Logs the best bet in Observability?

In this lab, you'll learn that when alerted to an incident, diving deep into the logs can
help you troubleshoot and resolve an issue.

Note: All labs assume you’re using a training+analyst###@sumologic.com account.
The data you see may vary depending on your environment if you’re using your own
credentials or a Sumo Logic trial account instead of a training account. To access a
training account, review Lab 0.

Here’s the scenario you are trying to troubleshoot.

As an SRE, you’ve received an alert about an incident. End users have been
experiencing outages in the coffee app in the us-east-1 region over the last 24
hours. The alert indicates that the problem is in AWS Lambda. You need to find out
when all of this started: what metrics are you seeing, and when did this happen?

The lab begins.


1. Near the top of the Sumo Logic UI, click +New > Explore.
We know that the problem is in AWS Lambda. Let’s explore by the option
‘AWS Observability’, and go through Lambda for the Prod us-east-1 region.

2. In the top left corner of the screen that displays, click Explore by, and select
AWS Observability.

3. In the left pane, click the us-east-1 region under the Prod account.

4. The AWS Lambda - Overview dashboard displays the Audit and Performance
panels.

If you scroll down, you can compare function durations to find the functions
that take the longest to run alongside the ones that consume the most
memory. Using the Success and Failure panel, you can see that the
get_billing_info function has recorded the maximum number of errors.

5. Apply a filter to see the details related to billing information by selecting
‘get_billing_info’ in the functionname dropdown at the top of the screen.

6. In the Most Frequent Function Operations panel, you can see an
UpdateFunctionConfiguration operation, so some user changed the function
configuration, which is causing the errors in the get_billing_info function. Let us
investigate: click ‘Open in Search’ to open the related log events in search mode.

7. Click Messages to take a look at the logs for the frequent function operations.

8. Apply the filter using the field browser. Click event_name and select
UpdateFunctionConfiguration.

9. As you can see from the message, Mark Smith is the user who
changed this function. We can then follow up with Mark and roll back his changes
to resolve this issue.

Lab 4: Working with Logs

In this lab, you'll learn how to dive deep to filter and analyze logs and explore the
LogReduce and LogCompare operators.

Note: All labs assume you’re using a training+analyst###@sumologic.com account.
The data you see may vary depending on your environment if you’re using your own
credentials or a Sumo Logic trial account instead of a training account. To access a
training account, review Lab 0.

The lab begins.


1. In the search query box, type the following query:
_sourceCategory=Labs/AWS/CloudTrail
| json field=_raw "awsRegion" as awsregion
| count by awsregion

2. Select the time range Last 60 mins and click Start to execute the query.

The Aggregate panel shows the count by the region.

3. Click Messages. The individual Log entries will be displayed. Use the page
navigation arrows to move within the log pages.

Use the field browser to filter the log entries as per your requirement.

LogReduce Operator:

The LogReduce algorithm uses fuzzy logic to group messages together based on string and
pattern similarity. You can use the logreduce operator to quickly assess activity patterns for
things like a range of devices or traffic on a website. Focus the LogReduce algorithm on an area
of interest by defining that area in the keyword expression.
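To build intuition for grouping messages by pattern similarity, here is a deliberately simplified Python sketch. It is not the LogReduce algorithm: instead of fuzzy logic it masks IP addresses and numbers with regular expressions and counts the resulting templates. The sample messages and the `signature` function are hypothetical.

```python
import re
from collections import Counter


def signature(message: str) -> str:
    """Reduce a message to a template by masking its variable parts."""
    # Mask IPv4-looking tokens first, then any remaining runs of digits.
    masked = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "<ip>", message)
    return re.sub(r"\d+", "<num>", masked)


# Hypothetical log messages.
logs = [
    "Login failed for user 1042 from 10.0.0.5",
    "Login failed for user 77 from 10.0.0.9",
    "Disk usage at 91 percent",
    "Login failed for user 5 from 192.168.1.20",
]

counts = Counter(signature(message) for message in logs)
for template, count in counts.most_common():
    print(count, template)
```

Messages that differ only in their variable parts collapse into one template, which is the same idea the Signatures tab presents on real data.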

There are two ways to use the operator.
● Use the LogReduce button displayed on the results table after running a search.
● Manually add the operator to your query following its syntax.

For our lab, we will use the LogReduce button.

1. Modify the query to remove the ‘count by’ clause, because the
logreduce operator cannot be used with group-by operators such as "count by
field".

_sourceCategory=Labs/AWS/CloudTrail
| json field=_raw "awsRegion" as awsregion

2. Click the LogReduce button.

3. The Signatures tab is displayed with your results.

LogReduce allows you to distill unique messages from the noise by identifying
recurring Signatures in your data. To explore this functionality further, continue
with the following steps.

4. Next, in a new search, run LogReduce on your Snort security data to identify
unusual activity (i.e. intrusions).

_sourceCategory=labs/snort
| logreduce

5. Sort your results by count to identify those that happen only once.

6. Click on the count (1) to view the unusual message.

7. Now click on the host to view surrounding messages to identify the context of
the intrusion.

LogCompare Operator:
LogCompare allows you to compare log activity from two different time periods,
providing you insight on how your current time compares to a baseline. In this case, we
will use LogCompare to identify when signature messages deviate by more than 25%
from the baseline.
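The 25% threshold can be illustrated with a short Python sketch. The formula below (change relative to the baseline count) is an assumption about how a delta percentage is typically defined, not a statement of how LogCompare computes _deltaPercentage internally. The signature names and counts are hypothetical.

```python
def delta_percentage(baseline_count: int, current_count: int):
    """Percent change relative to the baseline, or None if the signature is new."""
    if baseline_count == 0:
        return None  # no baseline to compare against
    return (current_count - baseline_count) / baseline_count * 100.0


# Hypothetical signature counts for the baseline and current time ranges.
baseline = {"sigA": 100, "sigB": 40, "sigC": 10}
current = {"sigA": 110, "sigB": 10, "sigD": 5}

flagged, new_signatures = {}, []
for sig in sorted(set(baseline) | set(current)):
    delta = delta_percentage(baseline.get(sig, 0), current.get(sig, 0))
    if delta is None:
        new_signatures.append(sig)
    elif abs(delta) > 25:
        flagged[sig] = delta

print("Deviating by more than 25%:", flagged)
print("New signatures:", new_signatures)
```

Here sigA changes by only 10 percent and is ignored, while sigB and sigC drop sharply and sigD has no baseline at all.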

1. To explore the functionality of LogCompare, run a summarized query that compares
the current results against a baseline from 24 hours ago.

_sourceCategory=Labs/AWS/CloudTrail
| logcompare timeshift -24h

2. To view only those results where Delta Percentage is more than 25%, add a where
clause for _deltaPercentage.

_sourceCategory=Labs/AWS/CloudTrail
| logcompare timeshift -24h
| where abs(_deltaPercentage) > 25

3. To view results where there is a new Signature in the current time period, add a
where clause for _isNew:

_sourceCategory=Labs/AWS/CloudTrail
| logcompare timeshift -24h
| where (_isNew)

LogExplain Operator:

The LogExplain operator allows you to compare sets of structured logs based on
events you're interested in. Structured logs can be in JSON, CSV, key-value, or any
structured format.

You'll need to specify an event of interest as a conditional statement. This is called the
Event Condition. You can also specify a condition to compare against the
event-of-interest condition. This is called the Against Condition. If no Against Condition
is provided, LogExplain will generate the comparison data set based on the fields in your
Event Condition.

The syntax is:

| logexplain <event_condition> [against <against_condition>] on <fieldname>

LogExplain will process your data against the specified conditions and create separate
data sets to compare:
● A control data set from normal operations data.
● An event-of-interest data set.

LogExplain gathers frequent (at least 5%) joint-column entries, such as key-value pairs,
that occur more frequently in the event-of-interest set than in the control set. The
results indicate what entities correlate with the event you're interested in.

To explore the LogExplain operator, run the following query on CloudTrail:

_sourceCategory=*cloudtrail*
| json field=_raw "userIdentity.userName" as userName nodrop
| json field=_raw
"userIdentity.sessionContext.sessionIssuer.userName" as
userName_role nodrop
| if (isNull(userName), if(!isNull(userName_role),userName_role,
"Null_UserName"), userName) as userName
| json field=_raw "eventSource" as eventSource
| json field=_raw "eventName" as eventName
| json field=_raw "awsRegion" as awsRegion
| json field=_raw "errorCode" as errorCode nodrop
| json field=_raw "errorMessage" as errorMessage nodrop
| json field=_raw "sourceIPAddress" as sourceIp nodrop
| json field=_raw "requestParameters.bucketName" as bucketName
nodrop
| json field=_raw "recipientAccountId" as accountId
| where eventSource matches "s3.amazonaws.com" and accountId
matches "*"
| logexplain (errorCode matches "*AccessDenied*") on sourceIp,
userName, awsRegion, eventName, bucketName

LogExplain returns the following fields in results:

● _explanation
The fields and respective values from the comparison.

● _relevance
The probability that the explanation occurs in the event-of-interest data set.
Values are 0 to 1.

● _test_coverage
The percentage of data in the event-of-interest set that has the explanation. The
link opens a new search that drills down to these logs based on the related
explanation.

● _control_coverage
The percentage of data in the control set that has the
explanation. The link opens a new search that drills down to these logs based on
the related explanation.
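The two coverage fields can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the control set here is simply every log that does not match the Event Condition, and the sample records and IP addresses are fabricated. LogExplain's own construction of the control set and its _relevance score are not reproduced.

```python
def coverage(rows, field, value):
    """Percentage of rows in which the given field has the given value."""
    if not rows:
        return 0.0
    hits = sum(1 for row in rows if row.get(field) == value)
    return hits / len(rows) * 100.0


# Fabricated records, already reduced to the two fields of interest.
logs = [
    {"sourceIp": "203.0.113.7", "errorCode": "AccessDenied"},
    {"sourceIp": "203.0.113.7", "errorCode": "AccessDenied"},
    {"sourceIp": "198.51.100.2", "errorCode": "AccessDenied"},
    {"sourceIp": "203.0.113.7", "errorCode": None},
    {"sourceIp": "198.51.100.2", "errorCode": None},
    {"sourceIp": "198.51.100.9", "errorCode": None},
    {"sourceIp": "198.51.100.9", "errorCode": None},
]

# Assumption: the control set is everything that does not match the Event Condition.
event_set = [row for row in logs if row["errorCode"] == "AccessDenied"]
control_set = [row for row in logs if row["errorCode"] != "AccessDenied"]

test_cov = coverage(event_set, "sourceIp", "203.0.113.7")
control_cov = coverage(control_set, "sourceIp", "203.0.113.7")
print(f"test coverage: {test_cov:.1f}%, control coverage: {control_cov:.1f}%")
```

An explanation that covers much more of the event-of-interest set than of the control set, as this IP address does, is a candidate worth investigating.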

With the provided results you can:
● Click the provided links to drill down and further explore logs from each
explanation.
● Run subsequent searches.
For example, if an IP address is an outlier you might search for logs referencing
that IP address for further investigation.
