IT1171

© 2019 SPLUNK INC.

Accelerate your ability to sniff out application exceptions and detect outliers in performance KPIs

PJ Pokhrel
Performance Engineer | StubHub

Steve Veio
Ops Manager | StubHub

Eurus Kim
Staff ML Architect | Splunk

Agenda

1. Building an Exception Sniffer
2. Process and setup
3. Applying the Use Cases
4. Splunk Metrics and Machine Learning
5. Enhanced Use Cases - Smarter Alerts
6. Summary

StubHub
Introduction to StubHub

StubHub is the world’s most trusted ticket marketplace. Owned by eBay, it provides services for buyers and sellers of tickets for sports, concerts, theater and other live entertainment events.

E-commerce site with desktop, mobile-web and native app products

Our Stack
How do you detect production issues early in this complexity?

• Distributed microservice architecture
• About 70 roles (pools/instance groups)
• About 4,700 servers, of which 1,450 run Java
• Three pods (production environment)
• Over 1,000 endpoints

Exceptions Sniffer
What is the exceptions sniffer?

“Exception Sniffer” is the name we gave our tool that helps us extract, track and use exception data to gain insights into our application behavior and performance.

It tracks Java and business exceptions on all of our servers running Java.

Common Types of Exceptions in our Stack

• java.io.IOException
• java.lang.NumberFormatException
• java.lang.NullPointerException
• java.net.SocketTimeoutException
• java.sql.SQLException
• …

• SHBadRequestException
• SHResourceNotFoundException
• UserNotAuthorizedException
• …

Exceptions Sniffer v1 (old version)
Architecture overview and issues

Overview:
• Internal Java application which sniffs Errors and Exceptions in Java apps from application logs
• Calls Splunk REST APIs
• Data processed by the sniffer and saved to PostgreSQL
• Rules engine
• Alert manager module to send alerts

Issues:
• Became slower as data grew
• Time consuming
• Server maintenance
• Dependent on Splunk REST APIs
• No native Machine Learning support


Exceptions Sniffer v2!
Overview of exceptions sniffer v2

• Built in Splunk
• Set of data models, metrics, dashboards and alerts
• Uses Splunk components: metric store, alerts, dashboards, machine learning functions, etc.
• Allows us to store a lot of data without worrying about space, and reduces the time to generate weekly and monthly reports

Requirements and Use Cases

• Deeper insights into data patterns and ability to use trends for debugging and troubleshooting
• Less time maintaining application and servers and lean hardware / storage
• Better alerting
• Fast searching of large amounts of historical data
• Creating month to date trends
• Whitelist functionality

Journey
How we got started

• Started very simple :)
– index=java AND exceptions
• Next, added the auto-extracted java_exceptions field
– index=java AND java_exceptions=*
• Next, added more dimensions
– index=java AND java_exceptions=* | stats count by role,pod,java_exceptions
• Next...

Journey: Event Logs


Event logs retention and performance

• Querying event logs took very long, especially for weekly and monthly reports
• Event logs had only 30 days of retention, so historical data was lost and we did not have enough data to build a good model

Journey: Splunk Metric
We used our initial search query and mcollect to store data as a metric and mstats to query the metric

Save (external version):
index=java java_exception=*
| stats count AS _value by _time,host,dimension1,dimension2
| eval metric_name="java.exceptions.count"
| mcollect index=metrics

Query (external version):
| mstats sum(java.exceptions.count) as "ExceptionCount" where index=metrics earliest=-7d@w1 latest=@w1 span=1d
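
The save/query split above can be pictured outside Splunk as a two-step rollup. The following is a minimal Python sketch, not Splunk code: raw events are pre-aggregated once into small metric points, and reports then sum those points instead of rescanning raw logs. The sample events and the names `rollup` and `daily_totals` are hypothetical and used only for illustration.

```python
from collections import Counter

# Hypothetical raw events: (epoch_seconds, host, exception_name).
raw_events = [
    (0, "web01", "java.io.IOException"),
    (10, "web01", "java.io.IOException"),
    (70, "web02", "java.sql.SQLException"),
    (86_400, "web01", "java.io.IOException"),
]


def rollup(events, bucket_seconds=60):
    """Pre-aggregate raw events into (bucket_start, host) -> count points."""
    points = Counter()
    for ts, host, _exception in events:
        bucket = ts - ts % bucket_seconds  # start of the time bucket
        points[(bucket, host)] += 1
    return points


def daily_totals(points, day_seconds=86_400):
    """Sum the stored points per day index, similar in spirit to span=1d."""
    totals = Counter()
    for (bucket, _host), count in points.items():
        totals[bucket // day_seconds] += count
    return dict(totals)


metric_points = rollup(raw_events)
print(daily_totals(metric_points))
```

The report step touches only the handful of stored points, which is the intuition behind the faster retrieval from a metrics index.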

Splunk Metrics
Data retrieval performance before and after

• Searching 15 minutes of raw logs
– 5 minutes 59 seconds
• Searching 15 minutes of metrics index data
– 3 seconds

Journey: Exception Sniffer Use Cases

• Dashboards and Views
• Reports and Trending
• Used for DevOps Alerts
• Used for modeling for anomaly detection

Dashboard
Splunk dashboard that allows users to filter and group data

Weekly Report
Weekly exceptions report heatmap view by role

Error percentage calculation
Using tstats for calculating error rate

• We had the exceptions count as a metric but no other reference to use for calculating an error rate
• Used tstats to solve this problem and calculate the error percentage

Tstats example:
| tstats count WHERE index=java sourcetype=log4j by sh_role _time span=1d
| outputlookup exception_sniffer_tstat_output.csv
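
The slide does not show the final division. Below is a minimal Python sketch, assuming the error percentage is the exception count divided by the total log event count (the tstats output) for the same role and day. The role names, dates, and counts are made up for illustration, and `error_percentage` is a hypothetical helper, not part of the tool.

```python
def error_percentage(exception_counts, total_event_counts):
    """Return {(role, day): percent} for keys that have a usable total."""
    result = {}
    for key, exceptions in exception_counts.items():
        total = total_event_counts.get(key, 0)
        if total > 0:  # skip keys without a reference count (avoids division by zero)
            result[key] = 100.0 * exceptions / total
    return result


# Hypothetical daily counts keyed by (role, day).
exceptions = {("checkout", "2019-10-01"): 50, ("search", "2019-10-01"): 5}
totals = {("checkout", "2019-10-01"): 10_000, ("search", "2019-10-01"): 2_000}

print(error_percentage(exceptions, totals))
```

Normalizing by total events makes roles with very different traffic volumes comparable in one report.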

Monthly Report
Monthly exceptions trend report

Reports are great, but we also needed data in real-time to take action

Exploratory Data Analysis
Overview of exceptions data

• About 70 roles (clusters/instance groups/pools) with different exception rate patterns
• Disparate, distinct shapes of seasonality in the data
• Lots of high and low values
• Same data, different patterns
• Typically 40 million logging events per minute

Our Exception Data Pattern
Pattern 1

• Some roles have a large number of ongoing exceptions (mostly business exceptions)
• Brokers and selling-related errors
• Exception rate is greater than 1,000 exceptions per minute on average
• Some seasonality (large spikes at different times of the day)

Our Exception Data Pattern
Pattern 2

• Exception rate is greater than 100 but less than 1,000 exceptions per minute on average
• Large variance with lots of high and low values
• Ancillary roles, page controllers, batch services

Our Exception Data Pattern
Pattern 3

• Very few or zero ongoing exceptions

Alerts
Actionable alert policies

New Exceptions Alert
• A scheduled job writes that day's exceptions to a lookup file (via outputlookup) at midnight
• The alert job queries exceptions for the last 15 minutes and filters the result using the lookup file

Critical Exceptions Alert
• Created a lookup file (via outputlookup) that contains a list of critical exceptions (e.g. java.lang.OutOfMemoryError)
• The alert job runs every 15 minutes and reports if any exceptions from the critical exceptions list were found
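
Both alert policies reduce to set operations. The following Python sketch illustrates that logic only; it assumes that "filters the result using the lookup file" means dropping exceptions already present in the baseline. The sample data and the names `new_exceptions` and `critical_exceptions` are hypothetical.

```python
def new_exceptions(recent, baseline):
    """Exceptions seen in the recent window that are absent from the baseline."""
    return sorted(set(recent) - set(baseline))


def critical_exceptions(recent, critical):
    """Exceptions seen in the recent window that are on the critical list."""
    return sorted(set(recent) & set(critical))


# Hypothetical contents of the two lookup files.
baseline = {"java.io.IOException", "SHBadRequestException"}
critical = {"java.lang.OutOfMemoryError"}

# Hypothetical exception names observed in the last 15 minutes.
last_15_minutes = [
    "java.io.IOException",
    "java.lang.OutOfMemoryError",
    "java.sql.SQLException",
]

print(new_exceptions(last_15_minutes, baseline))
print(critical_exceptions(last_15_minutes, critical))
```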

Needed Intelligent Alerts
Experiments with creating smarter alerts

• Threshold method
– Standard deviation
– Standard deviation with sliding window
– Median absolute deviation
• Other ML algorithms
– Clustering to find underlying structure in exception data
– Probability density function

Understanding a bit of the ML behind smarter alerts
Eurus Kim | ML Architect | Splunk

Numeric Outlier Detection with MLTK
Trying to create smarter alerts with statistics

Starting with basic thresholding using Standard Deviation, Median Absolute Deviation, or Interquartile Range

Global outliers are found, but local outliers are undetected
For seasonal data, thresholds are too large at certain times
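
As a rough illustration of basic thresholding, the Python sketch below computes one global pair of bounds over the whole series, using standard deviation and median absolute deviation (interquartile range is omitted). It is not the MLTK implementation; the multiplier of 2 and the sample counts are assumptions chosen for the example.

```python
import statistics


def stdev_bounds(values, multiplier=2.0):
    """Global bounds: mean +/- multiplier * population standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return mean - multiplier * sd, mean + multiplier * sd


def mad_bounds(values, multiplier=2.0):
    """Global bounds: median +/- multiplier * median absolute deviation."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return med - multiplier * mad, med + multiplier * mad


def outliers(values, bounds):
    """Return the values that fall outside the (low, high) bounds."""
    low, high = bounds
    return [v for v in values if v < low or v > high]


# Hypothetical exception counts per interval with one obvious spike.
counts = [10, 12, 11, 9, 10, 11, 10, 12, 9, 200]

print(outliers(counts, stdev_bounds(counts)))
print(mad_bounds(counts))
```

Because a single pair of bounds covers the whole series, a spike that is large only relative to its neighbors can sit inside the bounds, which is the "local outliers are undetected" limitation noted above.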

Numeric Outlier Detection with MLTK
Getting a little more advanced

Using Standard Deviation with a sliding window

Both global and local outliers found, but now it’s way too noisy
Thresholds are too small or large at certain periods
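
A sliding window recomputes the bounds from only the most recent points. The Python sketch below is one way to express that idea, assuming each point is compared against the window of values that precede it. The window size, multiplier, and series are illustrative assumptions, not values from the talk.

```python
import statistics
from collections import deque


def sliding_stdev_outliers(values, window=4, multiplier=3.0):
    """Return indexes of points outside mean +/- multiplier * stdev
    of the preceding `window` points."""
    history = deque(maxlen=window)
    flagged = []
    for index, value in enumerate(values):
        if len(history) == window:  # wait until the window is full
            mean = statistics.fmean(history)
            sd = statistics.pstdev(history)
            if abs(value - mean) > multiplier * sd:
                flagged.append(index)
        history.append(value)  # the oldest point drops out automatically
    return flagged


# Hypothetical series with a spike at index 6.
series = [10, 12, 10, 12, 10, 12, 40, 10, 12, 10, 12]
print(sliding_stdev_outliers(series))
```

Note that once the spike enters the window it inflates the standard deviation, so the bounds widen for the next few points. That sensitivity is one reason thresholds end up too small or too large at certain periods.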

Numeric Outlier Detection with MLTK
Getting a little more advanced

Further testing using Median Absolute Deviation with a sliding window

The boundaries look better, but still appear to be way too noisy
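
The same sliding-window idea can use the median and the median absolute deviation (MAD), which are less affected by a single extreme value inside the window. This is a minimal Python sketch under the same assumptions as before (each point is compared with the preceding window). The window size, multiplier, and series are made up for the example.

```python
import statistics
from collections import deque


def sliding_mad_outliers(values, window=4, multiplier=3.0):
    """Return indexes of points outside median +/- multiplier * MAD
    of the preceding `window` points."""
    history = deque(maxlen=window)
    flagged = []
    for index, value in enumerate(values):
        if len(history) == window:  # wait until the window is full
            med = statistics.median(history)
            mad = statistics.median(abs(v - med) for v in history)
            if abs(value - med) > multiplier * mad:
                flagged.append(index)
        history.append(value)
    return flagged


# Hypothetical series: a large spike at index 6, a smaller one at index 8.
series = [10, 12, 10, 12, 10, 12, 40, 10, 25, 10, 12]
print(sliding_mad_outliers(series))
```

In this series the second, smaller spike is still flagged even though the first spike is inside its window, because the median and MAD barely move. Tight bounds like these also flag more points on volatile data.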

Understanding the Shape of our Data
Viewing the frequency of values (histogram) against a normal distribution curve

(Figure: histogram with a normal curve; x-axis marked at -2 SD, -1 SD, Average, +1 SD, +2 SD, +3 SD)

Understanding the Shape of our Data
What if we could draw a curve that better fits our data?

Leveraging the DensityFunction Algorithm
Allowing the math to figure out the best fit curve

| fit DensityFunction requests threshold=0.01 into MyModel

| apply MyModel
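
To give a feel for the fit and apply steps above, here is a deliberately simplified Python sketch. It is not how MLTK implements DensityFunction: it assumes a single normal distribution and treats the threshold as the combined area of the two outer tails. The training values and the names `fit_density` and `apply_density` are hypothetical.

```python
from statistics import NormalDist


def fit_density(values):
    """'fit' step: estimate a normal distribution from training values."""
    return NormalDist.from_samples(values)


def apply_density(model, values, threshold=0.01):
    """'apply' step: mark values that land in the outer tails whose
    combined area under the fitted curve equals `threshold`."""
    low = model.inv_cdf(threshold / 2)
    high = model.inv_cdf(1 - threshold / 2)
    return [(v, v < low or v > high) for v in values]


# Hypothetical request counts used for training.
training = [100, 102, 98, 101, 99, 100, 103, 97, 100, 100]
model = fit_density(training)

print(apply_density(model, [100, 104, 120]))
```

Separating fit from apply means the model can be trained on history once and then reused cheaply by a frequently scheduled alert.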

Splitting our Data with DensityFunction
Given that our data is cyclical, should we split the data by time?

The multi-modal nature of our data is probably due to the fact that our data is cyclical
Consider what else you may want to split your data by (app type, user group, etc.)

(Figure: boxplot of every 3rd hour of the day)

Splitting data using DensityFunction

| fit DensityFunction requests by "hour" into MyModel
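
Splitting by hour amounts to fitting one distribution per hour and scoring each value against the model for its own hour. The Python sketch below illustrates that with the same simplifying assumptions as a single normal fit per group and a two-tailed threshold. It is not the MLTK implementation, and the sample data and helper names are hypothetical.

```python
from collections import defaultdict
from statistics import NormalDist


def fit_by_hour(samples):
    """Fit one normal distribution per hour from (hour, value) samples."""
    grouped = defaultdict(list)
    for hour, value in samples:
        grouped[hour].append(value)
    return {hour: NormalDist.from_samples(vals) for hour, vals in grouped.items()}


def is_outlier(models, hour, value, threshold=0.01):
    """Score a value against the model fitted for its own hour."""
    model = models[hour]
    low = model.inv_cdf(threshold / 2)
    high = model.inv_cdf(1 - threshold / 2)
    return value < low or value > high


# Hypothetical training data: quiet at 03:00, busy at 15:00.
samples = [(3, v) for v in (10, 11, 9, 10, 12, 8)]
samples += [(15, v) for v in (500, 510, 490, 505, 495, 500)]
models = fit_by_hour(samples)

print(is_outlier(models, 3, 100))   # 100 is unusual for the quiet hour
print(is_outlier(models, 15, 500))  # 500 is normal for the busy hour
```

A value that is routine in a busy hour can be anomalous in a quiet one, which a single model over all hours would struggle to capture for cyclical data.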

Splunk Machine Learning Advisory Program

• Get help from the Splunk Data Scientists to solve your business use case with Machine Learning Toolkit
• Complimentary support with your Enterprise or Cloud license
• Early access to new Machine Learning features
• Results in opportunity to tell your success story with Splunk
• Contact [email protected] for more information

Want to Learn More About Density Function?
Additional sessions to further deep dive on the theory and example use cases

Comparing Threshold vs Density Function
Actionable alert policies

Threshold Method:
Pros
• Easy to understand
• MLTK assistant available
Cons
• Doesn’t support fit or apply
• Complicated to use for alerts

Probability Density Function (PDF):
Pros
• Better for outlier detection
• Supports fit and apply, so easier to set up
Cons
• Not available in MLTK assistant yet

Effective Alerts
Final findings for the most effective alert for each exception pattern

Pattern 3 (zero ongoing)
• Diff alert and basic static threshold

Pattern 2 (medium ongoing exception)
• Diff alert and probability density function

Pattern 1 (ongoing pattern)
• Diff method (New exception alert and Critical exceptions)

How Tracking Exceptions Has Helped Us

• 50% reduction in exceptions logging
– Duplicate logging
– Don’t log stack trace for business exception
• Found several hidden application issues
• Using exceptions as one indicator of an issue
• More accurate alerts
• Cleaner logs, reduced noise
• More accountability, better code quality

Lessons Learned
The lessons we learned…

1. Create metrics for faster data retrieval
2. Know your dataset! (avg, p90, median, standard deviation, etc.)
• Use histogram to get a better understanding of the distribution
• Use PDF function to figure out the distribution pattern for you
3. Threshold method works well when data is normally distributed but can be a little complicated to create
4. For more complex data, create different alert policy for each pool
5. Each use case is different

Thank You!
Go to the .conf19 mobile app to RATE THIS SESSION

Q&A
Steve Veio | Ops Manager
PJ Pokhrel | Performance Engineer
Eurus Kim | Staff ML Architect | Splunk
