
External Notes on Monitoring

High-level takeaways:
● Alerts are not logs — optimise for a false positive rate of zero, even if the false negative
rate is nonzero
● Measure work being performed that affects business value (e.g. CAPS: capacity,
availability, performance, scalability)
● Consider the best type of metric for the job: gauge, counter, meter, histogram, timer
● Use OODA (observe, orient, decide, act) to build a culture of observability
○ You can’t improve what you can’t measure
○ You can’t measure what you can’t observe
○ Decide what you care about and how to measure it
○ Find a way to get the data
● Aim to deliver knowledge, not just information. Start with a hypothesis and ask the
graphs the question; don’t go to a graph and then ask what the numbers mean
● Don’t use static thresholds — tune thresholds properly

Metrics, metrics everywhere


● “Our code generates business value when it runs, not when we write it”
● Map ≠ territory
● Mind the gap between perception and reality
● Measuring code in the wild means measuring code in production
● What to measure
○ What does this code do that affects its business value?
○ How do we measure that?
● 5 types of metrics to consider (a sketch of a reservoir-sampled histogram and an EWMA meter follows at the end of this section)
○ Gauge — the instantaneous value of something (e.g. # of cities)
○ Counters — an incrementing and decrementing value (e.g. # of sessions)
○ Meters — average rate of events over a period of time (e.g. # requests/sec)
■ Care less about the overall average, and more about the change over
time
■ “Average of 1,000 requests per second” is less meaningful than “We went
from 3,000 to 500 requests per second”
■ Exponentially weighted moving average
○ Histograms — statistical distribution of values in a stream of data (e.g. # of cities
returned, 95% of autocomplete results returned 3 cities or fewer)
■ Mean is really useful for normally distributed data, but most data is not
■ Quantiles (median, 75%, 95%, 98%, 99%, 99.9%)
■ Reservoir sampling — keep a statistically representative sample of
measurements as they happen
● Vitter’s Algorithm R uses uniform sampling
● Forward-decaying priority sampling weights recent measurements more heavily (recency bias)
○ Timers — histogram of durations and meter of calls (at ~2,000 requests per
second, our latency jumps from 13ms to 453ms)
● For perspective, each service of Yammer exports 40-50 metrics
● Monitor values for current problems in order to respond in real-time
● Aggregate values for historical perspective to see long-term patterns
● Benefits
○ Go faster and shorten decision-making cycle
○ If we do this faster, we will win
○ Fewer bugs, more features
○ Make better decisions by using numbers
● OODA
○ Observe — What is the 99% latency of our autocomplete service right now?
500ms
○ Orient — How does this compare to other parts of our system, both currently and
historically?
○ Decide — Should we make it faster? Or should we add a new feature X?
○ Act — Write some code
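
A minimal sketch of two of the metric types above, using only the Python standard library: a histogram backed by Vitter's Algorithm R (a uniform reservoir sample of the stream) with a quantile readout, and a simple exponentially weighted moving-average meter. The class names, reservoir size, and smoothing factor are illustrative, not the API of any particular metrics library.

    import math
    import random

    class UniformReservoirHistogram:
        """Histogram that keeps a uniform sample of a stream (Vitter's Algorithm R)."""
        def __init__(self, size=1028):
            self.size = size      # reservoir capacity
            self.values = []
            self.count = 0        # total values seen

        def update(self, value):
            self.count += 1
            if len(self.values) < self.size:
                self.values.append(value)
            else:
                # Each new value replaces a random slot with probability size/count.
                slot = random.randint(0, self.count - 1)
                if slot < self.size:
                    self.values[slot] = value

        def quantile(self, q):
            """Nearest-rank quantile of the sampled values, e.g. q=0.95."""
            if not self.values:
                return None
            ordered = sorted(self.values)
            idx = max(0, min(len(ordered) - 1, math.ceil(q * len(ordered)) - 1))
            return ordered[idx]

    class EWMAMeter:
        """Meter: exponentially weighted moving average of an event rate."""
        def __init__(self, alpha=0.2, interval_seconds=5.0):
            self.alpha = alpha                    # weight given to the newest interval
            self.interval_seconds = interval_seconds
            self.pending = 0
            self.rate = None                      # events per second

        def mark(self, n=1):
            self.pending += n

        def tick(self):
            """Call once per interval, e.g. from a background timer."""
            instant = self.pending / self.interval_seconds
            self.pending = 0
            self.rate = instant if self.rate is None else self.rate + self.alpha * (instant - self.rate)

    # Example: 95th percentile of autocomplete result counts.
    hist = UniformReservoirHistogram()
    for _ in range(10_000):
        hist.update(random.randint(0, 5))
    print("p95 cities returned:", hist.quantile(0.95))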

Crafting performance monitoring tools


● Tools
○ Logster — generate metrics from logfiles
○ Graphite — store and graph metrics
○ Internal graph dashboard for engineers to manually view the impact of changes
○ Nagios — individual threshold monitoring for page performance
■ Important parameters include the monitoring interval for normal vs. poor performance, and the length of time between the start of bad performance and when a notification goes out
■ Service dependencies to only give actionable alerts
● Metrics Etsy tracks
○ Timer
■ Time taken to load page (60s intervals)
● Number of page visits
● Average time
● Median time
● 95th percentile
● Monitoring
○ Evolution of approach
■ 5 slowest pages by 95th percentile
● Not great because it didn't change much from week to week
● Didn’t track the performance of the pages that needed to be most
performant
■ Replaced with a regressions report: page performance in the last 24 hours compared to performance over the previous 7 days (a sketch of this comparison follows at the end of this section)
● Better but still not great since it didn’t catch slow creep
regressions
● Difficult to tune since each page has a different performance
pattern (e.g. seasonality)
● Required a lot of additional investigation and caused alert fatigue
■ Email alerts
● Setting page-specific thresholds to create alert graphs
● Minimise inactionable alerts
● Improved context tools to make investigations easier
■ Recognising performance improvements
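
A rough sketch of the regressions report described above, not Etsy's actual implementation: compare each page's 95th percentile load time over the last 24 hours against the previous 7 days, with a hypothetical per-page threshold because each page has its own performance pattern.

    from statistics import quantiles

    def p95(samples):
        """95th percentile via statistics.quantiles (n=20 gives 19 cut points)."""
        return quantiles(samples, n=20)[18]

    def regression_report(last_24h, prev_7d, page_thresholds, default_ratio=1.25):
        """last_24h / prev_7d map page name -> list of load times in ms.
        page_thresholds maps page name -> allowed ratio of recent p95 to baseline p95."""
        regressions = []
        for page, recent in last_24h.items():
            baseline = prev_7d.get(page)
            if not recent or not baseline:
                continue
            recent_p95, baseline_p95 = p95(recent), p95(baseline)
            allowed = page_thresholds.get(page, default_ratio)
            if recent_p95 > allowed * baseline_p95:
                regressions.append((page, baseline_p95, recent_p95))
        # Worst regressions (largest relative slowdown) first.
        return sorted(regressions, key=lambda r: r[2] / r[1], reverse=True)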

Monitoring challenges
● Measuring business value
○ Customer happiness
■ Time to value
■ Availability
■ Response time
○ Cost efficiency
■ Utilisation
■ Optimisation
■ Automation

Better living through statistics: Monitoring doesn't have to suck


● What is monitoring?
○ Measuring
○ Recording
○ Alerting
○ Visualisation
● Tips on approach
○ Automate the boring parts and do the fun things yourself
○ Alert based on rate of change of time series (e.g. errors per interval)
○ A useful minimum is to look back at 2.5x the sampling interval
○ Alerts are not logs
○ Make sure alerts are actionable, then document them
● Three approaches to monitoring
○ Blackbox monitoring resulting in alerts
■ Treat the system as opaque: you can only use it as a user would
■ Cucumber-Nagios
■ Problems
● Only boolean, no visibility into why (e.g. why is the site slow, why
has image serving stopped working)
● No predictive capability (how long until we need more datacenters)
● No historical data
■ Problems with check-alert model
● Tuning is difficult when thresholds are all different
● Adding new checks is difficult because you need to write new scripts
● Scripts perform measurement and judgment all at once
● Inactionable alerts
● Costs of monitoring systems are hard to scale
○ Whitebox monitoring resulting in charts
■ Expose the internal state of the system for inspection
■ Graphite, New Relic, Metrics
● Types of time series
○ Counters — only go up
■ No loss of meaning when sampling
○ Gauges — can go up and down
■ No idea of values in between sampling, so prefer counters
■ A gauge-like rate can be derived from a counter by differencing successive samples (see the sketch at the end of this section)
● What to measure
○ Queries per second
○ Errors per second (by error type)
○ Latency (by query type, response code, payload size)
○ Bandwidth (by direction, query type, response code)
● What not to measure
○ Load average
● What to alert on
○ Rate of change of queries per second outside normal cycles
○ Ratio of errors to queries
○ Latency (mean, 95th percentile) too high
○ Rate of change of bandwidth
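
A small sketch tying the whitebox points together, with made-up metric names and placeholder thresholds: derive per-second rates from successive counter samples (treating a decrease as a counter reset), then alert on the error-to-query ratio and on a large deviation of the query rate from a baseline, rather than on raw values.

    def rate(prev_count, prev_ts, curr_count, curr_ts):
        """Per-second rate from two counter samples; a decrease means the counter reset."""
        elapsed = curr_ts - prev_ts
        if elapsed <= 0:
            return 0.0
        delta = curr_count - prev_count
        if delta < 0:              # process restarted, counter started over
            delta = curr_count
        return delta / elapsed

    def check_alerts(prev, curr, max_error_ratio=0.01, max_qps_deviation=0.5):
        """prev/curr are dicts with 'ts', 'queries', 'errors' counter samples.
        Thresholds are placeholders and would be tuned per service."""
        qps = rate(prev["queries"], prev["ts"], curr["queries"], curr["ts"])
        eps = rate(prev["errors"], prev["ts"], curr["errors"], curr["ts"])
        alerts = []
        if qps > 0 and eps / qps > max_error_ratio:
            alerts.append(f"error ratio {eps / qps:.2%} exceeds {max_error_ratio:.2%}")
        baseline_qps = curr.get("baseline_qps")    # e.g. same interval last week
        if baseline_qps and abs(qps - baseline_qps) / baseline_qps > max_qps_deviation:
            alerts.append(f"qps {qps:.0f} deviates more than {max_qps_deviation:.0%} from baseline {baseline_qps:.0f}")
        return alerts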

Monitoring is Dead. Long Live Monitoring.


● No major takeaways

What should I monitor and how should I do it?


● Two types of monitoring
○ Health checks
■ Server/service is alive/dead
■ Metrics vs. threshold checks
■ Nagios (old) / Sensu (new)
■ Shortcomings
● Not designed to support a method/process
● Focus on meaningless things
● Passive acceptance of what’s there
● Simplistic (OK/WARN/CRIT)
○ Metrics
■ Graphite / Cacti
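
The OK/WARN/CRIT model can be made concrete with a minimal Nagios/Sensu-style check: plugins conventionally exit 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. The endpoint URL and latency threshold below are placeholders.

    #!/usr/bin/env python3
    # Minimal health check in the Nagios plugin exit-code convention.
    import sys
    import time
    import urllib.request

    OK, WARNING, CRITICAL = 0, 1, 2
    URL = "http://localhost:8080/health"      # placeholder endpoint
    WARN_SECONDS = 2.0                        # placeholder latency threshold

    def main():
        start = time.monotonic()
        try:
            with urllib.request.urlopen(URL, timeout=5) as resp:
                elapsed = time.monotonic() - start
                if resp.status != 200:
                    print(f"CRITICAL: {URL} returned HTTP {resp.status}")
                    return CRITICAL
        except Exception as exc:
            print(f"CRITICAL: {URL} unreachable: {exc}")
            return CRITICAL
        if elapsed > WARN_SECONDS:
            print(f"WARNING: {URL} responded in {elapsed:.2f}s")
            return WARNING
        print(f"OK: {URL} responded in {elapsed:.2f}s")
        return OK

    if __name__ == "__main__":
        sys.exit(main())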
● Maxims
○ You can’t improve what you can’t measure
○ You can’t measure what you can’t observe
○ Decide what you care about and how to measure it
○ Find a way to get the data
● Care about and measure CAPS
○ Capacity
○ Availability
○ Performance
○ Scalability
● Provide observability for operations staff
○ Goal: make it easy (minimal cognitive burden) to observe CAPS across large numbers (thousands) of services, systems, and subsystems
● Consider the past, present, and future
○ Past
■ Inspection
■ Reinspection
■ Comparison between time periods
■ Post-mortems, time comparison, reporting
○ Present
■ Goals: determine health, availability, trend
■ Health: are the systems alive, reachable, responsive, fast, consistent, and
correct?
■ Trend: what’s the current state and where is it headed?
■ Observability, troubleshooting, incident response
○ Future
■ Predict problems (and avoid them?)
■ Avoid downtime (increase availability)
■ Capacity planning, procurement, avoiding downtime
● Support a performance analysis method
○ Goal-driven performance optimisation (Percona)
○ Method R (Cary Millsap)
○ USE Method (Brendan Gregg)
● Omit meaningless data
○ Answer the ‘so what?’ question
○ Deliver knowledge, not just information
○ Start with a hypothesis and ask the graphs the question; don’t go to a graph and
then ask what the numbers mean
○ Anticipate the user’s need (support a method)
○ Make the data available, but don’t lead with it
● Support ad-hoc inspection, active alerts, and periodic reports
○ 3 ways to deliver information
■ Information on request
■ Information on triggers (events, incidents)
■ Scheduled reports
○ Alerts are most important to get right because they have the highest cost
■ Prevent needless investigation
■ Include the relevant ‘so what?’ data
● Fault detection
○ React to faults by capturing more information
○ Often more important to be able to conduct an investigation after-the-fact, not to
alert and try to fix in real-time
○ False positives
■ Avoid at all costs
○ False negatives
■ It’s ok to have some
○ Static thresholds are terrible because they lead to high false-positive and false-negative rates
○ Anomaly detection can be very noisy
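
One common way to trade a few extra false negatives for fewer false positives is to require several consecutive bad intervals before paging; a minimal sketch, with an illustrative threshold and interval count:

    class ConsecutiveBreachAlert:
        """Fire only after `required` consecutive samples breach the threshold,
        so a single noisy interval does not page anyone."""
        def __init__(self, threshold, required=3):
            self.threshold = threshold
            self.required = required
            self.breaches = 0

        def observe(self, value):
            if value > self.threshold:
                self.breaches += 1
            else:
                self.breaches = 0
            return self.breaches >= self.required   # True -> alert

    alert = ConsecutiveBreachAlert(threshold=0.05, required=3)  # e.g. 5% error ratio
    for sample in [0.02, 0.09, 0.01, 0.08, 0.07, 0.11]:
        if alert.observe(sample):
            print("page on-call: error ratio elevated for 3 consecutive intervals")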
● Capture raw data in high resolution
○ Don’t predigest data, but go directly to the source
● Performance can be divided into task performance and resource performance
○ Task: a user task (e.g. adding an item to a shopping cart)
■ Measuring task performance
● Good performance is consistently fast performance
● Performance = response time
● Measure SLA (e.g. 95% of tasks take <1s) and hard thresholds
(e.g. no tasks take >5s)
○ The SLA is the acceptable response time for tasks, in seconds, as a 95th or 99th percentile over intervals of time
○ The SLA interval should be short: over a long interval, the 5% that is allowed to be slow adds up (5% of a day is 72 minutes, while 5% of a 5-minute window is only 15 seconds)
● Metrics
○ Little's Law: Concurrency = Throughput x Response Time (see the worked sketch below)
○ Queueing theory: Response Time = Service Time + Wait Time
○ Error rate
■ Improving task performance
● Classify the work by user goals, and generate a profile (aggregate,
rank, sort)
● Determine what’s worth improving
● Look for outliers and poor experiences
● Goals
○ Improve user-visible response time & consistency
○ Reduce server load; increase throughput
○ Reduce collateral damage
○ Resource (e.g. server, disk)
■ Measuring resource performance
● Time elapsed
● Throughput
● Work-in-flight (current concurrency)
● Total response time
● Busy time
● Availability: the absence of downtime
○ MTBF (Mean Time Between Failures) vs. MTTR (Mean
Time To Recovery)
○ Diagnosing stalls (short faults)
■ Transient, partial faults are common
■ Was the resource overloaded or underperforming?
■ Measure the sources of load/demand
■ Measure utilisation, backlog, errors/exceptions
● Capacity
○ Hard limit or soft limit?
○ Quantify capacity limit
○ Forecast when it will be reached
○ Server capacity
■ Max throughput with good performance (meeting
SLA)
■ Universal scalability law
● Metrics
○ Availability
■ MTBF, MTTR
■ % as determined by SLA
○ Capacity
■ Backlog (load, queue length)
■ Max capacity as per USL
■ Max provided capacity (e.g. disk size, CPU cycles
per second)
○ Scalability
■ Universal scalability law
○ Utilisation
■ Utilisation Law: Utilisation = Busy Time /
Observation Time
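
A worked sketch of the formulas above with made-up numbers; the Universal Scalability Law coefficients would normally be fitted from measured throughput, not assumed.

    # Little's Law: Concurrency = Throughput x Response Time
    throughput = 1000            # requests per second
    response_time = 0.050        # seconds
    print("in-flight requests:", throughput * response_time)        # 50.0

    # Queueing theory: Response Time = Service Time + Wait Time
    service_time = 0.020
    wait_time = response_time - service_time                        # 0.030 s spent queued

    # Availability from MTBF and MTTR (a common approximation)
    mtbf_hours, mttr_hours = 720.0, 1.0
    availability = mtbf_hours / (mtbf_hours + mttr_hours)
    print(f"availability: {availability:.4%}")                      # ~99.86%

    # Utilisation Law: Utilisation = Busy Time / Observation Time
    busy_time, observation_time = 45.0, 60.0
    print("utilisation:", busy_time / observation_time)             # 0.75

    # Universal Scalability Law: relative capacity at N nodes,
    # C(N) = N / (1 + alpha*(N - 1) + beta*N*(N - 1))
    def usl(n, alpha=0.03, beta=0.0005):     # contention and coherency penalties (illustrative)
        return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

    for n in (1, 8, 32, 64):
        print(f"relative capacity at {n} nodes: {usl(n):.1f}")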
● What to measure
○ Think about what we want to measure vs. what’s easy to measure
○ Questions to ask
■ Is the work getting done?
■ How fast, and how consistently?
■ How often are there errors?
■ How much work is queued?
■ How much work can we request?
■ How often/long is there a failure to do work?
○ For tasks, individually or categorised by end-user goal
■ Throughput
■ Response time
■ Concurrency
■ Errors and exceptions
○ For resources
■ Utilisation
■ Backlog/load/demand
■ Errors and exceptions

What should I instrument and how do I do it?


● Convenient blueprints
○ Brendan Gregg — USE (utilisation, saturation, errors)
○ Tom Wilkie — RED (measure Rate, Errors, and Duration of requests)
○ SRE Book’s 4 Golden Signals (latency, traffic, errors, saturation)
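
A sketch of what one pass of the USE method might look like; the resource readings and the utilisation limit are placeholders, and in practice they would come from the OS or the metrics store.

    # USE method: for every resource, check Utilisation, Saturation, Errors.
    resources = {
        "cpu":  {"utilisation": 0.92, "saturation": 5, "errors": 0},   # run-queue length as saturation
        "disk": {"utilisation": 0.40, "saturation": 0, "errors": 3},   # I/O queue depth, device errors
        "net":  {"utilisation": 0.15, "saturation": 0, "errors": 0},   # drops/retransmits as saturation
    }

    def use_check(name, r, util_limit=0.8):
        findings = []
        if r["errors"] > 0:
            findings.append(f"{name}: {r['errors']} errors")           # errors are the cheapest signal
        if r["saturation"] > 0:
            findings.append(f"{name}: saturated (backlog {r['saturation']})")
        if r["utilisation"] > util_limit:
            findings.append(f"{name}: utilisation {r['utilisation']:.0%}")
        return findings

    for name, reading in resources.items():
        for finding in use_check(name, reading):
            print(finding)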
● How to define performance
○ External (customer’s) view
■ Request (singular) and its latency and success
○ Internal (operator’s) view
■ Requests (plural) and their latency distribution, rates, and concurrency
■ System resources/components and their throughput, utilisation, and
backlog
● Observability
○ For requests
■ Log every state transition/change a request makes
■ Emit metrics on aggregates at these points, or at regular intervals
■ Capture traces for distributed tracing
○ Tools
■ https://github.com/VividCortex/pm for API/service processlists
● Works for observability now but not for historical observability
● Metrics (aggregatable), Tracing (request scoped), Logging (events)
● Recommended article on logging
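
A sketch of instrumenting a single request handler for the RED signals (rate, errors, duration) while logging state transitions, using only the standard library; the handler, metric names, and export mechanism are made up.

    import logging
    import time
    from collections import defaultdict

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("checkout")

    # In-process RED counters; a real system would export these to Graphite,
    # Prometheus, etc. at a regular interval.
    metrics = defaultdict(float)

    def handle_request(request_id, work):
        """Wrap one request: count it, time it, count failures, log transitions."""
        metrics["requests_total"] += 1                 # R: rate (export as a per-second rate)
        log.info("request %s: started", request_id)    # state transition -> log line
        start = time.monotonic()
        try:
            return work()
        except Exception:
            metrics["errors_total"] += 1               # E: errors
            log.exception("request %s: failed", request_id)
            raise
        finally:
            duration = time.monotonic() - start
            metrics["duration_seconds_sum"] += duration   # D: duration (feed a histogram in practice)
            log.info("request %s: finished in %.3fs", request_id, duration)

    # Usage
    handle_request("req-1", lambda: "ok")
    print(dict(metrics))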
