
External Notes on Monitoring

High-level takeaways:
● Alerts are not logs — optimise for a false positive rate of zero, even if the false negative
rate is nonzero
● Measure work being performed that affects business value (e.g. CAPS: capacity,
availability, performance, scalability)
● Consider the best type of metric for the job: gauge, counter, meter, histogram, timer
● Use OODA (observe, orient, decide, act) to build a culture of observability
○ You can’t improve what you can’t measure
○ You can’t measure what you can’t observe
○ Decide what you care about and how to measure it
○ Find a way to get the data
● Aim to deliver knowledge, not just information. Start with a hypothesis and ask the
graphs the question; don’t go to a graph and then ask what the numbers mean
● Don’t use static thresholds — tune thresholds properly

Metrics, metrics everywhere


● “Our code generates business value when it runs, not when we write it”
● Map ≠ territory
● Mind the gap between perception and reality
● Measuring code in the wild means measuring code in production
● What to measure
○ What does this code do that affects its business value?
○ How do we measure that?
● 5 types of metrics to consider (a sketch of a reservoir-sampled histogram and an EWMA meter follows at the end of this section)
○ Gauge — the instantaneous value of something (e.g. # of cities)
○ Counters — an incrementing and decrementing value (e.g. # of sessions)
○ Meters — average rate of events over a period of time (e.g. # requests/sec)
■ Care less about the overall average, and more about the change over
time
■ “Average of 1,000 requests per second” is less meaningful than “We went
from 3,000 to 500 requests per second”
■ Exponentially weighted moving average
○ Histograms — statistical distribution of values in a stream of data (e.g. # of cities
returned, 95% of autocomplete results returned 3 cities or fewer)
■ Mean is really useful for normally distributed data, but most data is not
■ Quantiles (median, 75%, 95%, 98%, 99%, 99.9%)
■ Reservoir sampling — keep a statistically representative sample of
measurements as they happen
● Vitter’s Algorithm R uses uniform sampling
● Forward-decaying priority sampling weights recent measurements more heavily (recency bias)
○ Timers — histogram of durations and meter of calls (at ~2,000 requests per
second, our latency jumps from 13ms to 453ms)
● For perspective, each service of Yammer exports 40-50 metrics
● Monitor values for current problems in order to respond in real-time
● Aggregate values for historical perspective to see long-term patterns
● Benefits
○ Go faster and shorten decision-making cycle
○ If we do this faster, we will win
○ Fewer bugs, more features
○ Make better decisions by using numbers
● OODA
○ Observe — What is the 99% latency of our autocomplete service right now?
500ms
○ Orient — How does this compare to other parts of our system, both currently and
historically?
○ Decide — Should we make it faster? Or should we add a new feature X?
○ Act — Write some code
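
A minimal sketch of two of the metric types above, using only the Python standard library: a histogram backed by Vitter's Algorithm R (a uniform reservoir sample of the stream) with a quantile readout, and a simple exponentially weighted moving-average meter. The class names, reservoir size, and smoothing factor are illustrative, not the API of any particular metrics library.

    import math
    import random

    class UniformReservoirHistogram:
        """Histogram that keeps a uniform sample of a stream (Vitter's Algorithm R)."""
        def __init__(self, size=1028):
            self.size = size      # reservoir capacity
            self.values = []
            self.count = 0        # total values seen

        def update(self, value):
            self.count += 1
            if len(self.values) < self.size:
                self.values.append(value)
            else:
                # Each new value replaces a random slot with probability size/count.
                slot = random.randint(0, self.count - 1)
                if slot < self.size:
                    self.values[slot] = value

        def quantile(self, q):
            """Nearest-rank quantile of the sampled values, e.g. q=0.95."""
            if not self.values:
                return None
            ordered = sorted(self.values)
            idx = max(0, min(len(ordered) - 1, math.ceil(q * len(ordered)) - 1))
            return ordered[idx]

    class EWMAMeter:
        """Meter: exponentially weighted moving average of an event rate."""
        def __init__(self, alpha=0.2, interval_seconds=5.0):
            self.alpha = alpha                    # weight given to the newest interval
            self.interval_seconds = interval_seconds
            self.pending = 0
            self.rate = None                      # events per second

        def mark(self, n=1):
            self.pending += n

        def tick(self):
            """Call once per interval, e.g. from a background timer."""
            instant = self.pending / self.interval_seconds
            self.pending = 0
            self.rate = instant if self.rate is None else self.rate + self.alpha * (instant - self.rate)

    # Example: 95th percentile of autocomplete result counts.
    hist = UniformReservoirHistogram()
    for _ in range(10_000):
        hist.update(random.randint(0, 5))
    print("p95 cities returned:", hist.quantile(0.95))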

Crafting performance monitoring tools


● Tools
○ Logster — generate metrics from logfiles
○ Graphite — store and graph metrics
○ Internal graph dashboard for engineers to manually view the impact of changes
○ Nagios — individual threshold monitoring for page performance
■ Important parameters include the monitoring interval for normal vs. poor performance, and the length of time between the start of bad performance and when a notification goes out
■ Service dependencies to only give actionable alerts
● Metrics Etsy tracks
○ Timer
■ Time taken to load page (60s intervals)
● Number of page visits
● Average time
● Median time
● 95th percentile
● Monitoring
○ Evolution of approach
■ 5 slowest pages by 95th percentile
● Not great because it didn't change much from week to week
● Didn’t track the performance of the pages that needed to be most
performant
■ Replaced with a regressions report: page performance in the last 24 hours compared to performance over the previous 7 days (a sketch of this comparison follows at the end of this section)
● Better but still not great since it didn’t catch slow creep
regressions
● Difficult to tune since each page has a different performance
pattern (e.g. seasonality)
● Required a lot of additional investigation and caused alert fatigue
■ Email alerts
● Setting page-specific thresholds to create alert graphs
● Minimise inactionable alerts
● Improved context tools to make investigations easier
■ Recognising performance improvements
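
A rough sketch of the regressions report described above, not Etsy's actual implementation: compare each page's 95th percentile load time over the last 24 hours against the previous 7 days, with a hypothetical per-page threshold because each page has its own performance pattern.

    from statistics import quantiles

    def p95(samples):
        """95th percentile via statistics.quantiles (n=20 gives 19 cut points)."""
        return quantiles(samples, n=20)[18]

    def regression_report(last_24h, prev_7d, page_thresholds, default_ratio=1.25):
        """last_24h / prev_7d map page name -> list of load times in ms.
        page_thresholds maps page name -> allowed ratio of recent p95 to baseline p95."""
        regressions = []
        for page, recent in last_24h.items():
            baseline = prev_7d.get(page)
            if not recent or not baseline:
                continue
            recent_p95, baseline_p95 = p95(recent), p95(baseline)
            allowed = page_thresholds.get(page, default_ratio)
            if recent_p95 > allowed * baseline_p95:
                regressions.append((page, baseline_p95, recent_p95))
        # Worst regressions (largest relative slowdown) first.
        return sorted(regressions, key=lambda r: r[2] / r[1], reverse=True)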

Monitoring challenges
● Measuring business value
○ Customer happiness
■ Time to value
■ Availability
■ Response time
○ Cost efficiency
■ Utilisation
■ Optimisation
■ Automation

Better living through statistics: Monitoring doesn't have to suck


● What is monitoring?
○ Measuring
○ Recording
○ Alerting
○ Visualisation
● Tips on approach
○ Automate the boring parts and do the fun things yourself
○ Alert based on rate of change of time series (e.g. errors per interval)
○ A useful minimum is to look back at 2.5x the sampling interval
○ Alerts are not logs
○ Make sure alerts are actionable, then document them
● Three approaches to monitoring
○ Blackbox monitoring resulting in alerts
■ Treat the system as opaque: you can only use it as a user would
■ Cucumber-Nagios
■ Problems
● Only boolean, no visibility into why (e.g. why is the site slow, why
has image serving stopped working)
● No predictive capability (how long until we need more datacenters)
● No historical data
■ Problems with check-alert model
● Tuning is difficult when thresholds are all different
● Adding new checks is difficult because you need to write new scripts
● Scripts perform measurement and judgment all at once
● Inactionable alerts
● Costs of monitoring systems are hard to scale
○ Whitebox monitoring resulting in charts
■ Expose the internal state of the system for inspection
■ Graphite, New Relic, Metrics
● Types of time series
○ Counters — only go up
■ No loss of meaning when sampling
○ Gauges — can go up and down
■ No idea of values in between sampling, so prefer counters
■ A gauge-like rate can be derived from a counter by differencing successive samples (see the sketch at the end of this section)
● What to measure
○ Queries per second
○ Errors per second (by error type)
○ Latency (by query type, response code, payload size)
○ Bandwidth (by direction, query type, response code)
● What not to measure
○ Load average
● What to alert on
○ Rate of change of queries per second outside normal cycles
○ Ratio of errors to queries
○ Latency (mean, 95th percentile) too high
○ Rate of change of bandwidth
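
A small sketch tying the whitebox points together, with made-up metric names and placeholder thresholds: derive per-second rates from successive counter samples (treating a decrease as a counter reset), then alert on the error-to-query ratio and on a large deviation of the query rate from a baseline, rather than on raw values.

    def rate(prev_count, prev_ts, curr_count, curr_ts):
        """Per-second rate from two counter samples; a decrease means the counter reset."""
        elapsed = curr_ts - prev_ts
        if elapsed <= 0:
            return 0.0
        delta = curr_count - prev_count
        if delta < 0:              # process restarted, counter started over
            delta = curr_count
        return delta / elapsed

    def check_alerts(prev, curr, max_error_ratio=0.01, max_qps_deviation=0.5):
        """prev/curr are dicts with 'ts', 'queries', 'errors' counter samples.
        Thresholds are placeholders and would be tuned per service."""
        qps = rate(prev["queries"], prev["ts"], curr["queries"], curr["ts"])
        eps = rate(prev["errors"], prev["ts"], curr["errors"], curr["ts"])
        alerts = []
        if qps > 0 and eps / qps > max_error_ratio:
            alerts.append(f"error ratio {eps / qps:.2%} exceeds {max_error_ratio:.2%}")
        baseline_qps = curr.get("baseline_qps")    # e.g. same interval last week
        if baseline_qps and abs(qps - baseline_qps) / baseline_qps > max_qps_deviation:
            alerts.append(f"qps {qps:.0f} deviates more than {max_qps_deviation:.0%} from baseline {baseline_qps:.0f}")
        return alerts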

Monitoring is Dead. Long Live Monitoring.


● No major takeaways

What should I monitor and how should I do it?


● Two types of monitoring
○ Health checks
■ Server/service is alive/dead
■ Metrics vs. threshold checks
■ Nagios (old) / Sensu (new)
■ Shortcomings
● Not designed to support a method/process
● Focus on meaningless things
● Passive acceptance of what’s there
● Simplistic (OK/WARN/CRIT)
○ Metrics
■ Graphite / Cacti
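
The OK/WARN/CRIT model can be made concrete with a minimal Nagios/Sensu-style check: plugins conventionally exit 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. The endpoint URL and latency threshold below are placeholders.

    #!/usr/bin/env python3
    # Minimal health check in the Nagios plugin exit-code convention.
    import sys
    import time
    import urllib.request

    OK, WARNING, CRITICAL = 0, 1, 2
    URL = "http://localhost:8080/health"      # placeholder endpoint
    WARN_SECONDS = 2.0                        # placeholder latency threshold

    def main():
        start = time.monotonic()
        try:
            with urllib.request.urlopen(URL, timeout=5) as resp:
                elapsed = time.monotonic() - start
                if resp.status != 200:
                    print(f"CRITICAL: {URL} returned HTTP {resp.status}")
                    return CRITICAL
        except Exception as exc:
            print(f"CRITICAL: {URL} unreachable: {exc}")
            return CRITICAL
        if elapsed > WARN_SECONDS:
            print(f"WARNING: {URL} responded in {elapsed:.2f}s")
            return WARNING
        print(f"OK: {URL} responded in {elapsed:.2f}s")
        return OK

    if __name__ == "__main__":
        sys.exit(main())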
● Maxims
○ You can’t improve what you can’t measure
○ You can’t measure what you can’t observe
○ Decide what you care about and how to measure it
○ Find a way to get the data
● Care about and measure CAPS
○ Capacity
○ Availability
○ Performance
○ Scalability
● Provide observability for operations staff
○ Goal: make it easy (minimal cognitive burden) to observe CAPS across large numbers (thousands) of services, systems, and subsystems
● Consider the past, present, and future
○ Past
■ Inspection
■ Reinspection
■ Comparison between time periods
■ Post-mortems, time comparison, reporting
○ Present
■ Goals: determine health, availability, trend
■ Health: are the systems alive, reachable, responsive, fast, consistent, and
correct?
■ Trend: what’s the current state and where is it headed?
■ Observability, troubleshooting, incident response
○ Future
■ Predict problems (and avoid them?)
■ Avoid downtime (increase availability)
■ Capacity planning, procurement, avoiding downtime
● Support a performance analysis method
○ Goal-driven performance optimisation (Percona)
○ Method R (Cary Millsap)
○ USE Method (Brendan Gregg)
● Omit meaningless data
○ Answer the ‘so what?’ question
○ Deliver knowledge, not just information
○ Start with a hypothesis and ask the graphs the question; don’t go to a graph and
then ask what the numbers mean
○ Anticipate the user’s need (support a method)
○ Make the data available, but don’t lead with it
● Support ad-hoc inspection, active alerts, and periodic reports
○ 3 ways to deliver information
■ Information on request
■ Information on triggers (events, incidents)
■ Scheduled reports
○ Alerts are most important to get right because they have the highest cost
■ Prevent needless investigation
■ Include the relevant ‘so what?’ data
● Fault detection
○ React to faults by capturing more information
○ Often more important to be able to conduct an investigation after-the-fact, not to
alert and try to fix in real-time
○ False positives
■ Avoid at all costs
○ False negatives
■ It’s ok to have some
○ Static thresholds are terrible because they lead to high false-positive and false-negative rates
○ Anomaly detection can be very noisy
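
One common way to trade a few extra false negatives for fewer false positives is to require several consecutive bad intervals before paging; a minimal sketch, with an illustrative threshold and interval count:

    class ConsecutiveBreachAlert:
        """Fire only after `required` consecutive samples breach the threshold,
        so a single noisy interval does not page anyone."""
        def __init__(self, threshold, required=3):
            self.threshold = threshold
            self.required = required
            self.breaches = 0

        def observe(self, value):
            if value > self.threshold:
                self.breaches += 1
            else:
                self.breaches = 0
            return self.breaches >= self.required   # True -> alert

    alert = ConsecutiveBreachAlert(threshold=0.05, required=3)  # e.g. 5% error ratio
    for sample in [0.02, 0.09, 0.01, 0.08, 0.07, 0.11]:
        if alert.observe(sample):
            print("page on-call: error ratio elevated for 3 consecutive intervals")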
● Capture raw data in high resolution
○ Don’t predigest data, but go directly to the source
● Performance can be divided into task performance and resource performance
○ Task: a user task (e.g. adding an item to a shopping cart)
■ Measuring task performance
● Good performance is consistently fast performance
● Performance = response time
● Measure SLA (e.g. 95% of tasks take <1s) and hard thresholds
(e.g. no tasks take >5s)
○ The SLA is the acceptable response time for tasks, in seconds, as a 95th or 99th percentile over intervals of time
○ The SLA interval should be short: over a long interval, the 5% that is allowed to be slow adds up (5% of a day is 72 minutes, while 5% of a 5-minute window is only 15 seconds)
● Metrics
○ Little's Law: Concurrency = Throughput x Response Time (see the worked sketch below)
○ Queueing theory: Response Time = Service Time + Wait Time
○ Error rate
■ Improving task performance
● Classify the work by user goals, and generate a profile (aggregate,
rank, sort)
● Determine what’s worth improving
● Look for outliers and poor experiences
● Goals
○ Improve user-visible response time & consistency
○ Reduce server load; increase throughput
○ Reduce collateral damage
○ Resource (e.g. server, disk)
■ Measuring resource performance
● Time elapsed
● Throughput
● Work-in-flight (current concurrency)
● Total response time
● Busy time
● Availability: the absence of downtime
○ MTBF (Mean Time Between Failures) vs. MTTR (Mean
Time To Recovery)
○ Diagnosing stalls (short faults)
■ Transient, partial faults are common
■ Was the resource overloaded or underperforming?
■ Measure the sources of load/demand
■ Measure utilisation, backlog, errors/exceptions
● Capacity
○ Hard limit or soft limit?
○ Quantify capacity limit
○ Forecast when it will be reached
○ Server capacity
■ Max throughput with good performance (meeting
SLA)
■ Universal scalability law
● Metrics
○ Availability
■ MTBF, MTTR
■ % as determined by SLA
○ Capacity
■ Backlog (load, queue length)
■ Max capacity as per USL
■ Max provided capacity (e.g. disk size, CPU cycles
per second)
○ Scalability
■ Universal scalability law
○ Utilisation
■ Utilisation Law: Utilisation = Busy Time /
Observation Time
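
A worked sketch of the formulas above with made-up numbers; the Universal Scalability Law coefficients would normally be fitted from measured throughput, not assumed.

    # Little's Law: Concurrency = Throughput x Response Time
    throughput = 1000            # requests per second
    response_time = 0.050        # seconds
    print("in-flight requests:", throughput * response_time)        # 50.0

    # Queueing theory: Response Time = Service Time + Wait Time
    service_time = 0.020
    wait_time = response_time - service_time                        # 0.030 s spent queued

    # Availability from MTBF and MTTR (a common approximation)
    mtbf_hours, mttr_hours = 720.0, 1.0
    availability = mtbf_hours / (mtbf_hours + mttr_hours)
    print(f"availability: {availability:.4%}")                      # ~99.86%

    # Utilisation Law: Utilisation = Busy Time / Observation Time
    busy_time, observation_time = 45.0, 60.0
    print("utilisation:", busy_time / observation_time)             # 0.75

    # Universal Scalability Law: relative capacity at N nodes,
    # C(N) = N / (1 + alpha*(N - 1) + beta*N*(N - 1))
    def usl(n, alpha=0.03, beta=0.0005):     # contention and coherency penalties (illustrative)
        return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

    for n in (1, 8, 32, 64):
        print(f"relative capacity at {n} nodes: {usl(n):.1f}")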
● What to measure
○ Think about what we want to measure vs. what’s easy to measure
○ Questions to ask
■ Is the work getting done?
■ How fast, and how consistently?
■ How often are there errors?
■ How much work is queued?
■ How much work can we request?
■ How often/long is there a failure to do work?
○ For tasks, individually or categorised by end-user goal
■ Throughput
■ Response time
■ Concurrency
■ Errors and exceptions
○ For resources
■ Utilisation
■ Backlog/load/demand
■ Errors and exceptions

What should I instrument and how do I do it?


● Convenient blueprints
○ Brendan Gregg — USE (utilisation, saturation, errors)
○ Tom Wilkie — RED (measure Rate, Errors, and Duration of requests)
○ SRE Book’s 4 Golden Signals (latency, traffic, errors, saturation)
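
A sketch of what one pass of the USE method might look like; the resource readings and the utilisation limit are placeholders, and in practice they would come from the OS or the metrics store.

    # USE method: for every resource, check Utilisation, Saturation, Errors.
    resources = {
        "cpu":  {"utilisation": 0.92, "saturation": 5, "errors": 0},   # run-queue length as saturation
        "disk": {"utilisation": 0.40, "saturation": 0, "errors": 3},   # I/O queue depth, device errors
        "net":  {"utilisation": 0.15, "saturation": 0, "errors": 0},   # drops/retransmits as saturation
    }

    def use_check(name, r, util_limit=0.8):
        findings = []
        if r["errors"] > 0:
            findings.append(f"{name}: {r['errors']} errors")           # errors are the cheapest signal
        if r["saturation"] > 0:
            findings.append(f"{name}: saturated (backlog {r['saturation']})")
        if r["utilisation"] > util_limit:
            findings.append(f"{name}: utilisation {r['utilisation']:.0%}")
        return findings

    for name, reading in resources.items():
        for finding in use_check(name, reading):
            print(finding)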
● How to define performance
○ External (customer’s) view
■ Request (singular) and its latency and success
○ Internal (operator’s) view
■ Requests (plural) and their latency distribution, rates, and concurrency
■ System resources/components and their throughput, utilisation, and
backlog
● Observability
○ For requests
■ Log every state transition/change a request makes
■ Emit metrics on aggregates at these points, or at regular intervals
■ Capture traces for distributed tracing
○ Tools
■ https://github.com/VividCortex/pm for API/service processlists
● Works for observability now but not for historical observability
● Metrics (aggregatable), Tracing (request scoped), Logging (events)
● Recommended article on logging
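
A sketch of instrumenting a single request handler for the RED signals (rate, errors, duration) while logging state transitions, using only the standard library; the handler, metric names, and export mechanism are made up.

    import logging
    import time
    from collections import defaultdict

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("checkout")

    # In-process RED counters; a real system would export these to Graphite,
    # Prometheus, etc. at a regular interval.
    metrics = defaultdict(float)

    def handle_request(request_id, work):
        """Wrap one request: count it, time it, count failures, log transitions."""
        metrics["requests_total"] += 1                 # R: rate (export as a per-second rate)
        log.info("request %s: started", request_id)    # state transition -> log line
        start = time.monotonic()
        try:
            return work()
        except Exception:
            metrics["errors_total"] += 1               # E: errors
            log.exception("request %s: failed", request_id)
            raise
        finally:
            duration = time.monotonic() - start
            metrics["duration_seconds_sum"] += duration   # D: duration (feed a histogram in practice)
            log.info("request %s: finished in %.3fs", request_id, duration)

    # Usage
    handle_request("req-1", lambda: "ok")
    print(dict(metrics))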
