Prom Notes
1. Prometheus Fundamentals
Introduction to Prometheus
What is Prometheus?
Prometheus collects metrics by scraping HTTP endpoints exposed by target applications and
storing this data in its time-series database (TSDB). Users can then query these metrics using
PromQL (Prometheus Query Language) to create dashboards and set up alerts.
Prometheus Architecture
The architecture consists of multiple components working together:
1. Storage
At its core is the time-series database (TSDB). Each metric is stored as a time series identified by:
o A metric name (e.g., http_requests_total)
o A set of labels (key-value pairs like method="GET", status="200")
Storage is both in-memory and on-disk, using a write-ahead log (WAL) for durability.
2. Scraping
3. Alerting
4. Exporters
5. Service Discovery
Installing the kube-prometheus-stack Helm chart (covered in the Kubernetes section later in these notes) gives you a full Prometheus setup with Grafana, Alertmanager, and Node Exporter.
1. Basic Query
http_requests_total
2. Filtering by Labels
http_requests_total{method="GET"}
3. Aggregation Operators
sum(http_requests_total)
4. Rate Calculation
Find the per-second rate of HTTP requests over the last 5 minutes:
rate(http_requests_total[5m])
5. Alerting Example
PromQL expressions such as up == 0 form the basis of alerting rules; see the InstanceDown rule in the Alertmanager section below for a complete example.
By default, Prometheus stores data for 15 days in the data/ directory. To modify retention, use:
--storage.tsdb.retention.time=30d
global:
  scrape_interval: 15s
  evaluation_interval: 30s
storage:
  tsdb:
    retention: 30d
Conclusion
Key Takeaways
With this foundation, you should be well-prepared to answer interview questions on Prometheus fundamentals.
Prometheus is one of the most widely used monitoring and alerting systems, especially in
cloud-native and containerized environments. It works by pulling (scraping) metrics from
various sources, storing them in a time-series database, and allowing for queries and alerting
based on those metrics. The sections below cover each core topic in depth, starting with metric types.
1. Metric Types
1.1 Counter
Definition: A counter is a cumulative metric that only increases (or resets to zero upon
restart).
Use Case: Ideal for tracking things like the number of HTTP requests, total errors, or
processed jobs.
Example:
http_requests_total{method="GET", status="200"} 1027
o This metric shows that 1027 GET requests have been processed successfully.
Common Interview Questions:
o What happens when a counter resets?
It starts back at zero, which can happen due to a process restart.
o Can a counter decrease?
No. Its raw value only drops when it resets to zero after a restart; PromQL functions such as rate() compensate for such resets.
1.2 Gauge
Definition: A gauge is a metric that can increase or decrease, representing values like
CPU usage, memory usage, or queue length.
Use Case: Suitable for values that fluctuate over time, such as temperature readings or
the number of active connections.
Example:
node_memory_usage_bytes 2534912
o This could represent the current memory usage in bytes.
Common Interview Questions:
o How is a gauge different from a counter?
A gauge can go up and down, whereas a counter only increases.
o What happens if a process dies and restarts? Does a gauge reset?
Yes, it resets unless the value is restored externally.
1.3 Histogram
Definition: A histogram samples observations (e.g., request durations) and counts them
in configurable buckets.
Use Case: Best for tracking request latency or response sizes.
Example:
http_request_duration_seconds_bucket{le="0.1"} 24054
http_request_duration_seconds_bucket{le="0.5"} 33456
http_request_duration_seconds_bucket{le="1"} 40034
http_request_duration_seconds_bucket{le="+Inf"} 50000
o This means (buckets are cumulative, so each count includes all smaller buckets):
24,054 requests completed in under 0.1 seconds.
33,456 in under 0.5 seconds.
40,034 in under 1 second.
50,000 requests in total.
Common Interview Questions:
o What is the main drawback of histograms?
Choosing appropriate bucket sizes is tricky. Too many buckets increase
storage needs, while too few make data less useful.
o How can you derive an average from a histogram?
Use rate(http_request_duration_seconds_sum[5m]) /
rate(http_request_duration_seconds_count[5m]).
1.4 Summary
Definition: A summary provides precomputed percentiles (e.g., 50th, 95th, 99th)
instead of histogram buckets.
Use Case: Useful when you need quantiles like median, 95th percentile response time.
Example:
http_request_duration_seconds{quantile="0.5"} 0.35
http_request_duration_seconds{quantile="0.9"} 0.9
http_request_duration_seconds{quantile="0.99"} 1.2
o The 50th percentile (median) request duration is 0.35 seconds.
o The 99th percentile is 1.2 seconds.
Common Interview Questions:
o What is the difference between histograms and summaries?
Histograms allow you to compute percentiles at query time, while
summaries precompute them at ingestion.
o Why are summaries harder to aggregate?
Precomputed quantiles cannot be merged across multiple instances.
2. Scrape Configuration
Targets are defined under scrape_configs in prometheus.yml. Example using Kubernetes service discovery:
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
Common Interview Questions:
o How does Prometheus scrape metrics?
It pulls from HTTP endpoints at specified intervals.
o What happens if a target becomes unreachable?
Its up metric drops to 0 and its series are marked stale (they stop appearing in instant queries after about 5 minutes); historical samples remain until the retention period expires.
3. Instrumenting Applications
3.1 Python Example
from prometheus_client import Counter, start_http_server

# Define and register a counter metric
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests')

def process_request():
    REQUEST_COUNT.inc()

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics on port 8000
    while True:
        process_request()
3.2 Go Example
package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// requests is the counter registered and exposed via /metrics
var requests = prometheus.NewCounter(prometheus.CounterOpts{
    Name: "http_requests_total",
    Help: "Total HTTP requests",
})

func main() {
    prometheus.MustRegister(requests)
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8000", nil)
}
4.2 cAdvisor
cAdvisor exposes container-level metrics (CPU, memory, network, filesystem) for running containers; in Kubernetes it is built into the kubelet (see the Kubernetes section below).
Relabeling can filter targets before scraping (relabel_configs) or drop individual series at ingestion (metric_relabel_configs). Example:
relabel_configs:
  - source_labels: [__address__]
    regex: ".*:8080"
    action: drop
metric_relabel_configs:
  - source_labels: [__name__]
    regex: "http_requests_total"
    action: drop
That covers metrics collection in Prometheus.
2. Aggregation Operators
Aggregation operators allow summarizing data across dimensions.
Operator             Description
sum                  Total sum of values
avg                  Average value
min                  Minimum value
max                  Maximum value
count                Number of time series
count_values         Count of occurrences of each distinct value
topk(N, metric)      Top N highest values
bottomk(N, metric)   Bottom N lowest values
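For example, aggregation is often combined with rate(). A sketch (metric name reused from the examples above) that returns the five busiest instances:
topk(5, sum by (instance) (rate(http_requests_total[5m])))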
Recording rules precompute and store query results to optimize dashboards and reduce query
time.
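A minimal sketch of a recording rule file (rule and metric names are illustrative):
groups:
  - name: http_recording_rules
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))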
5. Advanced Functions
5.1. predict_linear()
Predicts a series' value at a future time using simple linear regression over the range. Example: forecast the value 30 minutes (1800 s) ahead based on the last hour:
predict_linear(node_cpu_seconds_total[1h], 1800)
5.2. holt_winters()
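holt_winters() smooths a range vector using double exponential smoothing; its second and third arguments are the smoothing and trend factors (both between 0 and 1). The original example is missing; an illustrative call using the gauge from earlier:
holt_winters(node_memory_usage_bytes[1h], 0.5, 0.5)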
5.3. resets()
Counts how many times a counter reset within the range:
resets(node_network_transmit_bytes_total[1h])
By internalizing these concepts and practicing queries, you’ll be well-prepared to handle any
PromQL-related interview question with confidence!
To install Alertmanager, you can either download the binary, install it using a package manager,
or run it as a Docker container.
Step 1: Download and Install Alertmanager
wget https://github.com/prometheus/alertmanager/releases/latest/download/alertmanager-linux-amd64.tar.gz
tar -xvf alertmanager-linux-amd64.tar.gz
cd alertmanager-*
./alertmanager --version
Alertmanager’s configuration file, alertmanager.yml, defines how alerts are processed and where
notifications are sent.
route:
  receiver: 'email-notifications'
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h

receivers:
  - name: 'email-notifications'
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.example.com:587'
        auth_username: 'user'
        auth_password: 'password'
./alertmanager --config.file=alertmanager.yml
Or if using Docker:
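The command itself is missing here; a typical invocation with the official prom/alertmanager image, mounting the config file created above, would be:
docker run -d --name alertmanager -p 9093:9093 \
  -v $(pwd)/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  prom/alertmanager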
Alerting rules are stored in a YAML file (e.g., alert.rules.yml) and specified in the Prometheus
configuration.
Example:
groups:
  - name: instance_down
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "The instance {{ $labels.instance }} has been down for more than 5 minutes."
Modify prometheus.yml:
rule_files:
  - "alert.rules.yml"
Restart Prometheus (or trigger a configuration reload) so the new rule file is picked up.
The route section in alertmanager.yml determines how alerts are processed and delivered.
Example:
route:
  receiver: 'default'
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack'
Email
receivers:
  - name: 'email-notifications'
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.example.com:587'
        auth_username: 'user'
        auth_password: 'password'
Slack
receivers:
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        api_url: 'https://hooks.slack.com/services/XXXXX/XXXXX/XXXXX'
        title: "{{ .CommonLabels.alertname }}"
        text: "{{ .CommonAnnotations.description }}"
PagerDuty
receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'your-pagerduty-service-key'
Webhooks
receivers:
  - name: 'webhook'
    webhook_configs:
      - url: 'http://your-webhook-endpoint.com'
Inhibitions
Inhibition prevents lower-severity alerts from being sent when a higher-severity alert is active.
Example configuration:
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname']
This ensures that if a critical alert is active, the corresponding warning alert is suppressed.
High availability is configured with command-line flags rather than in alertmanager.yml: start each instance with its peers listed (the cluster gossip protocol listens on port 9094 by default):
./alertmanager --config.file=alertmanager.yml \
  --cluster.listen-address="0.0.0.0:9094" \
  --cluster.peer="alertmanager-1:9094" \
  --cluster.peer="alertmanager-2:9094"
Or, when using Docker, append the same --cluster.* flags after the image name in the docker run command shown earlier.
curl http://localhost:9093/api/v2/status
Conclusion
Alertmanager handles alerts from Prometheus and routes them to various notification
channels.
Alerting rules define when alerts should fire.
Routing rules determine how alerts are grouped and sent.
Silences and inhibitions help reduce alert fatigue.
High availability is achieved by clustering multiple Alertmanager instances.
Types of Federation
1. Hierarchical Federation
o A "parent" Prometheus instance collects summarized metrics from multiple "child"
instances.
o Example:
Each region (child Prometheus) scrapes local metrics.
A central/global Prometheus scrapes these child Prometheus instances (see the scrape config sketch after this list).
2. Cross-Service Federation
o Used when multiple Prometheus instances collect different types of data.
o Example:
One instance collects infrastructure metrics (CPU, RAM).
Another collects application-level metrics.
A central Prometheus queries both.
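In either setup, the parent scrapes the children's /federate endpoints. A minimal sketch of the parent's scrape config (hostnames and match[] selectors are placeholders):
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 'child-prometheus-1:9090'
          - 'child-prometheus-2:9090'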
Sharding
Splitting metric collection across multiple Prometheus instances to distribute the load.
Typically done using hashmod sharding on the instance or job labels.
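A sketch of hashmod sharding with two shards; each Prometheus instance keeps only the targets whose hash matches its shard number:
relabel_configs:
  - source_labels: [__address__]
    modulus: 2
    target_label: __tmp_hash
    action: hashmod
  - source_labels: [__tmp_hash]
    regex: "0"   # shard 0 keeps these targets; the second instance uses "1"
    action: keep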
1. Thanos
3. Grafana Mimir
Deploy multiple Prometheus instances and use HAProxy or Nginx for load balancing.
upstream prometheus {
    server prometheus-1:9090;
    server prometheus-2:9090;
}

server {
    listen 80;

    location / {
        proxy_pass http://prometheus;
    }
}
5. Troubleshooting Prometheus
Prometheus troubleshooting involves diagnosing scraping issues, high memory usage, missing
metrics, and query performance issues.
1. Scraping Issues
3. Slow Queries
Final Thoughts
By mastering these concepts, you should be well-prepared for interview questions on Scaling and Performance Optimization in Prometheus.
Grafana is a powerful open-source observability and data visualization platform that allows
users to query, visualize, alert, and analyze metrics from various data sources. It is widely used
in DevOps, system monitoring, and analytics.
1. Multi-Source Support:
o Connects to multiple backends like Prometheus, Graphite, Loki, Elasticsearch, InfluxDB,
etc.
2. Flexible Dashboards:
o Create customizable dashboards using a variety of panel types, visualizations, and
layouts.
3. Powerful Query Editor:
o Allows users to filter and aggregate data dynamically using PromQL, SQL, and other
query languages.
4. Alerting and Notifications:
o Define alerts and receive notifications via Slack, PagerDuty, Email, Webhooks, and more.
5. User Authentication & Role-Based Access Control (RBAC):
o Supports user authentication via LDAP, OAuth, and Grafana Enterprise.
6. Templating and Variables:
o Use dashboard variables to create dynamic and reusable dashboards.
7. Annotations:
o Highlight important events directly on graphs for better correlation.
8. Plug-ins and Extensions:
o Custom plug-ins extend functionality, including new panel types and data sources.
9. Enterprise and Cloud Versions:
o Offers advanced features like reporting, data security, and support.
Installation Methods
On Linux (Ubuntu/Debian)
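The commands are missing here; a commonly used sequence (recent Grafana docs use apt.grafana.com with a keyring file, so the exact repository setup may differ) is:
sudo apt-get install -y apt-transport-https software-properties-common
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
echo "deb https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update && sudo apt-get install -y grafana
sudo systemctl start grafana-server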
On Windows
1. Download the Grafana Windows MSI Installer from the official Grafana website.
2. Run the installer and follow the installation steps.
3. Start the service via:
net start grafana
Using Docker
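The command is missing here; a minimal run with the official grafana/grafana image:
docker run -d --name=grafana -p 3000:3000 grafana/grafana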
Accessing Grafana
Grafana listens on http://localhost:3000 by default; the initial login is admin/admin, and you are prompted to change the password on first sign-in.
Panels
Queries
Final Thoughts
Understanding Grafana requires hands-on practice. If you can confidently explain and demo these concepts in an interview, you will have no problem answering Grafana-related questions.
Grafana is a powerful open-source observability tool used for monitoring and visualizing
metrics, logs, and traces. To master advanced Grafana usage, you need a deep understanding
of its querying capabilities, data transformations, templating, annotations, alerting, and
integrations with Loki (for logs) and Tempo (for distributed tracing). Below is an in-depth
breakdown of each topic:
1. Advanced Queries & Data Transformations
Grafana supports multiple data sources such as Prometheus, InfluxDB, Elasticsearch, MySQL,
and more. Understanding how to write advanced queries and transform data is crucial for
creating insightful dashboards.
Advanced Queries
Queries vary depending on the data source; one common pattern is sketched below.
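For example, with a Prometheus data source, a 95th-percentile latency panel combines rate(), aggregation, and histogram_quantile() (bucket metric name reused from earlier in these notes):
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))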
Data Transformations
Data transformations allow you to manipulate data within Grafana, making it easier to visualize.
Common transformations include:
1. Add field from calculation: Create new fields by applying formulas on existing data.
2. Merge & Join tables: Combine multiple queries to show comparative data.
3. Group by and Aggregate: Summarize data by categories.
4. Filter Data by Value: Remove unwanted data points.
5. Pivot tables: Restructure tabular data to suit visualization needs.
Example: If you retrieve system metrics but want to calculate a "CPU Utilization %" from
cpu_used / cpu_total * 100, you can use "Add field from calculation."
2. Variables and Templating
Variables and templating allow you to create dynamic dashboards that adjust based on user
selection.
Defining Variables
Types of Variables
Example (PromQL):
rate(http_requests_total{pod="$pod"}[5m])
Example (SQL):
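The query itself is missing; a hypothetical example against an illustrative system_metrics table, filtered by a $host dashboard variable:
SELECT time, cpu_usage
FROM system_metrics
WHERE host = '$host'
ORDER BY time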
Chained Variables
Alerts
Grafana supports alert rules that trigger notifications based on threshold conditions.
Defining an Alert
Example: fire when average idle CPU falls below 20% (node_cpu_seconds_total is a counter, so rate() converts it to a per-second idle fraction first):
100 * avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) < 20
Alerting Components
Loki (Log Aggregation)
Labels instead of indexing: unlike Elasticsearch, Loki indexes only labels and filters log content at query time.
Efficient storage: compressed log chunks reduce storage costs.
Seamless integration with Grafana.
Basic Queries:
o Retrieve logs for a specific job:
o {job="nginx"} |= "error"
o Retrieve logs where response time > 1s:
o {job="nginx"} | json | duration > 1000
Aggregation Queries:
o Count logs per service:
o count_over_time({job="nginx"}[5m])
o Rate of 5xx responses:
o rate({job="nginx"} | json | status >= 500 [5m])
Tracing Concepts
Querying Traces
Kubernetes is a dynamic system with pods, nodes, and services that constantly scale up or
down. Monitoring is crucial for:
Prometheus is an open-source monitoring and alerting toolkit designed for Kubernetes, while
Grafana is a visualization tool that helps interpret the collected metrics. Together, they form a
powerful monitoring solution.
Prometheus follows a pull-based architecture where it scrapes metrics from targets (nodes,
pods, services) via HTTP endpoints. It relies on exporters to collect data and stores it in a time-
series database.
Key Prometheus Components
3. Setting Up kube-prometheus-stack
kube-prometheus-stack is a Helm chart that bundles Prometheus, Grafana, and Alertmanager
for easy deployment in Kubernetes. It includes:
Installation Steps
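The commands are missing here; a typical installation (release name and namespace are placeholders):
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack -n monitoring --create-namespace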
This deploys:
Prometheus
Grafana
Alertmanager
Kubernetes Metrics Exporters
The Node Exporter collects OS-level metrics like CPU, memory, disk usage.
cAdvisor (Container Advisor) collects container-level metrics. It runs as part of the kubelet and exposes the /metrics/cadvisor endpoint.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: web
      interval: 30s
groups:
  - name: instance-down
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance down"
Prebuilt Dashboards
Import Kubernetes monitoring dashboards from Grafana Labs (ID: 6417 for kube-prometheus-
stack).
Q1: How does Prometheus discover and scrape targets in Kubernetes?
Prometheus uses service discovery to find targets (pods, nodes, exporters) and scrapes them via HTTP /metrics endpoints.
Q2: What is the difference between Metrics Server and Prometheus?
Metrics Server provides real-time resource metrics (CPU, memory) for autoscaling.
Prometheus provides long-term monitoring, custom metrics, and alerting.
Q3: How is alerting set up?
Alerts are configured via alerting rules in Prometheus, which trigger notifications via Alertmanager.
Observability Best Practices
Observability rests on three pillars: logs, metrics, and traces. To achieve true observability, all three must be correlated and analyzed effectively. Let's go step by step through each best practice.
Individually, logs, metrics, and traces give partial insights. Together, they allow for deeper
analysis and faster debugging.
4. OpenTelemetry Integration
What is OpenTelemetry?
A vendor-neutral framework for instrumenting, collecting, and exporting traces, metrics, and
logs.
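As a minimal sketch of manual instrumentation with the OpenTelemetry Python SDK (span and service names are illustrative; requires the opentelemetry-sdk package):
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print spans to stdout; swap in an OTLP exporter to ship them to a collector,
# Tempo, or Jaeger in a real deployment
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))

tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("process-order"):
    pass  # application work happens inside the span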
Conclusion
By following these best practices, you’ll be able to set up a comprehensive observability stack
that provides deep insights into system performance, improves debugging, and ensures
security.
What are the key metrics? (CPU usage, request latency, error rates, throughput, etc.)
What components need monitoring? (Database, application server, message queues, caches)
What are the Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level
Agreements (SLAs)?
Prometheus setup: Deploy Prometheus to scrape metrics from applications and infrastructure.
Loki for logs: Aggregate logs from application components.
Jaeger for distributed tracing: Trace requests as they move through microservices.
Grafana for visualization: Create dashboards to monitor system health.
Use client libraries (e.g., Prometheus client for Python, Go, Java) to expose application metrics.
Implement distributed tracing to track requests across microservices.
Enable log shipping using Fluentd or Logstash.
Define Prometheus alert rules (e.g., alert when CPU usage > 80% for 5 minutes; see the example rule after this list).
Integrate with Slack, PagerDuty, or email for notifications.
Set up Grafana Alerting for real-time issue detection.
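A sketch of the CPU rule mentioned above (the expression assumes node_exporter metrics):
groups:
  - name: cpu-alerts
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 80% on {{ $labels.instance }}"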
1. Data Collection
o Use Prometheus to scrape and store metrics.
o Use Fluentd or Logstash to collect logs.
o Use OpenTelemetry to collect traces.
2. Data Storage & Processing
o Store metrics in Prometheus or InfluxDB.
o Store logs in Elasticsearch or Loki.
o Store traces in Jaeger or Zipkin.
3. Data Visualization & Analysis
o Use Grafana for centralized dashboards.
o Correlate logs, traces, and metrics in one view.
4. Alerting & Incident Management
o Define Prometheus alert rules.
o Integrate with Alertmanager for notifications.
o Use AI/ML-based anomaly detection for automated insights.
Prometheus relies on exporters to collect metrics from services that don’t expose them
natively. Sometimes, standard exporters (like Node Exporter, MySQL Exporter) don’t fit specific
needs, so you must build a custom exporter.
Example: Suppose you want to monitor cache hit rates in a custom Redis-like system.
Required metrics:
o cache_requests_total
o cache_hits_total
o cache_miss_total
o cache_hit_ratio
# Define metrics
cache_hits = Counter('cache_hits_total', 'Total number of cache hits')
cache_requests = Counter('cache_requests_total', 'Total number of cache requests')
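To turn these definitions into a working exporter, here is a minimal sketch; record_lookup() is a hypothetical hook the cache would call on every lookup, and the hit ratio is usually derived at query time (e.g., rate(cache_hits_total[5m]) / rate(cache_requests_total[5m])) rather than exported directly:
import time
from prometheus_client import Counter, start_http_server

# Repeating the definitions above so this sketch is self-contained
cache_requests = Counter('cache_requests_total', 'Total number of cache requests')
cache_hits = Counter('cache_hits_total', 'Total number of cache hits')
cache_misses = Counter('cache_miss_total', 'Total number of cache misses')

def record_lookup(hit: bool):
    # Hypothetical hook: called by the cache on every lookup
    cache_requests.inc()
    if hit:
        cache_hits.inc()
    else:
        cache_misses.inc()

if __name__ == '__main__':
    start_http_server(8000)  # matches the scrape target below
    while True:
        time.sleep(1)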
scrape_configs:
  - job_name: 'custom_cache_exporter'
    static_configs:
      - targets: ['localhost:8000']
Conclusion
Now, you have deep expertise in:
1. Building a complete monitoring stack using Prometheus, Loki, Jaeger, and Grafana.
2. Setting up a centralized observability platform for microservices.
3. Creating custom exporters to monitor unique application behavior.
4. Using Prometheus & Grafana to troubleshoot performance issues.