Prometheus Q&A
Here are 15 real-world Prometheus work scenarios with answers in the first
person, as if I were answering in an interview.
1. A critical production service is running slow. How do you start debugging with
Prometheus?
First, I check the Prometheus UI or Grafana dashboards for key metrics like CPU, memory, and
request latency. I use PromQL queries such as:
rate(http_requests_total[5m])
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))
If a new service needs to be monitored, I ensure it exposes a /metrics endpoint and configure Prometheus to scrape it by updating prometheus.yml:
scrape_configs:
  - job_name: 'my-service'
    static_configs:
      - targets: ['my-service:8080']
Then I verify the endpoint responds:
curl http://my-service:8080/metrics
If a target shows as down, I first check the Targets tab in Prometheus (http://prometheus:9090/targets). I then verify network connectivity, firewall rules, and that the service still exposes /metrics. If it's a scrape interval or timeout issue, I increase logging and manually test with:
curl http://my-service:8080/metrics
To alert on sustained high CPU usage, I define an alerting rule:
groups:
  - name: cpu_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "CPU usage is above 80% for 5 minutes"
If Prometheus itself starts consuming too much memory, I optimize retention settings or offload data to Thanos.
rate() calculates the average per-second rate over a time range. Best for long-term
trends.
irate() uses only the two most recent data points in the window, so it's better for spotting short-term spikes.
Example: If I want to analyze HTTP request trends, I use:
rate(http_requests_total[5m])
irate(http_requests_total[1m])
High cardinality (too many unique labels) slows down queries and increases memory usage.
I address this by dropping or relabeling high-cardinality labels (such as user_id or request_id) with metric_relabel_configs and by avoiding per-request label values.
To monitor a database such as MySQL, I deploy mysqld_exporter and add a scrape job:
scrape_configs:
  - job_name: 'mysql'
    static_configs:
      - targets: ['mysql-exporter:9104']
I can then query exporter metrics such as:
mysql_global_status_threads_running
To expose custom application metrics, I use the Prometheus Python client:
from prometheus_client import Counter, start_http_server

REQUEST_COUNT = Counter("http_requests_total", "Total HTTP requests")

def handle_request():
    REQUEST_COUNT.inc()  # increment on every handled request

start_http_server(8000)  # serves /metrics on port 8000
For latency percentiles, I query the histogram buckets:
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
Final Thoughts
With these answers, you should be well-prepared for any Prometheus interview! Let me know
if you want more deep dives.
Here are 15 real-world work scenario questions related to Prometheus metrics collection,
along with first-person answers to help you confidently handle an interview.
I first check if the service is listed in prometheus.yml under scrape_configs. If it's dynamically
discovered, I use the targets API (/api/v1/targets) to verify if it's registered. If not, I inspect service
discovery settings. If it's listed but unhealthy, I check firewall rules, network connectivity, and
ensure the service exposes /metrics. Finally, I check logs for Prometheus and the service itself.
I’d first check if the application has built-in Prometheus support. If not, I’d implement a custom
exporter using Python (prometheus_client) or Go (prometheus/client_golang). I’d expose an HTTP
endpoint (/metrics), register relevant metrics (Counters, Gauges, Histograms, or Summaries), and
update prometheus.yml to scrape it.
3. You notice high cardinality metrics are consuming too much memory. What do
you do?
I’d analyze which labels contribute to high cardinality using count by (__name__) and count by
(label_name). If unnecessary labels (like user_id, request_id) exist, I’d relabel or drop them using
metric_relabel_configs. If needed, I’d use histograms instead of summaries to aggregate data
more efficiently.
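As a minimal sketch of the relabeling approach (the label names user_id and request_id are illustrative), the drop could look like this inside the scrape job:
metric_relabel_configs:
  - regex: 'user_id|request_id'
    action: labeldrop
Removing a label can collapse previously distinct series, so I'd only apply it to labels that aren't needed for aggregation.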
4. The team wants to monitor API latency. Which metric type should I use?
I’d use a Histogram since it captures the distribution of latencies, allowing us to compute
percentiles later. I’d set appropriate buckets to ensure meaningful analysis, like:
http_request_duration_seconds = prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Buckets: prometheus.DefBuckets,
})
For dynamic environments like Kubernetes, I rely on service discovery:
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
Then, I'd label services properly and use relabel_configs to filter the right workloads.
Counters should only increase, so if I see a drop, it's likely due to a process restart. I’d check if
the application restarts frequently using up metrics or container logs. If needed, I’d persist
counters externally to avoid resets.
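To confirm that the drops line up with restarts, a quick check is (metric name illustrative):
resets(http_requests_total[1h])
A non-zero result means the counter reset at least once in that window; rate() and increase() already compensate for such resets.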
I’d use a Prometheus exporter. For example, if it’s PostgreSQL, I’d deploy postgres_exporter:
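As a rough sketch (image tag, credentials, and connection string are placeholders), a Docker-based deployment might look like:
docker run -d -p 9187:9187 \
  -e DATA_SOURCE_NAME="postgresql://monitor:password@db-host:5432/postgres?sslmode=disable" \
  prometheuscommunity/postgres-exporter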
Then, I’d configure Prometheus to scrape localhost:9187/metrics and create alerts on slow queries.
I’d deploy cAdvisor to collect per-container CPU, memory, and disk usage:
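A typical standalone deployment looks roughly like this (mounts and image tag may need adjusting for the host):
docker run -d --name=cadvisor -p 8080:8080 \
  -v /:/rootfs:ro \
  -v /var/run:/var/run:ro \
  -v /sys:/sys:ro \
  -v /var/lib/docker/:/var/lib/docker:ro \
  gcr.io/cadvisor/cadvisor:latest
Prometheus then scrapes cadvisor:8080/metrics like any other target.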
I’d explain that Prometheus follows a pull-based model for reliability, scalability, and
consistency. However, for short-lived jobs (e.g., batch processing), I’d recommend using
Pushgateway, which buffers metrics until scraped.
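For illustration, a batch job could push a completion metric like this (job name and metric are made up):
echo "batch_job_duration_seconds 42" | \
  curl --data-binary @- http://pushgateway:9091/metrics/job/nightly_batch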
11. You need to drop certain metrics before they’re stored. How do you do it?
metric_relabel_configs:
  - source_labels: [__name__]
    regex: "http_requests_total"
    action: drop
To probe external endpoints (e.g., HTTP availability), I use the Blackbox Exporter with a module configuration in blackbox.yml:
modules:
  http_2xx:
    prober: http
    timeout: 5s
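The matching Prometheus scrape job then points real targets at the exporter's /probe endpoint; a sketch (the exporter address blackbox-exporter:9115 is an assumption):
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ['https://example.com']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115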
13. Your team complains that Prometheus is using too much disk space. What do
you do?
I’d check retention settings (--storage.tsdb.retention.time) and analyze which metrics take up space.
I’d optimize by reducing retention time, enabling WAL compression, and dropping high-cardinality metrics via relabeling; for long-term history I’d offload data to remote storage such as Thanos.
To catch instances going down, I define a rule such as:
groups:
  - name: instance_down
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
I’d explain that rate() gives the per-second rate of increase over the window, while increase() gives the total increase over the window:
rate(http_requests_total[5m])
increase(http_requests_total[1h])
Here are 15 work scenario-based PromQL interview questions with answers in first-person:
1. How would you find the total HTTP requests handled by all instances over the
past 5 minutes?
Answer:
I would use increase(http_requests_total[5m]) to calculate the total number of requests in the last 5
minutes. If I need to group them by status code or method, I'd modify it like this:
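For example, grouped by status code (the label name status_code is assumed, matching later examples):
sum by (status_code) (increase(http_requests_total[5m]))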
Answer:
I’d use the rate() function to calculate the per-second CPU usage and normalize it against total
CPU cores:
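A query along these lines works, assuming the standard node_exporter metric names (add by (instance) to the avg for a per-host breakdown):
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))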
Scenario: We suspect that network byte counters reset after a system reboot.
Answer:
I’d use resets() to detect if the counter reset within a given period:
resets(node_network_transmit_bytes_total[1h])
If the result is greater than zero, it means the counter has reset at least once in the last hour.
4. How do you find the top 3 services consuming the most memory?
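Answer:
I’d use topk() over a memory metric; as a sketch (the service label is an assumption — it might be pod or namespace in practice):
topk(3, sum by (service) (container_memory_usage_bytes))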
5. How can you measure the rate of failed HTTP requests over time?
Scenario: Our SRE team wants to track error rates to detect anomalies.
Answer:
I’d filter only failed requests (status_code=~"5..") and apply rate():
rate(http_requests_total{status_code=~"5.."}[5m])
This gives me the per-second failure rate over the last 5 minutes.
Scenario: The product team wants to ensure API latency stays low.
Answer:
I’d divide total request duration by total requests, grouped by service:
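Assuming a standard histogram named http_request_duration_seconds, the average latency per service would be roughly:
sum by (service) (rate(http_request_duration_seconds_sum[5m]))
  / sum by (service) (rate(http_request_duration_seconds_count[5m]))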
Answer:
I’d use predict_linear() to estimate memory usage in the next hour:
predict_linear(node_memory_active_bytes[1h], 3600)
This predicts memory usage based on the last hour’s trend.
Answer:
I’d check the disk I/O latency using rate() on node_disk_write_time_seconds_total:
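For example (metric names follow node_exporter conventions; dividing by completed writes gives average seconds per write):
rate(node_disk_write_time_seconds_total[5m]) / rate(node_disk_writes_completed_total[5m])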
Answer:
I’d use offset 1d to compare today’s traffic with yesterday’s:
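A sketch comparing the current request rate to the same window yesterday:
sum(rate(http_requests_total[5m])) / sum(rate(http_requests_total[5m] offset 1d))
A ratio well above or below 1 indicates traffic has shifted versus yesterday.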
Answer:
I’d use delta() to detect sudden increases over a short time frame:
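For example (the threshold is illustrative; for a pure counter, increase() is usually the safer choice since delta() is intended for gauges):
delta(http_requests_total[2m]) > 1000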
This flags cases where requests jump by more than 1000 within 2 minutes.
11. How can you calculate the 99th percentile request latency?
Scenario: We want to ensure that 99% of our requests complete within an SLA.
Answer:
Using histogram_quantile(0.99, metric), I’d extract the 99th percentile from request duration
histograms:
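Assuming the usual _bucket series from a histogram, the query looks like:
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))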
This tells me the latency threshold that 99% of requests stay under.
Answer:
I’d use rate() on node_network_transmit_bytes_total and node_network_receive_bytes_total:
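For example, per-interface throughput in bytes per second:
rate(node_network_transmit_bytes_total[5m])
rate(node_network_receive_bytes_total[5m])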
Answer:
I’d use absent() to check if a metric is missing:
absent(http_requests_total{job="api-service"})
Answer:
Assuming an active_sessions metric, I’d use count():
count(active_sessions)
Answer:
I’d optimize the query by narrowing the time range, aggregating over fewer labels, and precomputing expensive expressions with recording rules, for example:
groups:
  - name: dashboard_queries
    rules:
      - record: job:cpu_usage:rate5m
        expr: avg by (job) (rate(node_cpu_seconds_total[5m]))
Final Thoughts
Mastering PromQL means understanding when to use which function and how to optimize
queries. In an interview, I’d explain not just the syntax but the reasoning behind each query.
Here are 15 real-world work scenario questions related to Prometheus Alertmanager, along
with first-person answers that demonstrate deep understanding and problem-solving skills.
1. A critical alert is firing, but no one is receiving notifications. What do you do?
Answer:
The first thing I do is check the Alertmanager logs (alertmanager.log) to see if there are any errors.
Then, I verify if the alert is correctly firing in Prometheus using:
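For example:
curl http://localhost:9090/api/v1/alerts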
If the alert exists in Prometheus but isn't reaching Alertmanager, I check my prometheus.yml
configuration for the alerting section. If it is reaching Alertmanager but not notifying, I inspect
alertmanager.yml for receiver configurations and test it manually with:
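A quick manual test is to post a synthetic alert to the Alertmanager v2 API (the alert name here is made up):
curl -XPOST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels": {"alertname": "TestAlert", "severity": "critical"}}]'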
If notifications still don’t work, I verify my SMTP, Slack, or PagerDuty API keys and ensure the
notification service is reachable.
Answer:
I implement alert deduplication, grouping, inhibition, and silences in Alertmanager.
Grouping: I ensure similar alerts (e.g., from the same service) are grouped together to
reduce noise.
Inhibition: I configure inhibition rules so that a critical alert suppresses warning alerts
from the same service (see the sketch after this list).
Silences: I encourage teams to set silences during maintenance windows.
Fine-Tuning Alerting Rules: I avoid overly broad conditions and use for: clauses to
prevent flapping alerts.
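A minimal inhibition rule sketch (the service label is an assumption and must match the labels on the alerts):
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['service']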
Answer:
I deploy multiple Alertmanager instances and enable clustering with the --cluster flags so they gossip state and deduplicate notifications (peer addresses depend on the environment):
alertmanager --config.file=alertmanager.yml \
  --cluster.listen-address="0.0.0.0:9094" \
  --cluster.peer="alertmanager-1:9094" \
  --cluster.peer="alertmanager-2:9094"
I then verify the cluster status with:
curl http://localhost:9093/api/v2/status
4. How do you configure Alertmanager to send alerts to both Slack and email?
Answer:
I configure multiple receivers in alertmanager.yml:
receivers:
  - name: 'email'
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.example.com:587'
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        api_url: 'https://hooks.slack.com/services/XXXX/XXXX/XXXX'
route:
  receiver: 'email'
  routes:
    - match:
        severity: critical
      receiver: 'slack'
This ensures critical alerts go to Slack, while all alerts are emailed.
Answer:
First, I check if the alert rule has a for: condition in alert.rules.yml causing a delay:
- alert: HighMemoryUsage
  expr: node_memory_Active_bytes / node_memory_MemTotal_bytes > 0.9
  for: 10m  # must be above threshold for 10 minutes before firing
Then, I check the Alertmanager logs for any bottlenecks and test notifications manually to
confirm there are no external delays.
Answer:
I create a silence using the Alertmanager UI (http://localhost:9093) or via API:
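Via amtool, creating a silence could look roughly like this (matcher, duration, and comment are examples):
amtool silence add alertname="HighMemoryUsage" \
  --alertmanager.url=http://localhost:9093 \
  --duration=2h --comment="Planned maintenance"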
Answer:
I define routing rules in alertmanager.yml:
route:
  receiver: 'email'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack'
Answer:
I check if the alert’s for: condition is too short. If so, I increase it.
Next, I verify that Prometheus is correctly resolving alerts:
curl -X GET http://localhost:9090/api/v1/alerts
I also check that resolve_timeout in alertmanager.yml is set appropriately:
global:
  resolve_timeout: 5m
Answer:
I configure a webhook receiver in alertmanager.yml:
receivers:
  - name: 'webhook'
    webhook_configs:
      - url: 'http://webhook-server.com/alerts'
Answer:
I configure deduplication using group_by in alertmanager.yml:
route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
Answer:
I route critical alerts to PagerDuty and low-priority alerts to email:
route:
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'email'
Answer:
I run multiple Alertmanager instances with a load balancer and clustering.
Answer:
I include clear descriptions, remediation steps, and links to dashboards in annotations.
Answer:
I use Prometheus’ "Alerts" UI or run:
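For example:
curl http://localhost:9090/api/v1/alerts
or, for Alertmanager's view of active alerts:
amtool alert query --alertmanager.url=http://localhost:9093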
With these answers, you’re 100% prepared for any Alertmanager interview question!
Here are 15 real-world work scenario questions along with first-person responses that
demonstrate deep knowledge of Scaling and Performance Optimization in Prometheus.
First, I’d check the number of active time series using the count({__name__=~".+"}) query. If it’s too
high, I’d investigate label cardinality issues using count by (__name__)({__name__=~".+"}). If I find
high-cardinality labels like pod_name, I’d drop or relabel them in prometheus.yml. Next, I’d reduce
retention time using --storage.tsdb.retention.time=30d and enable WAL compression with --
storage.tsdb.wal-compression. If needed, I’d offload old data to Thanos or a remote storage
backend.
I’d first check which queries are slow using the Prometheus query log or Grafana’s Query Inspector. If a query involves expensive operations like
rate() over a large time window, I’d create recording rules to precompute and store the results.
I’d also ensure that I’m using group by effectively to reduce cardinality. If the slow queries
involve historical data, I’d integrate Thanos or Cortex to separate long-term storage from real-
time queries.
I’d deploy multiple Prometheus instances, each dedicated to a specific team, and use
Kubernetes namespaces to enforce isolation. For querying across teams, I’d use Thanos Querier
with proper RBAC to control access. Alternatively, if we need true multi-tenancy, I’d opt for
Cortex or Grafana Mimir, which natively support multi-tenancy.
4. Prometheus is scraping too many targets, causing high CPU usage. What’s your
approach?
I’d optimize scrape intervals by increasing scrape_interval for non-critical jobs from 15s to 30s or
1m. I’d also use relabeling rules to drop unnecessary metrics at the source. If we’re in
Kubernetes, I’d use podMonitor selectors to limit what gets scraped. If load is still high, I’d shard
Prometheus instances based on job labels.
I’d deploy two or more Prometheus instances in an Active-Active setup, where each instance
scrapes the same targets. I’d use Thanos Querier or load balancing via Nginx to aggregate
queries. For alerting, I’d configure Alertmanager in an HA setup with consistent hashing to
avoid duplicate alerts.
6. Prometheus is not scraping metrics from a target. How do you debug this?
First, I’d check if the target is reachable using curl <target>:<port>/metrics. If it’s down, I’d
investigate the service itself. If it’s up but missing in Prometheus, I’d check the Service
Discovery page (/targets) to see if the target is listed but failing. If it’s missing entirely, I’d check
the scrape_configs in prometheus.yml. If relabeling is causing issues, I’d validate using promtool check
config.
7. Our alerts are firing multiple times from different Prometheus instances. How do
you prevent duplicates?
I’d deploy Alertmanager in HA mode and ensure Prometheus instances send alerts to all
Alertmanagers. Then, I’d configure --cluster.peer in Alertmanager so instances can deduplicate
alerts. If using Thanos Ruler for alert evaluation, I’d enable --query.replica-label=prometheus_replica
to deduplicate queries.
9. Prometheus keeps restarting due to OOM (Out of Memory) errors. What do you
do?
I’d first check memory consumption using kubectl top pod if running in Kubernetes. If it’s
consuming too much memory, I’d limit retention ( --storage.tsdb.retention.time=15d) and reduce
time series cardinality by relabeling. I’d also check for expensive queries running frequently in
Grafana and optimize them. If needed, I’d increase resources in Kubernetes or offload old data
to Thanos.
10. We need to scale Prometheus for 10,000+ targets. How do you architect it?
I’d shard Prometheus instances using a hashmod strategy based on instance labels, ensuring each
instance scrapes only a subset of targets. I’d deploy Thanos Querier to aggregate queries across
all instances. To handle long-term storage, I’d configure remote_write to send data to Thanos
Store or Cortex.
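A sketch of the hashmod sharding relabel config (four shards assumed; each Prometheus instance keeps a different bucket):
relabel_configs:
  - source_labels: [__address__]
    modulus: 4
    target_label: __tmp_hash
    action: hashmod
  - source_labels: [__tmp_hash]
    regex: '0'   # this instance keeps shard 0 only
    action: keep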
11. Some metrics are missing from Prometheus. How do you debug?
I’d first check if the target is exposing metrics using curl <target>:<port>/metrics. If it’s there but not
in Prometheus, I’d verify /targets in the Prometheus UI. If relabeling is dropping metrics, I’d
validate the prometheus.yml configuration using promtool check config. If the issue persists, I’d check
prometheus.log for scrape errors.
I’d first reduce retention time (--storage.tsdb.retention.time=30d). Then, I’d enable WAL
compression to reduce storage overhead. I’d drop high-cardinality metrics using relabeling and
ensure only necessary time series are stored. If needed, I’d configure Thanos, Cortex, or Mimir
for long-term storage.
13. Our PromQL query is timing out. How do you optimize it?
I’d check whether the query is scanning too much data using the Prometheus query log or Grafana’s Query Inspector. If it’s aggregating a large
dataset, I’d create recording rules to store precomputed values. If it involves historical data, I’d
ensure it queries Thanos Store instead of Prometheus directly. If using large labels, I’d reduce
cardinality.
14. How do you handle a scenario where one Prometheus instance goes down?
If we’re using Active-Active HA, the second instance continues scraping, and Thanos Querier
ensures continuity. If Active-Passive, I’d have a failover mechanism in place using Kubernetes
PodDisruptionBudgets or an external load balancer like HAProxy.
I’d configure federation by having a global Prometheus scrape key metrics from regional
Prometheus instances using the /federate endpoint. I’d define match[] to ensure only necessary
metrics are scraped to avoid excessive load. If needed, I’d use Thanos Querier for cross-instance
queries instead of federation.
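A sketch of the global instance's federation job (regional addresses and match[] selectors are placeholders):
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets: ['prometheus-us:9090', 'prometheus-eu:9090']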
This set of 15 work scenarios gives you real-world problem-solving experience that aligns with
interview-style questions. Would you like more hands-on exercises or mock whiteboard
scenarios?
Here are 15 real-world Grafana scenarios, with answers from a first-person perspective as if I'm
explaining in an interview.
1. How would you troubleshoot a Grafana dashboard that is not displaying any
data?
First, I check if the data source is correctly configured and reachable by navigating to
Configuration > Data Sources and clicking "Test." If it's failing, I verify the endpoint and
authentication details. Next, I inspect the queries in the panel settings to ensure they are
fetching the correct metrics. I also check if filters or time ranges are too restrictive. If it's a
Prometheus issue, I run the query directly in Prometheus to confirm if the data exists. Finally, I
look for any errors in the Grafana logs (/var/log/grafana/grafana.log or docker logs grafana).
To alert when a metric such as CPU usage crosses a threshold, I set the visualization to Gauge or Graph. Then, I navigate to the Alert tab, click "Create Alert,"
set the threshold to 80, and define evaluation intervals (e.g., every 30s for 5 minutes). Lastly, I
configure a notification channel (Slack, email, PagerDuty) in Alerting > Notification Channels.
First, I discuss their requirements—do they need slow queries, query count, latency, or error
rates? If the database supports it (e.g., MySQL, PostgreSQL), I connect Grafana to the database
and use queries like:
SELECT query, total_time, calls FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10;
I create a table panel for slow queries and a time-series panel for query execution times. If
real-time updates are needed, I set a refresh rate of 5-10 seconds.
4. How would you share a Grafana dashboard with a stakeholder who doesn’t have
access to Grafana?
Use the "Share" button to generate a public link (if public sharing is enabled).
Export the dashboard as a PDF report using Enterprise Grafana.
Export the dashboard as JSON so they can import it into another Grafana instance.
Set up automatic email reports if the stakeholder needs regular updates.
To avoid hardcoding hosts, I create a dashboard template variable with a query like:
label_values(node_cpu_seconds_total, instance)
This dynamically populates a dropdown with available instances. I then update my panels to
use $instance instead of hardcoding values. This way, users can filter data dynamically.
I use Grafana Organizations to separate access between teams. Within an organization, I assign
roles:
For finer control, I use Folder Permissions to restrict certain dashboards to specific users or
teams.
If a dashboard loads slowly, I optimize queries by reducing the data range or aggregation window (e.g., using rate() instead of raw metrics).
Set a longer refresh interval (e.g., 30s instead of 5s).
Reduce the number of panels per dashboard.
Enable caching in the data source (if supported).
For Prometheus, use recording rules to precompute heavy queries.
8. How would you migrate dashboards from one Grafana instance to another?
I go to Dashboard Settings > JSON Model and export the JSON. Then, in the new instance, I use
Import Dashboard and paste the JSON. If migrating multiple dashboards, I use the Grafana API:
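A rough sketch using the HTTP API (token, hostnames, and UID are placeholders; the exported JSON typically needs its meta block removed and "overwrite": true added before importing):
curl -H "Authorization: Bearer $GRAFANA_TOKEN" \
  https://old-grafana.example.com/api/dashboards/uid/<uid> > dashboard.json
curl -X POST -H "Authorization: Bearer $GRAFANA_TOKEN" -H "Content-Type: application/json" \
  -d @dashboard.json https://new-grafana.example.com/api/dashboards/db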
If gaps appear, I check the query interval—too small an interval may return no data. I also use
"Null value handling" in the panel settings, setting it to "Connected" or "Last Value" to
interpolate missing points. If it's a Prometheus query, I use step= in the query options to adjust
granularity.
10. How would you integrate Grafana with an authentication system like LDAP or
OAuth?
For LDAP, I enable it in grafana.ini and configure the server in ldap.toml:
[[servers]]
host = "ldap.example.com"
port = 389
bind_dn = "cn=admin,dc=example,dc=com"
bind_password = "secret"
For OAuth (Google, GitHub, etc.), I enable it in grafana.ini and provide the client ID/secret.
11. A service is reporting incorrect metrics in Grafana. How do you debug this?
First, I run the same query in the data source directly (e.g., PromQL in Prometheus). If the data
looks wrong, I check the exporter’s logs (e.g., node_exporter or custom exporters). I also verify
the scrape configuration in Prometheus and check for missing labels or misconfigurations.
I use Grafana Loki as the data source. I configure Loki to ingest logs and use LogQL to query
them. Then, I add a Logs Panel in Grafana with a query like:
{job="nginx"} |= "error"
I integrate Grafana with NetFlow, SNMP, or Prometheus Node Exporter. A common PromQL
query for bandwidth usage is:
rate(node_network_receive_bytes_total[5m])
I regularly export dashboards as JSON and store them in Git. I also back up the Grafana
database, which stores dashboards:
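With the default SQLite backend, a simple copy works (paths are illustrative; a MySQL/Postgres backend would use its own dump tooling):
cp /var/lib/grafana/grafana.db /backups/grafana-$(date +%F).db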
I enable Grafana's internal metrics via the Prometheus data source and use the built-in
dashboard. A key query is:
rate(grafana_http_request_duration_seconds_sum[5m])
which shows API request latency. I also monitor disk usage, memory consumption, and alert
failures.
Final Thoughts
These are real-world Grafana problems I’ve encountered and solved. If I were in an interview,
I’d explain these in detail while also demonstrating solutions hands-on in Grafana itself!
I first check the query execution time using the Query Inspector to identify slow queries. If it's
a Prometheus query, I optimize it by reducing time ranges, adding rate() instead of increase(), and
avoiding or conditions. If it's a database query, I use indexing and reduce GROUP BY. I also check
panels using multiple queries and optimize the data source settings, such as increasing the
cache refresh rate.
I use templating with variables. I define a query variable like $environment that fetches values
dynamically (e.g., label_values(instance, environment)). Then, I use $environment in all queries so users
can switch environments dynamically. I also use chained variables so that selecting an
environment updates related variables, like pod names.
3. How do you set up alerting for high CPU usage in a Kubernetes cluster?
I create a Grafana alert on a query that tracks idle CPU percentage, for example:
100 * avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) < 20
I set thresholds (warning at 40%, critical at 80%) and configure a notification channel (Slack,
PagerDuty). I also enable for duration, so alerts trigger only if the issue persists for at least 5
minutes, preventing false alarms.
{job="nginx"} |= "error"
I then add a logs panel and link it to metrics dashboards using dashboard links. I also enable
logs-to-trace correlation, allowing users to jump from logs to related traces in Tempo.
To track failures per service, I graph:
sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)
To highlight anomalies, I enable thresholds (e.g., turn the graph red when failures exceed a
limit).
I use Transformations. For example, if one panel shows MySQL order counts and another
shows Prometheus API requests, I apply Merge tables and Add field from calculation to create
a unified metric.
For release tracking, I configure dashboard annotations fed by deployment events (e.g., a CI/CD webhook or a deployment metric). This automatically marks deployments on the dashboard, helping teams correlate performance
drops with releases.
I check the Loki ingestion pipeline to ensure logs are being received. I run:
{job="nginx"}
If nothing appears, I check Promtail logs for ingestion errors. If logs are ingested but not shown,
I verify log labels in the query and look for rate-limited queries.
I create panels for key services with metrics like request rate, error rate, and latency. I enable
service discovery using variables so users can select a service dynamically. I also integrate
Tempo for tracing to visualize service dependencies.
10. How do you identify why a request is slow?
I start with metrics in Grafana (e.g., latency trends). If an API has high response times, I check
logs in Loki for errors. If the issue persists, I use tracing in Tempo to identify slow spans and
bottlenecks (e.g., slow database queries or external calls).
label_values({job="nginx"}, level)
{job="nginx", level="$logLevel"}
12. How do you set up a Grafana alert that prevents false positives?
I use for duration to trigger alerts only if an issue persists. For example, instead of alerting
immediately when CPU > 80%, I require it to be over 80% for at least 5 minutes. I also use
multiple conditions (e.g., CPU and memory must be high) to reduce noise.
I check the alert rule evaluation in the Grafana UI to see if it's returning the expected values. If
it's a PromQL alert, I run the query manually in Explore mode to verify correctness. If it's a
notification issue, I test the notification channel and check logs for errors.
Would you like more specific interview questions tailored to a particular data source, such as
Prometheus, Loki, or MySQL?
15 Work Scenario Questions & Answers for Prometheus & Grafana in Kubernetes
1. How would you set up Prometheus and Grafana in a new Kubernetes cluster?
I would use Helm to deploy the kube-prometheus-stack, which includes Prometheus, Grafana, and
Alertmanager. First, I’d add the Helm repo and update it:
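Roughly (release name and namespace are my own choices here):
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace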
After verifying the pods are running, I’d port-forward Grafana and access it via
http://localhost:3000/, then configure Prometheus as a data source.
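The port-forward step would look something like this (the service name depends on the Helm release name):
kubectl port-forward svc/prometheus-stack-grafana 3000:80 -n monitoring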
2. What would you do if Prometheus stopped collecting metrics from a Kubernetes
service?
I’d first confirm that Prometheus is collecting the data by querying it directly in the
Prometheus UI. If the metrics are missing there, I’d check the ServiceMonitor or scrape configuration, the service’s /metrics endpoint, and the Targets page for scrape errors.
4. A pod is using too much memory, how would you alert on that in Prometheus?
groups:
  - name: high-memory-usage
    rules:
      - alert: HighMemoryUsage
        expr: container_memory_usage_bytes{container="my-app"} > 500000000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"
To expose Prometheus outside the cluster, I’d create an Ingress:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-ingress
  namespace: monitoring
spec:
  rules:
    - host: prometheus.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-stack-prometheus
                port:
                  number: 9090
I’d use cAdvisor, which is built into Kubelet, to expose container metrics. Prometheus scrapes it
via:
- job_name: 'kubernetes-cadvisor'
  static_configs:
    - targets: ['kubelet:10250']
For application-specific monitoring, I’d instrument the app to expose /metrics and configure a
ServiceMonitor in Prometheus.
If there are too many unique labels, I’d optimize metric collection by dropping or relabeling high-cardinality labels (e.g., pod UIDs, request IDs) with metric_relabel_configs and aggregating metrics before exposing them.
I’d use Thanos or Cortex for scalable, distributed storage. If using Thanos, I’d deploy the Thanos sidecar alongside each Prometheus, plus the Querier, Store Gateway, and Compactor backed by object storage.
10. How do you create a custom Grafana dashboard for Kubernetes metrics?
I’d log into Grafana, go to Dashboards → New Dashboard → Add a new panel.
Then, I’d select Prometheus as the data source, write a PromQL query (for example, pod CPU or memory usage), pick a visualization, and add thresholds and template variables as needed.
I’d expose a /metrics HTTP endpoint in my app using a Prometheus client library.
For example, in Python using prometheus_client:
from prometheus_client import Counter, start_http_server

requests_total = Counter("requests_total", "Total requests handled")

def handle_request():
    requests_total.inc()  # increment on every handled request

start_http_server(8000)  # exposes /metrics on port 8000
To route alerts to Slack, I configure a receiver in alertmanager.yml:
receivers:
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        api_url: 'https://hooks.slack.com/services/...'
15. How do you ensure Prometheus and Grafana survive pod restarts?
I’d back Prometheus with a PersistentVolumeClaim so its TSDB survives restarts:
volumeMounts:
  - name: prometheus-storage
    mountPath: /prometheus
volumes:
  - name: prometheus-storage
    persistentVolumeClaim:
      claimName: prometheus-pvc
For Grafana, I’d store dashboards in a ConfigMap and use a PVC for storage.
These scenarios should prepare you for any real-world Kubernetes monitoring challenges! Let
me know if you need deeper explanations on any.
Here are 15 real-world work scenario questions, along with first-person answers to help
you sound confident in an interview.
1. You notice that your application’s latency has increased significantly. How do
you troubleshoot it?
Answer:
First, I check my Prometheus metrics for response time trends and RED metrics (Requests,
Errors, Duration). If latency is high, I look at CPU, memory, and request queue saturation using
system metrics. Next, I correlate logs using trace IDs to pinpoint slow database queries or
external dependencies. If needed, I enable distributed tracing (OpenTelemetry/Jaeger) to
follow a request’s path across services. Based on findings, I optimize queries, scale resources, or
implement caching where necessary.
2. A critical alert fires, indicating high error rates. What steps do you take?
Answer:
I immediately check Prometheus alerts to see which services are affected. Then, I query logs in
Loki/Splunk using the trace_id to find common error patterns. If errors are due to bad
deployments, I check the last few commits in CI/CD and roll back if needed. If it’s an
infrastructure issue, I validate network/DNS failures or database connectivity. Throughout, I
update stakeholders with my findings and resolution timeline.
3. Your manager asks you to define an SLO for an API. How do you approach this?
Answer:
I first determine the critical SLIs, such as availability, latency, and error rate. Then, I define a
realistic SLO target, for example, 99.9% availability over 30 days with an error budget of 0.1%.
I create PromQL queries to measure compliance and set alerts if we exceed our error budget.
Finally, I present the SLO to stakeholders and adjust based on business needs.
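As an illustration (metric and label names are assumptions), the 30-day error ratio that the 0.1% budget is measured against could be:
sum(rate(http_requests_total{status_code=~"5.."}[30d])) / sum(rate(http_requests_total[30d]))
An alert fires when this ratio approaches 0.001.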
4. You need to instrument a new service with Prometheus metrics. What do you
do?
Answer:
I use the appropriate Prometheus client library (e.g., prometheus-client for Python). I define key
metrics like http_requests_total, request_latency_seconds, and expose them via /metrics. I ensure I use
low-cardinality labels to avoid excessive memory usage. Once deployed, I add Prometheus
scrape configurations and create Grafana dashboards to visualize performance trends.
5. A developer complains that logs are not helpful. How do you improve logging?
Answer:
I first assess the log format—structured JSON logs with key fields like timestamp, service_name,
trace_id, and error_message are ideal. I ensure trace IDs are propagated across logs for end-to-end
tracing. If logs are too verbose or missing context, I implement log levels (INFO, DEBUG,
ERROR) and set up log correlation using Loki or Elasticsearch.
6. A security team flags that observability data might contain sensitive information.
How do you address this?
Answer:
I review log masking policies to ensure PII, passwords, and API keys are never logged. I enable
TLS encryption for telemetry data and apply RBAC (Role-Based Access Control) to restrict
access to logs and metrics. I also configure log retention policies to avoid storing data longer
than necessary.
7. A production service is slow, but no alerts have fired. How do you catch this
earlier?
Answer:
I review our SLIs and alert thresholds to ensure they reflect real-world performance. If users
notice slowness before alerts trigger, I lower the latency alert threshold. I also enable synthetic
monitoring (Blackbox Exporter) to proactively detect slowness before users report it.
Answer:
I start by instrumenting our services using OpenTelemetry SDKs while keeping our existing
monitoring intact for comparison. I set up an OpenTelemetry Collector to process and forward
telemetry to Prometheus, Jaeger, and Loki. I ensure traces, logs, and metrics are correlated,
then validate performance before fully migrating.
Answer:
I check which labels have excessive cardinality by running:
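For example (the second query assumes a suspect label such as user_id on a hypothetical metric):
topk(10, count by (__name__)({__name__=~".+"}))
count(count by (user_id) (http_requests_total))
The first shows which metrics have the most series; the second counts the distinct values of a single label.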
If a label like user_id is too dynamic, I replace it with a low-cardinality alternative, like
customer_tier or region. If needed, I aggregate metrics at the service level before pushing them to
Prometheus.
10. A new team is unsure how to use observability tools effectively. How do you
guide them?
Answer:
I host a hands-on session covering Prometheus querying (PromQL), Grafana dashboards,
distributed tracing (Jaeger), and log searching (Loki/Elasticsearch). I provide example SLO
dashboards and teach them how to correlate logs, traces, and metrics for troubleshooting.
11. You suspect that your database is slowing down API responses. How do you
confirm this?
Answer:
I use Prometheus metrics to check for high database query latency. I correlate slow API
requests in traces (Jaeger) to see if a specific SQL query is taking too long. If needed, I enable
slow query logging in the database and check for indexing issues or connection pool
saturation.
12. A recent deployment caused an outage. How do you prevent this in the future?
Answer:
I implement progressive deployments (canary or blue-green) with automated rollback if error
rates exceed thresholds. I add pre-release synthetic monitoring tests and ensure our SLOs
include error budget tracking to catch issues before they impact users.
13. A service is reporting 100% success rates, but users still experience failures.
What could be wrong?
Answer:
I check if our metrics only capture HTTP 500 errors but ignore timeouts and client-side
failures. I analyze distributed traces to detect requests that never complete. If necessary, I
modify the SLOs to include network failures, retries, and timeouts in error calculations.
14. Your team is over-alerted, leading to fatigue. How do you improve alerting?
Answer:
I conduct an alert review session to remove noisy or redundant alerts. I implement multi-level
alerts, where warnings notify teams but only critical issues trigger a page. I also introduce auto-
remediation scripts for common issues (e.g., auto-scaling when CPU usage spikes).
15. A customer reports an issue that occurred 3 days ago. How do you investigate?
Answer:
I use log retention policies to retrieve historical logs from Loki/Elasticsearch. I filter logs by
trace_id, user_id, or request path to reconstruct what happened. I then correlate logs with
Prometheus metrics from that time period to see if any resource spikes or failures occurred. If
necessary, I check Jaeger traces to find where the request failed.
Final Thoughts
Mastering these scenarios means you're well-prepared for real-world observability challenges.
If you want more practice, I can create more complex cases or quiz you on any of these!
Here are 15 work scenario questions with first-person responses, tailored for monitoring,
observability, and troubleshooting with Prometheus, Grafana, and related tools.
1. How would you set up a full observability stack for a new application?
"First, I’d define key metrics, logs, and traces needed for observability. Then, I’d set up
Prometheus for metrics collection, Loki for logs, and Jaeger for tracing. I’d instrument the
application using OpenTelemetry and Prometheus client libraries. Finally, I’d use Grafana for
visualization and set up Alertmanager for proactive issue detection."
"I’d first check Grafana dashboards for latency trends and identify affected endpoints. Using
Prometheus, I’d query response time metrics (http_request_duration_seconds). If the latency
originates from a database, I’d check database_query_duration_seconds. I’d then use Jaeger tracing
to see slow spans and logs in Loki to find error messages. Based on findings, I’d optimize queries,
add caching, or scale resources."
3. Your Prometheus server is consuming too much storage. How do you fix it?
"I’d check Prometheus retention settings (storage.tsdb.retention.time) and ensure it’s not storing
unnecessary historical data. I’d analyze scrape intervals and reduce them if they’re too frequent.
If needed, I’d offload old metrics to a long-term storage solution like Thanos or Cortex."
4. How do you handle alert fatigue from too many Prometheus alerts?
"I’d review existing alerts and classify them by severity (critical vs. warning). I’d use
Alertmanager inhibition rules to suppress lower-priority alerts when a higher-priority alert is
firing. Additionally, I’d tune thresholds and use rate-based alerts instead of static ones to reduce
noise."
"I’d first check if the metric is exposed by the application using curl on the /metrics endpoint. If it's
missing, I’d review the application’s instrumentation code. If the metric is present but not in
Prometheus, I’d check scrape_configs in prometheus.yml to ensure the target is being scraped. Logs
from Prometheus would help diagnose failures."
6. How do you monitor a service that does not expose metrics natively?
"I’d create a custom Prometheus exporter in Python or Go to scrape data from logs, APIs, or
databases. The exporter would expose the data in the Prometheus format, and I’d configure
Prometheus to scrape it. If necessary, I’d use Pushgateway for short-lived jobs."
"I’d check Kubernetes logs using kubectl logs <pod>. If the restart is due to an OOMKill, I’d inspect
memory usage via Prometheus (container_memory_usage_bytes). If it’s crashing due to high CPU, I’d
check container_cpu_usage_seconds_total. If the service is failing health checks, I’d review its
readiness/liveness probes."
8. Your Grafana dashboard is slow. How do you optimize it?
"I’d first check query execution times in Grafana’s Query Inspector. If queries are slow, I’d
optimize them by reducing the time range or aggregating data (e.g., rate() instead of count()). I’d
also enable Prometheus recording rules to precompute frequently used queries. Finally, I’d
ensure Prometheus isn't overwhelmed with high cardinality metrics."
"I’d first check Prometheus metrics for CPU, memory, and request rate spikes. I’d scale the
affected service using Kubernetes Horizontal Pod Autoscaler (HPA) based on CPU or request
rate. If the database is the bottleneck, I’d optimize queries or introduce caching (e.g., Redis). I’d
also set up rate limits to prevent overload."
10. How do you correlate logs, metrics, and traces to investigate an issue?
"I’d use Grafana Loki for logs, Prometheus for metrics, and Jaeger for tracing. If an alert fires
for high latency, I’d check Jaeger traces to see which services contribute to the delay. I’d then
look at Loki logs for errors and correlate them with Prometheus metrics to find root causes."
"I’d check Prometheus target status at /targets to see if the endpoint is reachable. If it's down, I’d
confirm the service is running and exposing metrics. If the target is up but has errors, I’d check
for TLS/SSL issues, authentication problems, or incorrect scrape configurations in
prometheus.yml."
12. Your alerts are firing too late. How do you make them more responsive?
"I’d reduce Prometheus evaluation_interval and adjust alert thresholds to trigger sooner. If the
data is delayed due to scrape intervals, I’d make them more frequent for critical metrics. I’d also
use Grafana real-time alerts for near-instant notifications."
13. How do you measure and improve database query performance using
Prometheus?
"I’d monitor query execution times using pg_stat_statements (for PostgreSQL) or slow query logs
(for MySQL). In Prometheus, I’d collect database_query_duration_seconds and visualize it in Grafana.
If queries are slow, I’d add indexes, optimize joins, or use caching. I’d also check if connection
pooling is needed."
"I’d start by defining SLIs (Service Level Indicators) and SLOs (Service Level Objectives) for key
services. I’d standardize observability tools across teams (e.g., Prometheus for metrics, Loki for
logs, Jaeger for tracing). I’d create dashboards, alerting policies, and a runbook. Finally, I’d set
up a centralized observability platform for a unified view."
15. You need to set up monitoring for a multi-cloud environment. How do you do it?
"I’d deploy a federated Prometheus setup, where regional Prometheus instances collect data
and forward it to a central Prometheus or Thanos for long-term storage. I’d ensure cloud-
specific exporters (AWS CloudWatch Exporter, GCP Stackdriver Exporter) are in place. Grafana
would be used as a single-pane-of-glass dashboard."
Final Thoughts
These responses demonstrate deep knowledge while being clear, structured, and actionable.
If you're preparing for an interview, practice explaining your reasoning behind each step. Want
me to refine anything further?