
Here are 15 real-world Prometheus work scenarios with answers in the first
person, as if I were answering in an interview.

1. A critical production service is running slow. How do you start debugging with
Prometheus?

First, I check the Prometheus UI or Grafana dashboards for key metrics like CPU, memory, and
request latency. I use PromQL queries such as:

rate(http_requests_total[5m])

to see if there’s an increase in traffic. If CPU is high, I check:

100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))

If I identify resource exhaustion, I scale up the service or optimize queries.
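
Two follow-up queries I often reach for at this stage, assuming node_exporter is installed and the service exposes a standard http_request_duration_seconds histogram (the histogram name is an assumption about the app's instrumentation):

100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

The first gives memory utilization per node; the second gives p95 request latency, which tells me whether the slowness affects only the tail or all requests.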

2. How do you set up monitoring for a new microservice?

I ensure the service exposes a /metrics endpoint and configure Prometheus to scrape it by
updating prometheus.yml:

scrape_configs:
  - job_name: 'my-service'
    static_configs:
      - targets: ['my-service:8080']

Then, I test the integration with:

curl http://my-service:8080/metrics

If metrics appear, I visualize them in Grafana.

3. Prometheus suddenly stops scraping a target. What do you do?

I first check the Targets tab in Prometheus (http://prometheus:9090/targets). If the target is down, I
verify:

- Service discovery (in Kubernetes: kubectl get endpoints my-service)
- Firewall rules or network issues (e.g., telnet my-service 8080)
- Prometheus logs (docker logs prometheus or kubectl logs prometheus)

If it's a scrape interval issue, I increase logging and manually test with:

curl http://my-service:8080/metrics

4. How do you create an alert if CPU usage goes above 80%?

I create an alerting rule in prometheus.rules.yml:

groups:
  - name: cpu_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "CPU usage is above 80% for 5 minutes"

Then, I reload Prometheus and ensure Alertmanager is configured to send notifications.

5. How do you scale Prometheus for large environments?

For large-scale environments, I use:

1. Thanos or Cortex for long-term storage and high availability.
2. Sharding by running multiple Prometheus instances, each scraping different targets.
3. Remote storage integrations (e.g., InfluxDB, VictoriaMetrics).
4. Federation, where a root Prometheus server aggregates data from multiple child instances.

6. How do you monitor Prometheus itself?

I enable Prometheus’s built-in metrics by scraping http://prometheus:9090/metrics.


I track:

- Target scrape failures: rate(prometheus_target_scrapes_total[5m])
- Query load: rate(prometheus_engine_queries_total[5m])
- TSDB disk usage: prometheus_tsdb_head_series

If it starts consuming too much memory, I optimize retention settings or offload data to Thanos.

7. What’s the difference between rate() and irate()?

- rate() calculates the average per-second rate over a time range, using all samples in the window. Best for long-term trends.
- irate() uses only the two most recent samples in the range, so it reacts faster to short-term spikes.

Example: If I want to analyze HTTP request trends, I use:

rate(http_requests_total[5m])

But if I need real-time anomaly detection, I use:

irate(http_requests_total[1m])

8. How do you reduce Prometheus storage usage?

1. Lower retention (default is 15d): --storage.tsdb.retention.time=7d
2. Drop unneeded metrics using metric_relabel_configs.
3. Use remote storage (Thanos, VictoriaMetrics) for long-term storage.
4. Increase scrape intervals from 15s to 30s or 1m if high granularity isn't needed.

9. How do you handle high cardinality issues in Prometheus?

High cardinality (too many unique labels) slows down queries and increases memory usage.
I address this by:

1. Avoiding dynamic labels (e.g., user_id, request_id).
2. Using histograms and summaries instead of raw per-request metrics.
3. Aggregating away noisy labels with by()/without() in queries and recording rules, or dropping them at scrape time with metric_relabel_configs.
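
To see which metrics and labels are actually driving cardinality, these are the checks I'd run (the first is expensive, so I run it sparingly; user_id is just a hypothetical offending label):

topk(10, count by (__name__)({__name__=~".+"}))

count(count by (user_id)(http_requests_total))

The first lists the ten metric names with the most series; the second counts how many distinct user_id values a single metric carries.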

10. How do you monitor a database like MySQL with Prometheus?


I deploy the MySQL Exporter:

docker run -d -p 9104:9104 --name mysql-exporter \
  -e DATA_SOURCE_NAME="user:password@tcp(localhost:3306)/" \
  prom/mysqld-exporter

Then, I configure Prometheus:

scrape_configs:
  - job_name: 'mysql'
    static_configs:
      - targets: ['mysql-exporter:9104']

I query metrics like:

mysql_global_status_threads_running

for active threads.

11. How do you create a custom metric in a Python application?

I use the prometheus_client library:

from prometheus_client import start_http_server, Counter

REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP Requests')

def handle_request():
    REQUEST_COUNT.inc()

start_http_server(8000)

This exposes metrics at /metrics, which Prometheus scrapes.

12. How do you monitor Kubernetes with Prometheus?

I deploy the Prometheus Operator using Helm:

helm install prometheus prometheus-community/kube-prometheus-stack

This auto-discovers Kubernetes resources like Nodes, Pods, and Services.


13. What’s the difference between histogram_quantile() and rate()?

- rate() calculates the per-second rate of counters over time.
- histogram_quantile() estimates percentiles (e.g., p95 latency) from histogram buckets.
Example:

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

gives the 95th percentile response time.

14. How do you handle Prometheus downtime?

1. Use Thanos or Cortex for HA and remote storage.


2. Run redundant Prometheus instances with a load balancer.
3. Monitor Prometheus itself with alerts on up{job="prometheus"}.

15. How do you debug slow PromQL queries?

1. Enable query logging with the query_log_file setting (or raise log verbosity with --log.level=debug).
2. Check query execution time in the Prometheus UI (http://prometheus:9090/graph).
3. Optimize filters by reducing label combinations.
4. Avoid unnecessary joins (group_left and group_right).

Final Thoughts

With these answers, you should be well-prepared for any Prometheus interview! Let me know
if you want more deep dives.

Here are 15 real-world work scenario questions related to Prometheus metrics collection,
along with first-person answers to help you confidently handle an interview.

1. A service is missing in Prometheus. How do you troubleshoot?

I first check if the service is listed in prometheus.yml under scrape_configs. If it's dynamically
discovered, I use the targets API (/api/v1/targets) to verify if it's registered. If not, I inspect service
discovery settings. If it's listed but unhealthy, I check firewall rules, network connectivity, and
ensure the service exposes /metrics. Finally, I check logs for Prometheus and the service itself.

2. You need to monitor custom application metrics. How do you proceed?

I’d first check if the application has built-in Prometheus support. If not, I’d implement a custom
exporter using Python (prometheus_client) or Go (prometheus/client_golang). I’d expose an HTTP
endpoint (/metrics), register relevant metrics (Counters, Gauges, Histograms, or Summaries), and
update prometheus.yml to scrape it.

3. You notice high cardinality metrics are consuming too much memory. What do
you do?

I’d analyze which labels contribute to high cardinality using count by (__name__) and count by
(label_name). If unnecessary labels (like user_id, request_id) exist, I’d relabel or drop them using
metric_relabel_configs. If needed, I’d use histograms instead of summaries to aggregate data
more efficiently.

4. The team wants to monitor API latency. Which metric type should I use?

I’d use a Histogram since it captures the distribution of latencies, allowing us to compute
percentiles later. I’d set appropriate buckets to ensure meaningful analysis, like:

http_request_duration_seconds := prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Buckets: prometheus.DefBuckets,
})

Then, I’d visualize percentiles in Grafana.

5. How do you scrape Kubernetes pod metrics?

I’d use kubernetes_sd_configs in prometheus.yml to enable automatic discovery of pods.

scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod

Then, I’d label services properly and use relabel_configs to filter the right workloads.

6. A counter metric is decreasing over time. What could be the reason?

Counters should only increase, so if I see a drop, it's likely due to a process restart. I’d check if the application restarts frequently using the up metric or container logs. Query functions like rate() and increase() already compensate for counter resets, so dashboards usually recover on their own; if raw continuity matters, I’d persist counters externally to avoid resets.

7. You need to monitor a database but it doesn’t expose Prometheus metrics. What’s your approach?

I’d use a Prometheus exporter. For example, if it’s PostgreSQL, I’d deploy postgres_exporter:

docker run -d -p 9187:9187 wrouesnel/postgres_exporter

Then, I’d configure Prometheus to scrape localhost:9187/metrics and create alerts on slow queries.

8. How do you monitor container metrics in a Kubernetes cluster?

I’d deploy cAdvisor to collect per-container CPU, memory, and disk usage:

docker run -d --name=cadvisor -p 8080:8080 google/cadvisor

For a Kubernetes setup, I’d ensure Prometheus is scraping cAdvisor's endpoint.

9. How do you check if Prometheus is scraping a target?

I’d visit Prometheus’s web UI (http://prometheus-server:9090/targets) to see if the target is marked as "up". If it's missing, I’d check prometheus.yml. If it's listed but down, I’d test connectivity with curl or telnet and inspect logs.

10. A developer asks why Prometheus doesn’t support push-based metrics. How do
you respond?

I’d explain that Prometheus follows a pull-based model for reliability, scalability, and
consistency. However, for short-lived jobs (e.g., batch processing), I’d recommend using
Pushgateway, which buffers metrics until scraped.
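
For the Pushgateway case, pushing is just an HTTP request against /metrics/job/<job_name>; a minimal sketch (the host name and metric are placeholders):

echo "batch_job_last_success_timestamp_seconds $(date +%s)" | curl --data-binary @- http://pushgateway:9091/metrics/job/nightly_batch

Prometheus then scrapes the Pushgateway like any other target, and I'd alert on the pushed timestamp going stale.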

11. You need to drop certain metrics before they’re stored. How do you do it?

I’d use metric_relabel_configs in prometheus.yml:

metric_relabel_configs:
  - source_labels: [__name__]
    regex: "http_requests_total"
    action: drop

This ensures http_requests_total is never stored, reducing storage consumption.

12. How do you monitor if an external API is available using Prometheus?

I’d deploy the Blackbox Exporter and configure an HTTP probe:

modules:
  http_2xx:
    prober: http
    timeout: 5s

Then, I’d create an alert if the API isn’t responding.
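
The probe itself is driven from the Prometheus side; a sketch of the matching scrape job, assuming the exporter runs at blackbox-exporter:9115 and the probed URL is a placeholder:

scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ['https://api.example.com/health']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

The alert can then fire on probe_success == 0 holding for a few minutes.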

13. Your team complains that Prometheus is using too much disk space. What do
you do?

I’d check retention settings (--storage.tsdb.retention.time) and analyze which metrics take up space.
I’d optimize by:

- Reducing metric retention if long history isn’t needed.
- Dropping high-cardinality labels.
- Using histograms instead of summaries.

14. How do you create an alert if a service is down?

I’d use Alertmanager with a simple alert rule in alert.rules:

groups:
  - name: instance_down
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"

Then, I’d configure Alertmanager to send notifications to Slack or email.

15. What’s the difference between rate() and increase() in PromQL?

I’d explain:

- rate(metric[5m]) calculates per-second rate over 5 minutes, smoothing short-term fluctuations.
- increase(metric[5m]) shows total increase over 5 minutes, useful for cumulative counters.

For example, if I need requests per second, I’d use:

rate(http_requests_total[5m])

If I want total requests in the last hour, I’d use:

increase(http_requests_total[1h])

Final Thoughts

These 15 real-world scenarios cover common troubleshooting, optimization, and implementation challenges you’d face using Prometheus in production. If you want me to tailor more questions to a specific role or environment, let me know!

Here are 15 work scenario-based PromQL interview questions with answers in first-person:
1. How would you find the total HTTP requests handled by all instances over the
past 5 minutes?

Scenario: My team needs to monitor overall web traffic.

Answer:
I would use increase(http_requests_total[5m]) to calculate the total number of requests in the last 5
minutes. If I need to group them by status code or method, I'd modify it like this:

sum by(status_code) (increase(http_requests_total[5m]))

2. How do you calculate CPU usage as a percentage of total CPU time?

Scenario: We need an alert if CPU usage exceeds 80%.

Answer:
I’d use the rate() function to calculate the per-second CPU usage and normalize it against total
CPU cores:

100 * (1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

This gives me the percentage of CPU actively in use.

3. How can you detect if a counter reset occurred in a time series?

Scenario: We suspect that network byte counters reset after a system reboot.

Answer:
I’d use resets() to detect if the counter reset within a given period:

resets(node_network_transmit_bytes_total[1h])

If the result is greater than zero, it means the counter has reset at least once in the last hour.

4. How do you find the top 3 services consuming the most memory?

Scenario: We need to optimize memory usage across microservices.


Answer:
I’d use topk(3, metric_name), assuming container_memory_usage_bytes is the metric:

topk(3, sum by(container) (container_memory_usage_bytes))

This shows the three services using the most memory.

5. How can you measure the rate of failed HTTP requests over time?

Scenario: Our SRE team wants to track error rates to detect anomalies.

Answer:
I’d filter only failed requests (status_code=~"5..") and apply rate():

rate(http_requests_total{status_code=~"5.."}[5m])

This gives me the per-second failure rate over the last 5 minutes.

6. How do you determine the average request duration per service?

Scenario: The product team wants to ensure API latency stays low.

Answer:
I’d divide total request duration by total requests, grouped by service:

sum by(service) (rate(http_request_duration_seconds_sum[5m])) /
sum by(service) (rate(http_request_duration_seconds_count[5m]))

This provides the average request duration per service.

7. How can you forecast future memory usage trends?

Scenario: We need to plan for upcoming scaling needs.

Answer:
I’d use predict_linear() to estimate memory usage in the next hour:

predict_linear(node_memory_Active_bytes[1h], 3600)

This predicts memory usage based on the last hour’s trend.

8. How do you check if a service is experiencing high disk write latency?

Scenario: A database is experiencing slow performance.

Answer:
I’d check the disk I/O latency using rate() on node_disk_write_time_seconds_total:

avg by(instance) (rate(node_disk_write_time_seconds_total[5m]))

A high value here would indicate high write latency.

9. How do you compare today’s traffic to yesterday’s traffic?

Scenario: Marketing wants to compare daily user activity.

Answer:
I’d use offset 1d to compare today’s traffic with yesterday’s:

sum(increase(http_requests_total[1h])) - sum(increase(http_requests_total[1h] offset 1d))

A positive result means today’s traffic increased.

10. How do you detect sudden spikes in request rates?

Scenario: We need to monitor for potential DDoS attacks.

Answer:
Since http_requests_total is a counter, I’d use increase() (delta() is intended for gauges and doesn’t handle counter resets) to detect sudden jumps over a short time frame:

increase(http_requests_total[2m]) > 1000

This flags cases where requests jump by more than 1000 within 2 minutes.

11. How can you calculate the 99th percentile request latency?
Scenario: We want to ensure that 99% of our requests complete within an SLA.

Answer:
Using histogram_quantile(0.99, metric), I’d extract the 99th percentile from request duration
histograms:

histogram_quantile(0.99, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))

This tells me the latency threshold that 99% of requests stay under.

12. How do you measure network bandwidth usage per instance?

Scenario: The team wants to track network bottlenecks.

Answer:
I’d use rate() on node_network_transmit_bytes_total and node_network_receive_bytes_total:

sum by(instance) (rate(node_network_transmit_bytes_total[5m]) + rate(node_network_receive_bytes_total[5m]))

This shows total bandwidth per instance.

13. How do you check if a metric is missing for a service?

Scenario: Alerts stopped firing because a service metric disappeared.

Answer:
I’d use absent() to check if a metric is missing:

absent(http_requests_total{job="api-service"})

If this returns 1, it means no data exists for that metric.

14. How do you monitor the number of active users on a system?

Scenario: Product wants real-time active user counts.

Answer:
Assuming each active session is exported as its own active_sessions series, I’d use count():

count(active_sessions)

(If active_sessions is instead a per-instance gauge holding a session count, sum(active_sessions) is the right aggregation.)

This gives me the number of currently active users.

15. How do you optimize a slow dashboard query?

Scenario: Our Grafana dashboard loads too slowly.

Answer:
I’d optimize the query by:

- Using recording rules for frequently queried metrics.
- Reducing label cardinality (too many unique labels slow queries).
- Avoiding long-range queries like [30d], instead using offsets.
- Moving heavy calculations (including subqueries) into recording rules so they are precomputed rather than evaluated on every dashboard refresh.

Example of using recording rules:

groups:
  - name: dashboard_queries
    rules:
      - record: job:cpu_usage:rate5m
        expr: avg by (job) (rate(node_cpu_seconds_total[5m]))

This precomputes the data and speeds up queries.

Final Thoughts

Mastering PromQL means understanding when to use which function and how to optimize
queries. In an interview, I’d explain not just the syntax but the reasoning behind each query.

Would you like more in-depth examples on any of these?

Here are 15 real-world work scenario questions related to Prometheus Alertmanager, along
with first-person answers that demonstrate deep understanding and problem-solving skills.

1. A critical alert is firing, but no one is receiving notifications. What do you do?
Answer:
The first thing I do is check the Alertmanager logs (alertmanager.log) to see if there are any errors.
Then, I verify if the alert is correctly firing in Prometheus using:

curl -X GET http://localhost:9090/api/v1/alerts

If the alert exists in Prometheus but isn't reaching Alertmanager, I check my prometheus.yml
configuration for the alerting section. If it is reaching Alertmanager but not notifying, I inspect
alertmanager.yml for receiver configurations and test it manually with:

curl -X POST http://localhost:9093/api/v2/alerts -d '[{"labels": {"alertname": "TestAlert"}}]'

If notifications still don’t work, I verify my SMTP, Slack, or PagerDuty API keys and ensure the
notification service is reachable.

2. How do you handle alert fatigue in a production environment?

Answer:
I implement alert deduplication, grouping, inhibition, and silences in Alertmanager.

- Grouping: I ensure similar alerts (e.g., from the same service) are grouped together to reduce noise.
- Inhibition: I configure inhibition rules so that a critical alert suppresses warning alerts from the same service.
- Silences: I encourage teams to set silences during maintenance windows.
- Fine-Tuning Alerting Rules: I avoid overly broad conditions and use for: clauses to prevent flapping alerts.

3. You need to set up Alertmanager HA. How do you do it?

Answer:
I deploy multiple Alertmanager instances and join them into a single cluster. Clustering is configured with command-line flags (not in alertmanager.yml), so I start each instance with its peers listed:

./alertmanager --config.file=alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-1:9094 \
  --cluster.peer=alertmanager-2:9094

I verify HA status using:

curl http://localhost:9093/api/v2/status

If one instance goes down, the others continue handling alerts.

4. How do you configure Alertmanager to send alerts to both Slack and email?

Answer:
I configure multiple receivers in alertmanager.yml:

receivers:
  - name: 'email'
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.example.com:587'

  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        api_url: 'https://hooks.slack.com/services/XXXX/XXXX/XXXX'

route:
  receiver: 'email'
  routes:
    - match:
        severity: critical
      receiver: 'slack'

This ensures critical alerts go to Slack, while all alerts are emailed.

5. A developer reports that an alert was delayed by 10 minutes. How do you debug?

Answer:
First, I check if the alert rule has a for: condition in alert.rules.yml causing a delay:

- alert: HighMemoryUsage
  expr: node_memory_Active_bytes / node_memory_MemTotal_bytes > 0.9
  for: 10m  # Must be above threshold for 10 minutes before firing

Then, I check the Alertmanager logs for any bottlenecks and test notifications manually to confirm there are no external delays.

6. How do you temporarily silence an alert for maintenance?

Answer:
I create a silence using the Alertmanager UI (http://localhost:9093) or via API:

curl -X POST -H "Content-Type: application/json" -d '{
  "matchers": [{"name": "alertname", "value": "InstanceDown", "isRegex": false}],
  "startsAt": "2025-03-16T12:00:00Z",
  "endsAt": "2025-03-16T18:00:00Z",
  "createdBy": "admin",
  "comment": "Scheduled maintenance"
}' http://localhost:9093/api/v2/silences

This prevents unnecessary alert noise.

7. How do you route alerts based on severity?

Answer:
I define routing rules in alertmanager.yml:

route:
  receiver: 'email'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack'

Critical alerts go to PagerDuty, warnings go to Slack, and others default to email.

8. An alert is firing repeatedly even though the issue is resolved. Why?

Answer:
I check if the alert’s for: condition is too short. If so, I increase it.
Next, I verify that Prometheus is correctly resolving alerts:
curl -X GET http://localhost:9090/api/v1/alerts

If Alertmanager isn't resolving the alert, I check the resolve_timeout in alertmanager.yml:

global:
  resolve_timeout: 5m

If resolve_timeout is too long, I reduce it.

9. How do you send custom webhook notifications?

Answer:
I configure a webhook receiver in alertmanager.yml:

receivers:
  - name: 'webhook'
    webhook_configs:
      - url: 'http://webhook-server.com/alerts'

I test it by sending a manual alert:

curl -X POST http://localhost:9093/api/v2/alerts -d '[{"labels": {"alertname": "TestAlert"}}]'

And I monitor incoming requests on the webhook server.

10. How do you debug an Alertmanager notification failure?

Answer:

1. Check Alertmanager logs: docker logs alertmanager
2. Test a notification manually: curl -X POST http://localhost:9093/api/v2/alerts -d '[{"labels": {"alertname": "TestAlert"}}]'
3. Verify external service (SMTP, Slack, etc.) credentials.
4. Check for firewall/network issues.

11. How do you handle duplicate alerts?

Answer:
I configure deduplication using group_by in alertmanager.yml:

route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h

This ensures duplicate alerts are grouped and not spammed.

12. How do you prioritize alerts?

Answer:
I route critical alerts to PagerDuty and low-priority alerts to email:

route:
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'email'

13. How do you scale Alertmanager?

Answer:
I run multiple Alertmanager instances with a load balancer and clustering.

14. How do you ensure alerts are actionable?

Answer:
I include clear descriptions, remediation steps, and links to dashboards in annotations.
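
A sketch of the kind of annotations I mean, on any alerting rule (the URLs are placeholders for our own runbook and dashboard links):

annotations:
  summary: "High 5xx rate on {{ $labels.service }}"
  description: "Error rate has been above 5% for 10 minutes. Check the latest deploy and upstream dependencies first."
  runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
  dashboard: "https://grafana.example.com/d/service-overview"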

15. How do you test new alert rules?

Answer:
I use Prometheus’ "Alerts" UI or run:

promtool check rules alert.rules.yml

To manually trigger an alert, I use:

curl -X POST http://localhost:9093/api/v2/alerts -d '[{"labels": {"alertname": "Test"}}]'

With these answers, you’re 100% prepared for any Alertmanager interview question!

Great! Here are 15 real-world work scenario questions along with first-person responses that
demonstrate deep knowledge of Scaling and Performance Optimization in Prometheus.

1. Our Prometheus instance is running out of memory frequently. How do you troubleshoot and fix this?

First, I’d check the number of active time series using the count({__name__=~".+"}) query. If it’s too
high, I’d investigate label cardinality issues using count by (__name__)({__name__=~".+"}). If I find
high-cardinality labels like pod_name, I’d drop or relabel them in prometheus.yml. Next, I’d reduce
retention time using --storage.tsdb.retention.time=30d and enable WAL compression with
--storage.tsdb.wal-compression. If needed, I’d offload old data to Thanos or a remote storage
backend.

2. Our Prometheus queries are slow. How do you optimize them?

I’d first check which queries are slow using the query log (query_log_file) and the query statistics in the Prometheus UI. If a query involves expensive operations like
rate() over a large time window, I’d create recording rules to precompute and store the results.
I’d also ensure that I’m using group by effectively to reduce cardinality. If the slow queries
involve historical data, I’d integrate Thanos or Cortex to separate long-term storage from real-
time queries.

3. We need to set up Prometheus for multiple teams while maintaining isolation. How do you do this?

I’d deploy multiple Prometheus instances, each dedicated to a specific team, and use
Kubernetes namespaces to enforce isolation. For querying across teams, I’d use Thanos Querier
with proper RBAC to control access. Alternatively, if we need true multi-tenancy, I’d opt for
Cortex or Grafana Mimir, which natively support multi-tenancy.
4. Prometheus is scraping too many targets, causing high CPU usage. What’s your
approach?

I’d optimize scrape intervals by increasing scrape_interval for non-critical jobs from 15s to 30s or
1m. I’d also use relabeling rules to drop unnecessary metrics at the source. If we’re in
Kubernetes, I’d use podMonitor selectors to limit what gets scraped. If load is still high, I’d shard
Prometheus instances based on job labels.

5. How do you ensure high availability in Prometheus?

I’d deploy two or more Prometheus instances in an Active-Active setup, where each instance
scrapes the same targets. I’d use Thanos Querier or load balancing via Nginx to aggregate
queries. For alerting, I’d configure Alertmanager in an HA setup with consistent hashing to
avoid duplicate alerts.

6. Prometheus is not scraping metrics from a target. How do you debug this?

First, I’d check if the target is reachable using curl <target>:<port>/metrics. If it’s down, I’d
investigate the service itself. If it’s up but missing in Prometheus, I’d check the Service
Discovery page (/targets) to see if the target is listed but failing. If it’s missing entirely, I’d check
the scrape_configs in prometheus.yml. If relabeling is causing issues, I’d validate using promtool check
config.

7. Our alerts are firing multiple times from different Prometheus instances. How do
you prevent duplicates?

I’d deploy Alertmanager in HA mode and ensure Prometheus instances send alerts to all
Alertmanagers. Then, I’d configure --cluster.peer in Alertmanager so instances can deduplicate
alerts. If using Thanos Ruler for alert evaluation, I’d enable --query.replica-label=prometheus_replica
to deduplicate queries.

8. We need long-term storage for Prometheus data. What solutions do you recommend?

If we need object storage (S3, GCS, MinIO), I’d use Thanos because it integrates well with
Prometheus. If we need distributed storage with horizontal scaling, I’d choose Cortex or
Mimir, as they support true multi-tenancy and long-term retention. If we have high write
volumes, I’d ensure we use remote_write efficiently to avoid excessive load.
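
A minimal remote_write sketch for that last point, assuming a VictoriaMetrics endpoint (the URL and tuning values are illustrative):

remote_write:
  - url: "http://victoriametrics:8428/api/v1/write"
    queue_config:
      capacity: 20000
      max_samples_per_send: 5000

Keeping the queue settings sane avoids remote_write itself becoming a source of memory pressure on the Prometheus server.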

9. Prometheus keeps restarting due to OOM (Out of Memory) errors. What do you
do?

I’d first check memory consumption using kubectl top pod if running in Kubernetes. If it’s
consuming too much memory, I’d limit retention ( --storage.tsdb.retention.time=15d) and reduce
time series cardinality by relabeling. I’d also check for expensive queries running frequently in
Grafana and optimize them. If needed, I’d increase resources in Kubernetes or offload old data
to Thanos.

10. We need to scale Prometheus for 10,000+ targets. How do you architect it?

I’d shard Prometheus instances using a hashmod strategy based on instance labels, ensuring each
instance scrapes only a subset of targets. I’d deploy Thanos Querier to aggregate queries across
all instances. To handle long-term storage, I’d configure remote_write to send data to Thanos
Store or Cortex.
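
The hashmod sharding I describe is implemented with relabeling; a sketch for one of three shards (each instance keeps a different bucket):

relabel_configs:
  - source_labels: [__address__]
    modulus: 3
    target_label: __tmp_hash
    action: hashmod
  - source_labels: [__tmp_hash]
    regex: "0"        # this instance keeps bucket 0; the other shards keep 1 and 2
    action: keep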

11. Some metrics are missing from Prometheus. How do you debug?

I’d first check if the target is exposing metrics using curl <target>:<port>/metrics. If it’s there but not
in Prometheus, I’d verify /targets in the Prometheus UI. If relabeling is dropping metrics, I’d
validate the prometheus.yml configuration using promtool check config. If the issue persists, I’d check
prometheus.log for scrape errors.

12. How do you reduce Prometheus disk usage?

I’d first reduce retention time (--storage.tsdb.retention.time=30d). Then, I’d enable WAL
compression to reduce storage overhead. I’d drop high-cardinality metrics using relabeling and
ensure only necessary time series are stored. If needed, I’d configure Thanos, Cortex, or Mimir
for long-term storage.
13. Our PromQL query is timing out. How do you optimize it?

I’d check whether the query is scanning too much data by looking at its execution stats in the Prometheus UI and the query log. If it’s aggregating a large
dataset, I’d create recording rules to store precomputed values. If it involves historical data, I’d
ensure it queries Thanos Store instead of Prometheus directly. If using large labels, I’d reduce
cardinality.

14. How do you handle a scenario where one Prometheus instance goes down?

If we’re using Active-Active HA, the second instance continues scraping, and Thanos Querier
ensures continuity. If Active-Passive, I’d have a failover mechanism in place using Kubernetes
PodDisruptionBudgets or an external load balancer like HAProxy.

15. We need to federate multiple Prometheus instances. How do you do it?

I’d configure federation by having a global Prometheus scrape key metrics from regional
Prometheus instances using the /federate endpoint. I’d define match[] to ensure only necessary
metrics are scraped to avoid excessive load. If needed, I’d use Thanos Querier for cross-instance
queries instead of federation.
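
A sketch of the federation job on the global Prometheus, assuming the regional host names are placeholders and that only recording-rule series (job:*) are pulled to keep load down:

scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 'prometheus-us-east:9090'
          - 'prometheus-eu-west:9090'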

This set of 15 work scenarios gives you real-world problem-solving experience that aligns with
interview-style questions. Would you like more hands-on exercises or mock whiteboard
scenarios?

15 Grafana Work Scenario Questions & Answers

Here are 15 real-world Grafana scenarios, with answers from a first-person perspective as if I'm
explaining in an interview.

1. How would you troubleshoot a Grafana dashboard that is not displaying any
data?

First, I check if the data source is correctly configured and reachable by navigating to
Configuration > Data Sources and clicking "Test." If it's failing, I verify the endpoint and
authentication details. Next, I inspect the queries in the panel settings to ensure they are
fetching the correct metrics. I also check if filters or time ranges are too restrictive. If it's a
Prometheus issue, I run the query directly in Prometheus to confirm if the data exists. Finally, I
look for any errors in the Grafana logs (/var/log/grafana/grafana.log or docker logs grafana).

2. How do you set up an alert for CPU usage exceeding 80%?

I go to my dashboard and create a new panel with the query:

100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

I set the visualization to Gauge or Graph. Then, I navigate to the Alert tab, click "Create Alert,"
set the threshold to 80, and define evaluation intervals (e.g., every 30s for 5 minutes). Lastly, I
configure a notification channel (Slack, email, PagerDuty) in Alerting > Notification Channels.

3. A team requests a dashboard with real-time database query performance metrics. How would you approach this?

First, I discuss their requirements—do they need slow queries, query count, latency, or error
rates? If the database supports it (e.g., MySQL, PostgreSQL), I connect Grafana to the database
and use queries like:

SELECT query, total_time, calls FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10;

I create a table panel for slow queries and a time-series panel for query execution times. If
real-time updates are needed, I set a refresh rate of 5-10 seconds.

4. How would you share a Grafana dashboard with a stakeholder who doesn’t have
access to Grafana?

I have a few options:

- Use the "Share" button to generate a public link (if public sharing is enabled).
- Export the dashboard as a PDF report using Enterprise Grafana.
- Export the dashboard as JSON so they can import it into another Grafana instance.
- Set up automatic email reports if the stakeholder needs regular updates.

5. A developer needs dynamic filtering in a dashboard. How do you add variables?


I go to Dashboard Settings > Variables and click "Add Variable." If using Prometheus, I set the
query to:

label_values(node_cpu_seconds_total, instance)

This dynamically populates a dropdown with available instances. I then update my panels to
use $instance instead of hardcoding values. This way, users can filter data dynamically.

6. How do you handle permissions in Grafana for multiple teams?

I use Grafana Organizations to separate access between teams. Within an organization, I assign
roles:

- Admin: Full access to dashboards and settings.
- Editor: Can create and modify dashboards but not settings.
- Viewer: Read-only access.

For finer control, I use Folder Permissions to restrict certain dashboards to specific users or
teams.

7. How would you reduce dashboard load times?

- Optimize queries by reducing the data range or aggregation window (e.g., using rate() instead of raw metrics).
- Set a longer refresh interval (e.g., 30s instead of 5s).
- Reduce the number of panels per dashboard.
- Enable caching in the data source (if supported).
- For Prometheus, use recording rules to precompute heavy queries.

8. How would you migrate dashboards from one Grafana instance to another?

I go to Dashboard Settings > JSON Model and export the JSON. Then, in the new instance, I use
Import Dashboard and paste the JSON. If migrating multiple dashboards, I use the Grafana API:

curl -X GET http://localhost:3000/api/dashboards/uid/<dashboard-uid> -H "Authorization: Bearer <API_TOKEN>"

Then, I post it to the new instance.
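
The corresponding import call is a POST against the dashboards API; a sketch, assuming an admin-scoped token and with the exported JSON substituted in:

curl -X POST http://new-grafana:3000/api/dashboards/db \
  -H "Authorization: Bearer <API_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"dashboard": { ...exported JSON... }, "overwrite": true}'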


9. How do you handle missing data points in a time-series graph?

If gaps appear, I check the query interval—too small an interval may return no data. I also use
"Null value handling" in the panel settings, setting it to "Connected" or "Last Value" to
interpolate missing points. If it's a Prometheus query, I use step= in the query options to adjust
granularity.

10. How would you integrate Grafana with an authentication system like LDAP or
OAuth?

I edit grafana.ini to enable authentication. For LDAP, I configure ldap.toml:

[[servers]]
host = "ldap.example.com"
port = 389
bind_dn = "cn=admin,dc=example,dc=com"
bind_password = "secret"

For OAuth (Google, GitHub, etc.), I enable it in grafana.ini and provide the client ID/secret.
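
A sketch of what the grafana.ini block looks like for GitHub OAuth, with placeholder credentials and organization:

[auth.github]
enabled = true
client_id = YOUR_GITHUB_APP_CLIENT_ID
client_secret = YOUR_GITHUB_APP_CLIENT_SECRET
scopes = user:email,read:org
auth_url = https://github.com/login/oauth/authorize
token_url = https://github.com/login/oauth/access_token
api_url = https://api.github.com/user
allowed_organizations = my-org

After editing, Grafana needs a restart for the auth changes to take effect.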

11. A service is reporting incorrect metrics in Grafana. How do you debug this?

First, I run the same query in the data source directly (e.g., PromQL in Prometheus). If the data
looks wrong, I check the exporter’s logs (e.g., node_exporter or custom exporters). I also verify
the scrape configuration in Prometheus and check for missing labels or misconfigurations.

12. How would you integrate logs into a Grafana dashboard?

I use Grafana Loki as the data source. I configure Loki to ingest logs and use LogQL to query
them. Then, I add a Logs Panel in Grafana with a query like:

{job="nginx"} |= "error"

This filters logs for errors.


13. How do you monitor network traffic with Grafana?

I integrate Grafana with NetFlow, SNMP, or Prometheus Node Exporter. A common PromQL
query for bandwidth usage is:

rate(node_network_receive_bytes_total[5m])

I visualize it using a Graph or Gauge panel.

14. How would you back up Grafana dashboards?

I regularly export dashboards as JSON and store them in Git. I also back up the Grafana
database, which stores dashboards:

mysqldump -u grafana -p grafana_db > grafana_backup.sql

For SQLite (default for Grafana OSS), I back up grafana.db.

15. How do you monitor Grafana itself?

I enable Grafana's internal metrics via the Prometheus data source and use the built-in
dashboard. A key query is:

rate(grafana_http_request_duration_seconds_sum[5m])

which shows API request latency. I also monitor disk usage, memory consumption, and alert
failures.

Final Thoughts

These are real-world Grafana problems I’ve encountered and solved. If I were in an interview,
I’d explain these in detail while also demonstrating solutions hands-on in Grafana itself!

Here are 15 work scenario questions with first-person answers:


1. A stakeholder complains that a Grafana dashboard is slow. How do you
troubleshoot it?

I first check the query execution time using the Query Inspector to identify slow queries. If it's
a Prometheus query, I optimize it by reducing time ranges, adding rate() instead of increase(), and
avoiding or conditions. If it's a database query, I use indexing and reduce GROUP BY. I also check
panels using multiple queries and optimize the data source settings, such as increasing the
cache refresh rate.

2. How do you create a dynamic dashboard for multiple environments?

I use templating with variables. I define a query variable like $environment that fetches values
dynamically (e.g., label_values(instance, environment)). Then, I use $environment in all queries so users
can switch environments dynamically. I also use chained variables so that selecting an
environment updates related variables, like pod names.

3. How do you set up alerting for high CPU usage in a Kubernetes cluster?

I create an alert rule in Grafana using a Prometheus query like:

100 * avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) < 20

I set thresholds (warning at 40%, critical at 80%) and configure a notification channel (Slack,
PagerDuty). I also enable for duration, so alerts trigger only if the issue persists for at least 5
minutes, preventing false alarms.

4. How do you integrate logs into a Grafana dashboard?

I use Grafana Loki as a log source. I create a LogQL query like:

{job="nginx"} |= "error"

I then add a logs panel and link it to metrics dashboards using dashboard links. I also enable
logs-to-trace correlation, allowing users to jump from logs to related traces in Tempo.

5. How do you visualize failed API requests over time?


I create a time-series panel and use a PromQL query like:

sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)

To highlight anomalies, I enable thresholds (e.g., turn the graph red when failures exceed a
limit).

6. How do you handle a request to combine data from different sources?

I use Transformations. For example, if one panel shows MySQL order counts and another
shows Prometheus API requests, I apply Merge tables and Add field from calculation to create
a unified metric.

7. How do you create an annotation for deployments?

I set up an annotation query in MySQL like:

SELECT timestamp AS time, 'Deployment' AS text FROM deployments WHERE $__timeFilter(timestamp)

This automatically marks deployments on the dashboard, helping teams correlate performance
drops with releases.

8. How do you troubleshoot missing logs in Loki?

I check the Loki ingestion pipeline to ensure logs are being received. I run:

{job="nginx"}

If nothing appears, I check Promtail logs for ingestion errors. If logs are ingested but not shown,
I verify log labels in the query and look for rate-limited queries.

9. How do you set up a dashboard for a microservices architecture?

I create panels for key services with metrics like request rate, error rate, and latency. I enable
service discovery using variables so users can select a service dynamically. I also integrate
Tempo for tracing to visualize service dependencies.
10. How do you identify why a request is slow?

I start with metrics in Grafana (e.g., latency trends). If an API has high response times, I check
logs in Loki for errors. If the issue persists, I use tracing in Tempo to identify slow spans and
bottlenecks (e.g., slow database queries or external calls).

11. How do you filter logs dynamically in a dashboard?

I create a templating variable like $logLevel, fetching values dynamically using:

label_values({job="nginx"}, level)

Then, I modify my LogQL query:

{job="nginx", level="$logLevel"}

This allows users to switch log levels dynamically.

12. How do you set up a Grafana alert that prevents false positives?

I use for duration to trigger alerts only if an issue persists. For example, instead of alerting
immediately when CPU > 80%, I require it to be over 80% for at least 5 minutes. I also use
multiple conditions (e.g., CPU and memory must be high) to reduce noise.

13. How do you debug a failing alert?

I check the alert rule evaluation in the Grafana UI to see if it's returning the expected values. If
it's a PromQL alert, I run the query manually in Explore mode to verify correctness. If it's a
notification issue, I test the notification channel and check logs for errors.

14. How do you create a comparative analysis dashboard?


I use multi-dimensional queries (e.g., compare API latency across services using by(service)). I
also add transformations like Group by and Merge tables to compare data from different
sources. Additionally, I use time shift to compare today’s performance against historical trends.

15. How do you handle a dashboard request for non-technical users?

I design user-friendly dashboards by:

- Using simplified queries to reduce complexity.
- Adding annotations to explain key metrics.
- Using visualizations like bar charts instead of raw tables.
- Providing predefined filters so users can easily switch between services and environments.

Would you like more specific interview questions tailored to a particular data source, such as
Prometheus, Loki, or MySQL?

15 Work Scenario Questions & Answers for Prometheus & Grafana in Kubernetes

1. How would you set up Prometheus and Grafana in a new Kubernetes cluster?

I would use Helm to deploy the kube-prometheus-stack, which includes Prometheus, Grafana, and
Alertmanager. First, I’d add the Helm repo and update it:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Then, I’d install the stack in a dedicated monitoring namespace:

helm install prometheus-stack prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace

After verifying the pods are running, I’d port-forward Grafana and access it via
http://localhost:3000/, then configure Prometheus as a data source.
2. What would you do if Prometheus stopped collecting metrics from a Kubernetes
service?

First, I’d check if the service endpoint is reachable by running:

kubectl get endpoints -n monitoring

Then, I’d verify if Prometheus is correctly discovering the service:

kubectl get servicemonitors -n monitoring

If the service is missing, I’d inspect Prometheus’s configuration at http://localhost:9090/config. If there’s a misconfiguration, I’d update the ServiceMonitor and reload Prometheus.
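
A sketch of the ServiceMonitor I’d add for a service exposing metrics on a port named http-metrics, assuming the stack was installed as a Helm release called prometheus-stack (whose release label the operator selects on by default):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service
  namespace: monitoring
  labels:
    release: prometheus-stack
spec:
  selector:
    matchLabels:
      app: my-service
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: http-metrics
      interval: 30s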

3. How do you troubleshoot missing data in Grafana?

I’d first confirm that Prometheus is collecting the data by querying it directly in the
Prometheus UI. If the metrics are missing there, I’d check:

- If the exporter (e.g., Node Exporter, cAdvisor) is running.
- If the target is up in http://localhost:9090/targets.

If the data is available in Prometheus but missing in Grafana, I’d check:

- If the correct Prometheus data source is selected.
- If query expressions in the panel are correct.
- If the time range includes relevant data.

4. A pod is using too much memory, how would you alert on that in Prometheus?

I’d create an alerting rule in Prometheus like this:

groups:
  - name: high-memory-usage
    rules:
      - alert: HighMemoryUsage
        expr: container_memory_usage_bytes{container="my-app"} > 500000000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"

Then, I’d configure Alertmanager to notify Slack or PagerDuty.


5. How do you expose Prometheus and Grafana externally?

I’d use an Ingress or a LoadBalancer service. For Ingress, I’d define a rule like this:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-ingress
  namespace: monitoring
spec:
  rules:
    - host: prometheus.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-stack-prometheus
                port:
                  number: 9090

Then, I’d set up DNS for prometheus.example.com.

6. How do you monitor Kubernetes nodes with Prometheus?

I’d deploy the Node Exporter as a DaemonSet to collect node-level metrics:

kubectl apply -f https://raw.githubusercontent.com/prometheus/node_exporter/main/kubernetes/node-exporter-daemonset.yaml

Then, I’d ensure Prometheus scrapes node-exporter by adding it as a target.

7. How do you monitor Kubernetes pods and containers?

I’d use cAdvisor, which is built into Kubelet, to expose container metrics. Prometheus scrapes it
via:

- job_name: 'kubernetes-cadvisor'
  static_configs:
    - targets: ['kubelet:10250']

For application-specific monitoring, I’d instrument the app to expose /metrics and configure a
ServiceMonitor in Prometheus.

8. What do you do if Prometheus is consuming too much memory?

I’d check time-series cardinality by running:

promtool tsdb analyze /prometheus

If there are too many unique labels, I’d optimize metric collection by:

- Reducing label combinations.
- Increasing scrape intervals.
- Using recording rules to pre-aggregate metrics.

9. How do you scale Prometheus for large Kubernetes clusters?

I’d use Thanos or Cortex for scalable, distributed storage. If using Thanos, I’d deploy:

- Thanos Sidecar to connect Prometheus to object storage.
- Thanos Query for a federated view.
- Thanos Store to store long-term metrics in S3 or GCS.

10. How do you create a custom Grafana dashboard for Kubernetes metrics?

I’d log into Grafana, go to Dashboards → New Dashboard → Add a new panel.
Then, I’d:

- Select Prometheus as the data source.
- Use a query like rate(http_requests_total[5m]) to visualize request rates.
- Set thresholds and customize the panel.

11. How do you automate Grafana dashboard deployment?

I’d use a ConfigMap to store the JSON definition:


apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  namespace: monitoring
data:
  kubernetes-dashboard.json: |
    { "title": "Kubernetes Monitoring", "panels": [...] }

Then, I’d mount it in the Grafana container.

12. How do you monitor custom application metrics in Kubernetes?

I’d expose a /metrics HTTP endpoint in my app using a Prometheus client library.
For example, in Python using prometheus_client:

from prometheus_client import start_http_server, Counter

requests_total = Counter('http_requests_total', 'Total HTTP requests')

start_http_server(8000)

def handle_request():
    requests_total.inc()

Then, I’d configure a ServiceMonitor for Prometheus.

13. What do you do if an alert is too noisy?

I’d check if the threshold is too sensitive and adjust it. For example, instead of triggering immediately, I’d use:

expr: http_requests_total > 1000
for: 10m

I’d also use Alertmanager’s inhibition rules to suppress redundant alerts.

14. How do you integrate Prometheus alerts with Slack?

I’d configure Alertmanager with a Slack webhook:

receivers:
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        api_url: 'https://hooks.slack.com/services/...'

Then, I’d create a route to send critical alerts to Slack.

15. How do you ensure Prometheus and Grafana survive pod restarts?

For Prometheus, I’d use a Persistent Volume:

volumeMounts:
  - name: prometheus-storage
    mountPath: /prometheus
volumes:
  - name: prometheus-storage
    persistentVolumeClaim:
      claimName: prometheus-pvc

For Grafana, I’d store dashboards in a ConfigMap and use a PVC for storage.

These scenarios should prepare you for any real-world Kubernetes monitoring challenges! Let
me know if you need deeper explanations on any.

Great! Here are 15 real-world work scenario questions, along with 1st-person answers to help
you sound confident in an interview.

1. You notice that your application’s latency has increased significantly. How do
you troubleshoot it?

Answer:
First, I check my Prometheus metrics for response time trends and RED metrics (Requests,
Errors, Duration). If latency is high, I look at CPU, memory, and request queue saturation using
system metrics. Next, I correlate logs using trace IDs to pinpoint slow database queries or
external dependencies. If needed, I enable distributed tracing (OpenTelemetry/Jaeger) to
follow a request’s path across services. Based on findings, I optimize queries, scale resources, or
implement caching where necessary.
2. A critical alert fires, indicating high error rates. What steps do you take?

Answer:
I immediately check Prometheus alerts to see which services are affected. Then, I query logs in
Loki/Splunk using the trace_id to find common error patterns. If errors are due to bad
deployments, I check the last few commits in CI/CD and roll back if needed. If it’s an
infrastructure issue, I validate network/DNS failures or database connectivity. Throughout, I
update stakeholders with my findings and resolution timeline.

3. Your manager asks you to define an SLO for an API. How do you approach this?

Answer:
I first determine the critical SLIs, such as availability, latency, and error rate. Then, I define a
realistic SLO target, for example, 99.9% availability over 30 days with an error budget of 0.1%.
I create PromQL queries to measure compliance and set alerts if we exceed our error budget.
Finally, I present the SLO to stakeholders and adjust based on business needs.
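
As an example of the kind of PromQL I’d use for the availability SLI, assuming requests are counted in http_requests_total with a status_code label (both names are assumptions about the service’s instrumentation):

sum(rate(http_requests_total{status_code!~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))

Comparing this ratio against the 0.999 target, and the complementary error ratio against the error budget, gives the compliance dashboards and burn-rate alerts.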

4. You need to instrument a new service with Prometheus metrics. What do you
do?

Answer:
I use the appropriate Prometheus client library (e.g., prometheus-client for Python). I define key
metrics like http_requests_total, request_latency_seconds, and expose them via /metrics. I ensure I use
low-cardinality labels to avoid excessive memory usage. Once deployed, I add Prometheus
scrape configurations and create Grafana dashboards to visualize performance trends.

5. A developer complains that logs are not helpful. How do you improve logging?

Answer:
I first assess the log format—structured JSON logs with key fields like timestamp, service_name,
trace_id, and error_message are ideal. I ensure trace IDs are propagated across logs for end-to-end
tracing. If logs are too verbose or missing context, I implement log levels (INFO, DEBUG,
ERROR) and set up log correlation using Loki or Elasticsearch.
6. A security team flags that observability data might contain sensitive information.
How do you address this?

Answer:
I review log masking policies to ensure PII, passwords, and API keys are never logged. I enable
TLS encryption for telemetry data and apply RBAC (Role-Based Access Control) to restrict
access to logs and metrics. I also configure log retention policies to avoid storing data longer
than necessary.

7. A production service is slow, but no alerts have fired. How do you catch this
earlier?

Answer:
I review our SLIs and alert thresholds to ensure they reflect real-world performance. If users
notice slowness before alerts trigger, I lower the latency alert threshold. I also enable synthetic
monitoring (Blackbox Exporter) to proactively detect slowness before users report it.

8. Your company wants to migrate from a traditional monitoring tool to OpenTelemetry. How do you handle it?

Answer:
I start by instrumenting our services using OpenTelemetry SDKs while keeping our existing
monitoring intact for comparison. I set up an OpenTelemetry Collector to process and forward
telemetry to Prometheus, Jaeger, and Loki. I ensure traces, logs, and metrics are correlated,
then validate performance before fully migrating.
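
A minimal sketch of the Collector pipelines I’d start from, assuming the contrib distribution and placeholder backend addresses:

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"   # Prometheus scrapes the Collector here
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]

Logs can be added as a third pipeline once the Loki side is ready.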

9. A high-cardinality metric is causing Prometheus to crash. How do you fix it?

Answer:
I check which labels have excessive cardinality by running:

count by (label_name) (http_requests_total)

If a label like user_id is too dynamic, I replace it with a low-cardinality alternative, like
customer_tier or region. If needed, I aggregate metrics at the service level before pushing them to
Prometheus.
10. A new team is unsure how to use observability tools effectively. How do you
guide them?

Answer:
I host a hands-on session covering Prometheus querying (PromQL), Grafana dashboards,
distributed tracing (Jaeger), and log searching (Loki/Elasticsearch). I provide example SLO
dashboards and teach them how to correlate logs, traces, and metrics for troubleshooting.

11. You suspect that your database is slowing down API responses. How do you
confirm this?

Answer:
I use Prometheus metrics to check for high database query latency. I correlate slow API
requests in traces (Jaeger) to see if a specific SQL query is taking too long. If needed, I enable
slow query logging in the database and check for indexing issues or connection pool
saturation.

12. A recent deployment caused an outage. How do you prevent this in the future?

Answer:
I implement progressive deployments (canary or blue-green) with automated rollback if error
rates exceed thresholds. I add pre-release synthetic monitoring tests and ensure our SLOs
include error budget tracking to catch issues before they impact users.

13. A service is reporting 100% success rates, but users still experience failures.
What could be wrong?

Answer:
I check if our metrics only capture HTTP 500 errors but ignore timeouts and client-side
failures. I analyze distributed traces to detect requests that never complete. If necessary, I
modify the SLOs to include network failures, retries, and timeouts in error calculations.

14. Your team is over-alerted, leading to fatigue. How do you improve alerting?
Answer:
I conduct an alert review session to remove noisy or redundant alerts. I implement multi-level
alerts, where warnings notify teams but only critical issues trigger a page. I also introduce auto-
remediation scripts for common issues (e.g., auto-scaling when CPU usage spikes).

15. A customer reports an issue that occurred 3 days ago. How do you investigate?

Answer:
I use log retention policies to retrieve historical logs from Loki/Elasticsearch. I filter logs by
trace_id, user_id, or request path to reconstruct what happened. I then correlate logs with
Prometheus metrics from that time period to see if any resource spikes or failures occurred. If
necessary, I check Jaeger traces to find where the request failed.

Final Thoughts

Mastering these scenarios means you're well-prepared for real-world observability challenges.
If you want more practice, I can create more complex cases or quiz you on any of these!

Here are 15 work scenario questions with first-person responses, tailored for monitoring,
observability, and troubleshooting with Prometheus, Grafana, and related tools.

1. How would you set up a full observability stack for a new application?

"First, I’d define key metrics, logs, and traces needed for observability. Then, I’d set up
Prometheus for metrics collection, Loki for logs, and Jaeger for tracing. I’d instrument the
application using OpenTelemetry and Prometheus client libraries. Finally, I’d use Grafana for
visualization and set up Alertmanager for proactive issue detection."

2. A service is experiencing high latency. How do you troubleshoot it?

"I’d first check Grafana dashboards for latency trends and identify affected endpoints. Using
Prometheus, I’d query response time metrics (http_request_duration_seconds). If the latency
originates from a database, I’d check database_query_duration_seconds. I’d then use Jaeger tracing
to see slow spans and logs in Loki to find error messages. Based on findings, I’d optimize queries,
add caching, or scale resources."
3. Your Prometheus server is consuming too much storage. How do you fix it?

"I’d check Prometheus retention settings (storage.tsdb.retention.time) and ensure it’s not storing
unnecessary historical data. I’d analyze scrape intervals and reduce them if they’re too frequent.
If needed, I’d offload old metrics to a long-term storage solution like Thanos or Cortex."

4. How do you handle alert fatigue from too many Prometheus alerts?

"I’d review existing alerts and classify them by severity (critical vs. warning). I’d use
Alertmanager inhibition rules to suppress lower-priority alerts when a higher-priority alert is
firing. Additionally, I’d tune thresholds and use rate-based alerts instead of static ones to reduce
noise."

5. A specific metric is missing in Prometheus. How do you debug it?

"I’d first check if the metric is exposed by the application using curl on the /metrics endpoint. If it's
missing, I’d review the application’s instrumentation code. If the metric is present but not in
Prometheus, I’d check scrape_configs in prometheus.yml to ensure the target is being scraped. Logs
from Prometheus would help diagnose failures."

6. How do you monitor a service that does not expose metrics natively?

"I’d create a custom Prometheus exporter in Python or Go to scrape data from logs, APIs, or
databases. The exporter would expose the data in the Prometheus format, and I’d configure
Prometheus to scrape it. If necessary, I’d use Pushgateway for short-lived jobs."
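
A minimal sketch of such an exporter in Python, where the polled endpoint and metric name are hypothetical:

from prometheus_client import start_http_server, Gauge
import requests
import time

# Gauge updated from the application's own status endpoint
QUEUE_DEPTH = Gauge('myapp_queue_depth', 'Jobs waiting in the queue')

if __name__ == '__main__':
    start_http_server(9300)  # Prometheus scrapes :9300/metrics
    while True:
        data = requests.get('http://myapp:8080/status').json()  # hypothetical endpoint
        QUEUE_DEPTH.set(data.get('queue_depth', 0))
        time.sleep(15)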

7. A microservice keeps restarting. How do you investigate?

"I’d check Kubernetes logs using kubectl logs <pod>. If the restart is due to an OOMKill, I’d inspect
memory usage via Prometheus (container_memory_usage_bytes). If it’s crashing due to high CPU, I’d
check container_cpu_usage_seconds_total. If the service is failing health checks, I’d review its
readiness/liveness probes."
8. Your Grafana dashboard is slow. How do you optimize it?

"I’d first check query execution times in Grafana’s Query Inspector. If queries are slow, I’d
optimize them by reducing the time range or aggregating data (e.g., rate() instead of count()). I’d
also enable Prometheus recording rules to precompute frequently used queries. Finally, I’d
ensure Prometheus isn't overwhelmed with high cardinality metrics."

9. How do you handle a sudden traffic spike affecting system performance?

"I’d first check Prometheus metrics for CPU, memory, and request rate spikes. I’d scale the
affected service using Kubernetes Horizontal Pod Autoscaler (HPA) based on CPU or request
rate. If the database is the bottleneck, I’d optimize queries or introduce caching (e.g., Redis). I’d
also set up rate limits to prevent overload."

10. How do you correlate logs, metrics, and traces to investigate an issue?

"I’d use Grafana Loki for logs, Prometheus for metrics, and Jaeger for tracing. If an alert fires
for high latency, I’d check Jaeger traces to see which services contribute to the delay. I’d then
look at Loki logs for errors and correlate them with Prometheus metrics to find root causes."

11. A Prometheus scrape job is failing. How do you debug it?

"I’d check Prometheus target status at /targets to see if the endpoint is reachable. If it's down, I’d
confirm the service is running and exposing metrics. If the target is up but has errors, I’d check
for TLS/SSL issues, authentication problems, or incorrect scrape configurations in
prometheus.yml."

12. Your alerts are firing too late. How do you make them more responsive?

"I’d reduce Prometheus evaluation_interval and adjust alert thresholds to trigger sooner. If the
data is delayed due to scrape intervals, I’d make them more frequent for critical metrics. I’d also
use Grafana real-time alerts for near-instant notifications."
13. How do you measure and improve database query performance using
Prometheus?

"I’d monitor query execution times using pg_stat_statements (for PostgreSQL) or slow query logs
(for MySQL). In Prometheus, I’d collect database_query_duration_seconds and visualize it in Grafana.
If queries are slow, I’d add indexes, optimize joins, or use caching. I’d also check if connection
pooling is needed."

14. Your company wants to implement a company-wide observability strategy. How do you approach it?

"I’d start by defining SLIs (Service Level Indicators) and SLOs (Service Level Objectives) for key
services. I’d standardize observability tools across teams (e.g., Prometheus for metrics, Loki for
logs, Jaeger for tracing). I’d create dashboards, alerting policies, and a runbook. Finally, I’d set
up a centralized observability platform for a unified view."

15. You need to set up monitoring for a multi-cloud environment. How do you do it?

"I’d deploy a federated Prometheus setup, where regional Prometheus instances collect data
and forward it to a central Prometheus or Thanos for long-term storage. I’d ensure cloud-
specific exporters (AWS CloudWatch Exporter, GCP Stackdriver Exporter) are in place. Grafana
would be used as a single-pane-of-glass dashboard."

Final Thoughts

These responses demonstrate deep knowledge while being clear, structured, and actionable.
If you're preparing for an interview, practice explaining your reasoning behind each step. Want
me to refine anything further?
