Prometheus Ebook v2
Monitoring with Prometheus: With Real Examples
Credits
Chapter 1 - Shevaun Frazier, Daryna Galata
Chapter 2 - Shevaun Frazier
Chapter 3 - Aymen El Amri
Chapter 4 - Rodrigue Chadoke
Chapter 5 - Ryan Tendonge
Chapter 6 - Madhur Ahjua
Chapter 7 - Aymen El Amri
Chapter 8 - Daryna Galata
Chapter 9 - Giedrius Statkevičius
Chapter 10 - Giedrius Statkevičius
Chapter 11 - Cian Synnott
Chapter 12 - Cian Synnott
Chapter 13 - Vaibhav Thakur
Chapter 14 - Parker Janke
Chapter 15 - Vaibhav Thakur
Table of Contents
Introduction
1.1 What is Prometheus and how does it work? 10
1.2 Benefits of using Prometheus 15
1.3 Challenges of using Prometheus 16
1.4 What is MetricFire? 17
First contact with Prometheus exporter
4.1 Overview 46
4.2 Quick overview on Prometheus concepts 46
4.2.1 Pull approach of data collection 46
4.2.2 Prometheus exporters 47
4.2.3 Flexible visualization 47
4.3 Implementing a Prometheus exporter 48
4.3.1 Application built-in exporter 49
4.3.2 Standalone/third-party exporter 49
4.4 Examples of exporter implementation using Python 50
4.4.1 Standalone/third-party exporter exposing CPU/memory usage 51
4.4.2 Exporter for a Flask application 53
4.5 Conclusion 56
Visualizations: an overview
5.1 Introduction 57
5.2 Prometheus Expression Browser 57
5.3 Prometheus console templates 59
5.4 Grafana 60
Getting started with PromQL
8.1 Introduction 77
8.2 Prometheus data architecture 78
8.3 Installation and configuration aspects affecting PromQL queries 80
8.4 Prometheus querying with PromQL: 10 examples 85
8.5 Conclusion 96
Prometheus remote storage
11.1 Introduction 115
11.2 Remote read 115
11.3 Configuration 116
11.4 Remote write 117
11.5 Configuration 118
11.6 Log messages 119
11.7 Metrics exported from the remote storage subsystem 120
13.4 Thanos implementation 131
13.5 Deployment 133
13.6 Grafana dashboards 185
13.7 Conclusion 187
Foreword
Introduction
Starting out, it can be tricky to know where to begin with the official
Prometheus docs and the wave of recent Prometheus content. This
book acts as a high level overview of the significant information
relating to Prometheus, as well as a solid collection of tips and
tricks for using Prometheus. We also take a look at where MetricFire
fits into the world of monitoring solutions you could choose from.
While going through this book, we recommend signing up for the 14-day free trial of MetricFire's Hosted Prometheus, so you can send metrics and learn to use Prometheus without any setup delay.
• storage
• aggregations
• visualization
• alerts
queried, it will request data via the remote_read endpoint and add
it to the local data. This can produce graphs that display a much
longer timeframe of metrics. MetricFire provides these remote
storage endpoints for your Prometheus installations. See chapter
11 for more information about remote storage.
To make the most of this ebook, get on to the MetricFire free trial
and try it out! You can use our hosted Prometheus service within
minutes of signing up, and you can try Grafana, PromQL and our
add-ons directly in the platform.
Deploying Prometheus to a Minikube Cluster with One Node
2.1 Introduction
We’ll go over what the YAML files contain and what they do as
we go, though we won’t go too deep into how Kubernetes works.
It should give you a good start for tackling the rest of this book.
Each of these YAML files instructs kubectl to submit a request to the Kubernetes API server, which then creates resources based on those instructions.
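As a minimal sketch of what one of these files and the matching kubectl call can look like (the namespace and file names here are placeholders, not necessarily the exact ones used later in this chapter):

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring    # everything we create in this chapter can live in one namespace

$ kubectl apply -f monitoring-namespace.yaml
namespace/monitoring created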
We're creating all three of these in one file, and you could bundle
them in with the deployment as well if you like. We’ll keep them
separate for clarity.
2.5 Deployment
The selector details how the ReplicaSet knows which pods it is controlling. This is a common way for one resource to target another.
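As a rough sketch (the label key and value are assumptions, not necessarily those used in this chapter's files), the selector simply has to match the labels in the pod template:

selector:
  matchLabels:
    app: prometheus      # the ReplicaSet controls any pod carrying this label
template:
  metadata:
    labels:
      app: prometheus    # must match the selector above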
• Image is the Docker image which will be used, in this case the Prometheus image hosted on quay.io.
• Command is the command to run in the container when it's launched.
• Args are the arguments to pass to that command, including the location of the configuration file which we'll set up below. A sketch of how these fields fit together follows this list.
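A minimal sketch of the container section of the deployment; the image tag and flag values are assumptions, so check them against the chapter's actual manifest:

containers:
  - name: prometheus
    image: quay.io/prometheus/prometheus:v2.22.0        # assumed tag
    command: ["/bin/prometheus"]
    args:
      - "--config.file=/etc/prometheus/prometheus.yml"  # config mounted from the ConfigMap
      - "--storage.tsdb.path=/prometheus"
    ports:
      - containerPort: 9090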
2.6 NodePort
The metrics available are all coming from Prometheus itself via that one scrape job in the configuration. We can bring up all the metrics for that job by searching for the label "job" with the value "prometheus": {job="prometheus"}.
A file for creating a DaemonSet looks a lot like the file for a normal
deployment. There’s no number of replicas however, since that’s
fixed by the DaemonSet, but there is a PodTemplate as before,
including metadata with annotations, and the spec for the
container.
The volumes for Node Exporter are quite different though. There’s
no configmap volume, but instead we can see system directories
from the node are mapped as volumes into the container. That's how Node Exporter accesses metric values. Node Exporter has permission to access those values because of the securityContext setting "privileged: true".
We’ll apply that now, and then look to see the DaemonSet running:
In the nodes job you can see we’ve added details for a secure
connection using credentials provided by Kubernetes. There are
also a number of relabelling rules. These act on the labelset for
the job, which consists of standard labels created by Prometheus,
and metadata labels provided by service discovery. These rules
can create new labels or change the settings of the job itself before
it runs.
The quickest way to load the new config is to scale the number of
replicas down to 0 and then back up to one, causing a new pod
to be created. This will lose the existing data, but of course, it’s all
been sent to MetricFire.
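Assuming the deployment is named prometheus and sits in the monitoring namespace (both assumptions), the scale-down/scale-up trick looks like this:

$ kubectl scale deployment prometheus --replicas=0 -n monitoring
$ kubectl scale deployment prometheus --replicas=1 -n monitoring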
If we refresh the configuration page we can now see the new jobs.
If we check the targets page, the targets and metadata are visible
as well. Metrics can be found under the kubernetes-pods job, with
the node prefix.
Once you’re comfortable with this setup, you can add other
services like cAdvisor for monitoring your containers, and jobs to
get metrics about other parts of Kubernetes. For each new service,
simply configure a new scrape job, update the configMap, and
reload the configuration in Prometheus. Easy!
All good tutorials should end by telling you how to clean up your
environment. In this case, it’s really easy: removing the namespace
will remove everything inside of it! So we'll just run
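The command itself is missing from the extracted text; assuming the namespace is called monitoring, it would be something like:

$ kubectl delete namespace monitoring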
Deploying Prometheus to Kubernetes on GKE with Helm
3.1 Overview
You will need to run a Kubernetes cluster first. You can use a
Minikube cluster like in Chapter 2, or deploy a cloud-managed
solution like GKE. In this chapter we’ll use GKE.
mv linux-amd64/helm /usr/local/bin/helm

macOS users can use brew install helm, and Windows users can use Chocolatey: choco install kubernetes-helm. Linux users (and macOS users as well) can use the following script:
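The script was not reproduced here; Helm's official installer is typically fetched along these lines (verify the URL against the current Helm documentation before running it):

$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
$ chmod 700 get_helm.sh
$ ./get_helm.sh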
Now that our pods are running, we have the option to use the
Prometheus dashboard right from our local machine. This is done
by using the following command:
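The command did not survive extraction; it is normally a port-forward to the Prometheus service created by the chart. The service name below is an assumption and depends on your release:

$ kubectl port-forward svc/prometheus-operated 9090:9090

You can then open http://localhost:9090 in your browser.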
Note that you should use "admin" as the login and "prom-operator" as the password. Both can be found in a Kubernetes Secret object:
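The lookup was shown as a screenshot; roughly, and assuming the chart created a Secret named prometheus-grafana with an admin-password key (names vary with your release), it looks like:

$ kubectl get secret prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 --decode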
Using the "base64 --decode", you will be able to see the clear
credentials.
If you are curious, you can find more details about these
dashboards here. For example, if you want to see how the
"Kubernetes / Compute Resources / Namespace (Pods)" dashboard
works, you should view this ConfigMap. For more on visualizations
with Grafana, see chapter 5.
• prometheus-operator
• prometheus
• alertmanager
• node-exporter
• kube-state-metrics
• grafana
• service monitors to scrape internal kubernetes components
• kube-apiserver
• kube-scheduler
• kube-controller-manager
• etcd
• kube-dns/coredns
• kube-proxy
First Contact with Prometheus Exporters
4.1 Introduction
(i) retrieve the current CPU usage along with the label of each
individual pod
(ii) sum up the usage based on pod labels
(iii) make the results available for scraping
4.4 Examples of exporter implementation using Python
For this program, we’ll need to install the Prometheus client library
for Python.
Our final exporter code looks like this: (see source gist)
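Since the gist is only linked, here is a minimal sketch of such an exporter, assuming the prometheus_client and psutil libraries and placeholder metric names; the downloadable version below remains the reference:

# Minimal CPU/memory exporter sketch using prometheus_client and psutil.
import time
import psutil
from prometheus_client import Gauge, start_http_server

cpu_usage = Gauge("system_cpu_usage_percent", "Current CPU usage in percent")
memory_usage = Gauge("system_memory_usage_percent", "Current memory usage in percent")

if __name__ == "__main__":
    start_http_server(9999)   # metrics exposed at http://localhost:9999/metrics
    while True:
        cpu_usage.set(psutil.cpu_percent())
        memory_usage.set(psutil.virtual_memory().percent)
        time.sleep(5)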
$ curl -o prometheus_exporter_cpu_memory_usage.py \
  -s -L https://git.io/Jesvq
$ python ./prometheus_exporter_cpu_memory_usage.py
$ curl -o prometheus_exporter_flask.py \
-s -L https://git.io/Jesvh
Note that --wsgi-file must point to the Python program file, while the value of the --callable option must match the name of the WSGI application declared in our program (line 23).
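For reference, the uWSGI invocation would look roughly like this; the port and the callable name app_dispatch are assumptions, since the referenced program listing is not reproduced here:

$ uwsgi --http 127.0.0.1:8000 \
  --wsgi-file prometheus_exporter_flask.py \
  --callable app_dispatch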
4.5 Conclusion
Visualizations: An Overview
5.1 Introduction
When you hit the Execute button, data relating to this particular metric, provided it exists, is presented both in a table (on the Console tab) and a graph (on the Graph tab), allowing you to switch between the two with a single click. For example, say we want to pull up data on the prometheus_target_interval_length_seconds metric. This metric measures the amount of time between target scrapes, in other words, the amount of time between data collections from a Prometheus target. Entering this metric into the Expression Browser yields the following results:
You can visit the /metrics endpoint to get a list of all the time series
data metrics being monitored by Prometheus. You could have
multiple graphs open on the Expression Browser at a time, but it’s
best practice to keep it at a bare minimum, only monitoring data
that is essential to accomplish your goal. Flooding the page with
irrelevant graphs/lines on a graph can cause you to lose focus on
what is important and thus, miss important signals.
5.4 Grafana
Connecting Prometheus and Grafana
6.1 Introduction
version: '3.2'
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
  cadvisor:
    image: google/cadvisor:latest
    container_name: cadvisor
    ports:
      - 8080:8080
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    depends_on:
      - redis
  redis:
    image: redis:latest
    container_name: redis
    ports:
      - 6379:6379
# my global config
global:

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
global:

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
        labels:
          alias: 'cadvisor'
6.4 Conclusion
Important Metrics to Watch in Production
After setting up the name of the visualization, you can set alerts.
Let's say that the average of the metric we chose should not
exceed "0.06":
Let's try a third important metric to watch: the total number of restarts of our container. You can access this information using kube_pod_container_status_restarts_total, or kube_pod_container_status_restarts_total{namespace="<namespace>"} for a specific namespace.
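For example (this exact expression is an illustration rather than a quote from the original text), you could watch for any container that restarted during the last hour:

increase(kube_pod_container_status_restarts_total{namespace="<namespace>"}[1h]) > 0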
The list is long, but the important metrics vary from one context to another: you may consider the etcd metrics unimportant for your production environment, while they could be essential for someone else.
Getting Started with PromQL
8.1 Introduction
This chapter will go through the data storage architecture, and then outline 10 examples of how to use PromQL. The examples will be rooted in the theory laid out in parts 8.2 and 8.3.
The first method is the simplest - find the package, and start the
installation. However, we should remember that Linux repositories
often don’t contain the latest software versions. The second way
is the most complicated from the user's side, but it allows us to
customize all the components of the monitoring system. The
method using Docker containers is convenient at the stage of
deployment on a remote server or some cloud platform.
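As a quick sketch of the Docker route (the host path to prometheus.yml is an assumption; adjust it to your own layout):

$ docker run -d -p 9090:9090 \
  -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus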
1. Let’s show the value of the all-time time series with the counter
metric prometheus_http_requests_total. This will show us all of
the data denoted by this metric name, or “key”.
2. Now, let’s take a look at the same metric as in example 1, but add
some labels. Let’s add the values for the labels “job” and “code”.
3. Let’s take a look at the metric from example 1 again, but let’s
add the time interval component. In the example below, we add
the interval component by writing [15m] at the end of the query.
The result will be displayed on the console tab in the web interface,
and not on the graph tab. You will see the result as a list of values.
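The corresponding range-vector query is just the metric name with the interval appended:

prometheus_http_requests_total[15m]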
6. The PromQL query function also allows you to create filters that
combine different metrics using their names. Let's demonstrate it
for histogram and counter metrics with similar names.
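Such a filter usually relies on a regular expression over the reserved __name__ label; the metric prefix below is only an illustration:

{__name__=~"prometheus_http_request_duration_seconds.*"}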
• greater (>)
• greater-or-equal (>=)
• less (<)
• less-or-equal (<=)
• equal (==)
• not equal (!=)
• and (intersection)
• or (union)
• unless (complement)
Arithmetic operations:
• addition (+)
• subtraction (-)
• division (/)
• multiplication (*)
• power (^)
• modulo (%)
8.5 Conclusion
Top 5 Alertmanager Gotchas
9.1 Introduction
One of the first things that you will run into while defining alerts are two things called annotations and labels. Here is what a simple alerting rule looks like:
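The rule itself was shown as an image; a representative sketch (the metric, threshold, and wording are placeholders, not the original rule) would be:

groups:
- name: example
  rules:
  - alert: HighErrorRate
    expr: rate(http_requests_total{status="500"}[5m]) > 0.5
    for: 10m
    labels:
      severity: page
    annotations:
      summary: "High 500 rate on {{ $labels.instance }}"
      description: "The current value is {{ $value }}"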
As you can see, annotations and labels are seemingly used for the
same thing: adding extra data to the alert above what is already
there.
Also, you can use the available templating system. For instance,
in this example you can see things such as {{ $value }} which gets
substituted with the value of the alert’s expression. This is not
possible with labels which are ordinary string values. You can
find more information about the different possibilities for the
After writing some alerting rules, have you ever noticed that they continuously fire and become resolved, time and time again? This may be happening because your alerting rule fires the alert to your configured receivers too quickly, without waiting to see if the problem resolves naturally. How do we solve this?
First of all, in your alerting rules you ought to almost always have
some kind of time component in them that indicates how long the
alert should wait before sending the notification. This is important
because failures are inevitable due to the network connection
being imperfect so we might sometimes fail to scrape a target.
We want to make sure the alert has been active for a designated
amount of time before we send the notification, that way, if the
alert gets resolved within 1 - 2 minutes, we aren't notified at all.
Also, any service level indicators, objectives, or agreements that
you might have for your services are typically defined in terms
of an error budget that you can "spend" over some time or the
percent that your service was available over the last, let's say,
one month. In turn, monitoring systems such as Prometheus are
designed to alert on trends - when something goes really haywire
- but not on exact facts. Here is a useful generator for SLOs that
generates these things automatically for you according to Google's
book.
In this picture the "for" value is equal to 10m or, in other words, 10
minutes. It means that Prometheus will check that the alert has
been active for 10 minutes before firing the alert to your configured
receivers. Also, note that by default alerting rules are evaluated
every 1 minute and you can change that via the evaluation_interval
parameter in your Prometheus configuration.
Now let’s talk about the dangers that lie in the PromQL expressions
that you might use for alerting rules.
9.4.1 Missing metrics
First of all, you should keep in mind that the metric that you have
written in the expr field might not exist at some point in time.
In such a case, alerting rules silently start failing i.e. they do not
become "firing". To solve this problem, you should add extra alerts
which would alert you on missing metrics with the absent function.
Some even coalesce this with the original alert by using the "or"
binary operator:
checker_upload_last_succeeded{instance="foo.bar"} != 1 or absent(checker_upload_last_succeeded{instance="foo.bar"}) == 1
However, this is mostly useful in cases where you do not use any
aggregation functions, and only use a single metric, since it quickly
becomes unwieldy.
The labels defined in the alerting rule are static, however the
labels on the metric are not - this is where the cardinality problem
comes from. This could potentially significantly slow down your
Prometheus instance because each new time series is equal to a
new entry in the index which leads to increased look-up times when
you are querying something. This is one of the ways to get what is
called a “cardinality explosion” in your Prometheus instance. You
should always, always validate that your alerting expression will
not touch too many different time series.
has been created. Perhaps, you will have an alerting rule like:
kube_pod_created > 1575763200 to know if any pods have been
created after 12/08/2019 @ 12:00am (UTC) which could be the
start of your Kubernetes cluster's maintenance window. Alas, your
users would continue creating thousands of new pods each day. In
this case, ALERTS and ALERTS_FOR_STATE would match all of the
pods’ information (the pod's name and its namespace, to be more
exact) that you have in your Kubernetes cluster thus leading to a
multiplication of the original time series.
Last but not least, let’s talk about how alerting rules might start
utilizing all of the resources of your Prometheus instance. You might
unwittingly write an alert which would load hundreds of thousands
samples into memory. This is where --query.max-samples jumps in.
By default, it forbids you from loading more than 50 million samples
into memory with one query. You should adjust it accordingly.
If it hits that limit then you will see errors such as this in your
Prometheus logs: "query processing will load too many samples
into memory in query execution". This is a very helpful notification!
But, typically your queries will go through and you will not notice
anything wrong until one day your Prometheus instance might
start responding slower. Fortunately Prometheus has a lot of nice
metrics itself that show what might be taking longer than usual.
For instance, prometheus_rule_group_last_duration_seconds will
show, by alerting rule group, how long it took to evaluate them
the last time in seconds. You will most likely want to create or
import a Grafana dashboard which will visualize this data for you,
or you could actually write another alert which will notify you in
case it starts taking more than a specified threshold. The alerting
expression could look like this:
avg_over_time(prometheus_rule_group_last_duration_seconds{instance="my.prometheus.metricfire.com"}[5m]) > 20
You will most likely want to add the calculation of the average here
since the duration of the last evaluation will naturally have some
level of jitter because the load of your Prometheus will inevitably
differ over time.
9.5 Conclusion
Understanding Prometheus Rate()
10.1 Introduction
As you can see, instant vectors only define the value that has been
most recently scraped. Rate() and its cousins take an argument
of the range type since to calculate any kind of change, you need
at least two points of data. They do not return any result at all
if there are less than two samples available. PromQL indicates
range vectors by writing a time range in square brackets next to a
selector which says how much time into the past it should go.
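As a quick illustration (the metric name is a placeholder), a range selector and a rate over it look like this:

http_requests_total[5m]
rate(http_requests_total[5m])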
One could also use the special variable in Grafana called $__
interval - it is defined to be equal to the time range divided by the
step’s size. It could seem like the perfect solution as it looks like all
of the data points between each step would be considered, but it
has the same problems as mentioned previously. It is impossible to
see both very detailed graphs and broad trends at the same time.
Also your time interval becomes tied to your query step, so if your
scrape interval ever changes then you might have problems with
very small time ranges.
10.2.3 Calculation
Just like everything else, the function gets evaluated on each step.
But, how does it work?
The nice thing about the rate() function is that it takes into account all of the data points, not just the first one and the last one. There is another function, irate, which uses only the last two data points.
Well, rate() that we have just described has this nice characteristic:
it automatically adjusts for resets. What this means is that it is
only suitable for metrics which are constantly increasing, a.k.a. the
metric type that is called a “counter”. It’s not suitable for a “gauge”.
Also, a keen reader would have noticed that using rate() is a hack
to work around the limitation that floating-point numbers are used
for metrics’ values and that they cannot go up indefinitely so they
are “rolled over” once a limit is reached. This logic prevents us
from losing old data, so using rate() is a good idea when you need
this feature.
Either way, PromQL currently will not prevent you from using
rate() with a gauge, so this is a very important thing to realize
when choosing which metric should be passed to this function. It
is incorrect to use rate() with gauges because the reset detection
logic will mistakenly catch the values going down as a “counter
reset” and you will get wrong results.
All in all, let’s say you have a counter metric which is changing like this:
0
4
6
10
2
The reset between “10” and “2” would be caught by irate() and
rate() and it would be taken as if the value after that were “12” i.e.
it has increased by “2” (from zero). Let’s say that we were trying
to calculate the rate with rate() over 60 seconds and we got these
6 samples on ideal timestamps. So the resulting average rate of
increase per second would be:
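The result itself is missing from the extracted text; working it through under those assumptions, the reset-adjusted increase over the window is 4 + 2 + 4 + 2 = 12, so the average rate of increase would be roughly 12 / 60 = 0.2 per second (the value Prometheus actually reports can differ slightly because of the extrapolation described below).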
Last but not least, it’s important to understand that rate() performs
extrapolation. Knowing this will save you from headaches in the
long-term. Sometimes when rate() is executed in a point in time,
there might be some data missing if some of the scrapes had
failed. What’s more, the scrape interval due to added randomness
In such a case, rate() calculates the rate with the data that it has
and then, if there is any information missing, extrapolates the
beginning or the end of the selected window using either the first
or last two data points. This means that you might get uneven
results even if all of the data points are integers, so this function
is suited only for spotting trends, spikes, and for alerting if
something happens.
10.2.5 Aggregation
10.3 Examples
groups:
- name: Errors
  rules:
  - alert: ErrorsCountIncreased
    expr: sum(rate(haproxy_connection_errors_total[5m])) by (backend) > 0.5
    for: 10m
    labels:
      severity: page
    annotations:
      summary: High connection error count in {{ $labels.backend }}
As you can see, they calculate the rate of change of the number of requests that were not 5xx and then divide it by the rate of change of the total number of requests. If there are any 5xx
can, again, use this formula in your alerting rules with some kind of
specified threshold - then you would get an alert if it is violated or
you could predict the near future with predict_linear and avoid any
SLA/SLO problems.
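The formula described here would look something like this in PromQL; the metric and label names are placeholders, since the original expression was shown as an image:

sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))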
Prometheus Remote Storage
11.1 Introduction
11.3 Configuration
At its simplest, you will just specify the read endpoint URL for your
remote storage, plus an authentication method. You can use either
HTTP basic or bearer token authentication.
You might want to use the read_recent flag: when set to true,
all queries will be answered from remote as well as local storage.
When false (the default), any queries that can be answered
completely from local storage will not be sent to the remote
endpoint.
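A minimal remote_read block along those lines (the URL and credentials are placeholders) might look like:

remote_read:
  - url: "https://remote-storage.example.com/read"
    read_recent: true
    basic_auth:
      username: my-user
      password: my-password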
11.5 Configuration
writeRelabelConfigs:
  # drop all metrics of this name across all jobs
  - sourceLabels: ["__name__"]
    regex: some_metric_prefix_to_drop_.*
    action: drop
Like for remote_read, you can also configure options for request
timeouts, TLS configuration, and proxy setup.
You may see some messages from the remote storage subsystem
in your logs:
• prometheus_remote_storage_samples_in_total: samples in to remote storage, compare to samples out for queue managers (counter);
• prometheus_remote_storage_succeeded_samples_total: total number of samples successfully sent to remote storage (counter);
• prometheus_remote_storage_pending_samples: the number of samples pending in the queue's shards to be sent to the remote storage (gauge);
• prometheus_remote_storage_shards: the number of shards used for parallel sending to the remote storage (gauge);
• prometheus_remote_storage_sent_batch_duration_seconds: duration of sample batch send calls to the remote storage (histogram).
Example 1: Monitoring a Python Web App with Prometheus
12.1 Introduction
This has worked out really well for us over the years: as our
own customer, we quickly spot issues in our various ingestion,
storage and rendering services. It also drives the service status
transparency our users love.
This chapter describes how we’ve done that in one instance, with
a fully worked example of monitoring a simple Flask application
running under uWSGI + nginx. We’ll also discuss why it remains
surprisingly involved to get this right.
For example, each scrape of a specific counter will return the value
for one worker rather than the whole job: the value jumps all over
the place and tells you nothing useful about the application as a
whole.
12.4 A solution
full example will help anyone doing similar work in the future.
12.5 Futures
It’s worth noting that having one target per worker contributes
to something of a time series explosion. For example, in this case
a single default Histogram metric to track response times from
the Python client across 8 workers would produce around 140
individual time series, before multiplying by other labels we might
include. That’s not a problem for Prometheus to handle, but it
does add up (or likely, multiply) as you scale, so be careful!
12.6 Conclusion
Example 2: HA Kubernetes Monitoring with Prometheus and Thanos
13.1 Introduction
Simple load balancing will not work either -- say one Prometheus replica crashes. It may come back up, but queries against it will show a gap in the data for the period during which it was down. This isn't fixed by adding a second replica, because that one could also be down at any moment, for example during a rolling restart. These cases show how load balancing can fail.
Thanos Query pulls the data from both replicas and deduplicates
those signals, filling the gaps, if any, to the Querier consumer.
Thanos Ruler basically does the same thing as the querier but for
Prometheus’ rules. The only difference is that it can communicate
with Thanos components.
13.5 Deployment
apiVersion: v1
kind: Namespace
metadata:
name: monitoring
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: monitoring
namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
name: monitoring
namespace: monitoring
rules:
- apiGroups: [""]
resources:
- nodes
- nodes/proxy
- services
- endpoints
- pods
- apiGroups: [""]
resources:
- configmaps
verbs: ["get"]
- nonResourceURLs: ["/metrics"]
verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
name: monitoring
subjects:
- kind: ServiceAccount
name: monitoring
namespace: monitoring
roleRef:
kind: ClusterRole
  name: monitoring
apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-server-conf
labels:
name: prometheus-server-conf
namespace: monitoring
data:
prometheus.yaml.tmpl: |-
global:
scrape_interval: 5s
evaluation_interval: 5s
external_labels:
cluster: prometheus-ha
replica: $(POD_NAME)
rule_files:
- /etc/prometheus/rules/*rules.yaml
alerting:
alert_relabel_configs:
- regex: replica
action: labeldrop
alertmanagers:
- scheme: http
path_prefix: /
static_configs:
- targets: ['alertmanager:9093']
scrape_configs:
- job_name: kubernetes-nodes-cadvisor
scrape_interval: 10s
scrape_timeout: 10s
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/
serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/
serviceaccount/token
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
# See: https://github.com/prometheus/prometheus/issues/2916
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics/
cadvisor
metric_relabel_configs:
- action: replace
source_labels: [id]
regex: '^/machine\.slice/machine-rkt\\
x2d([^\\]+)\\.+/([^/]+)\.service$'
target_label: rkt_container_name
replacement: '${2}-${1}'
- action: replace
source_labels: [id]
regex: '^/system\.slice/(.+)\.service$'
target_label: systemd_service_name
replacement: '${1}'
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/
serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/
serviceaccount/token
relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
- job_name: 'kubernetes-service-endpoints'
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
action: replace
target_label: kubernetes_name
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: (.+)(?::\d+);(\d+)
        replacement: $1:$2
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-rules
labels:
name: prometheus-rules
namespace: monitoring
data:
alert-rules.yaml: |-
groups:
- name: Deployment
rules:
annotations:
expr: |
sum(kube_deployment_status_replicas{pod_
for: 1m
labels:
team: devops
annotations:
limited state
expr: |
(sum(kube_hpa_status_
condition{condition="ScalingLimited",status="true"}) by
(hpa,namespace)) == 1
for: 1m
labels:
team: devops
annotations:
expr: |
((sum(kube_hpa_spec_max_replicas) by
(hpa,namespace)) - (sum(kube_hpa_status_current_replicas)
by (hpa,namespace))) == 0
for: 1m
labels:
team: devops
- name: Pods
rules:
annotations:
expr: |
sum(increase(kube_pod_container_status_restarts_
total{namespace!="kube-system",pod_template_hash=""}[1m]))
by (pod,namespace,container) > 0
for: 0m
labels:
team: dev
annotations:
expr: |
((( sum(container_memory_usage_
bytes{image!="",container_name!="POD", namespace!="kube-
system"}) by (namespace,container_name,pod_name) /
sum(container_spec_memory_limit_bytes{image!="",container_
name!="POD",namespace!="kube-system"}) by
75
for: 5m
labels:
team: dev
annotations:
expr: |
((sum(irate(container_cpu_usage_seconds_
total{image!="",container_name!="POD", namespace!="kube-
system"}[30s])) by (namespace,container_name,pod_name)
/ sum(container_spec_cpu_quota{image!="",container_
spec_cpu_period{image!="",container_name!="POD",
namespace!="kube-system"}) by (namespace,container_
for: 5m
labels:
team: dev
- name: Nodes
rules:
annotations:
expr: |
(sum (container_memory_working_set_
bytes{id="/",container_name!="POD"}) by (kubernetes_io_
for: 5m
labels:
team: devops
annotations:
Capacity.
expr: |
(sum(rate(container_cpu_usage_seconds_
for: 5m
labels:
team: devops
annotations:
expr: |
(sum(container_fs_usage_bytes{device=~"^/dev/[sv]d[a-z]
[1-9]$",id="/",container_name!="POD"}) by (kubernetes_
io_hostname) / sum(container_fs_limit_bytes{container_
name!="POD",device=~"^/dev/[sv]d[a-z][1-9]$",id="/"}) by
for: 5m
labels:
team: devops
apiVersion: storage.k8s.io/v1beta1
kind: StorageClass
metadata:
name: fast
namespace: monitoring
provisioner: kubernetes.io/gce-pd
allowVolumeExpansion: true
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
name: prometheus
namespace: monitoring
spec:
replicas: 3
serviceName: prometheus-service
template:
metadata:
labels:
app: prometheus
thanos-store-api: "true"
spec:
serviceAccountName: monitoring
containers:
- name: prometheus
image: prom/prometheus:v2.4.3
args:
- "--config.file=/etc/prometheus-shared/
prometheus.yaml"
- "--storage.tsdb.path=/prometheus/"
- "--web.enable-lifecycle"
- "--storage.tsdb.no-lockfile"
- "--storage.tsdb.min-block-duration=2h"
- "--storage.tsdb.max-block-duration=2h"
ports:
- name: prometheus
containerPort: 9090
volumeMounts:
- name: prometheus-storage
mountPath: /prometheus/
- name: prometheus-config-shared
mountPath: /etc/prometheus-shared/
- name: prometheus-rules
mountPath: /etc/prometheus/rules
- name: thanos
image: quay.io/thanos/thanos:v0.8.0
args:
- "sidecar"
- "--log.level=debug"
- "--tsdb.path=/prometheus"
- "--prometheus.url=http://127.0.0.1:9090"
{bucket: prometheus-long-term}}"
- "--reloader.config-file=/etc/prometheus/
prometheus.yaml.tmpl"
- "--reloader.config-envsubst-file=/etc/
prometheus-shared/prometheus.yaml"
- "--reloader.rule-dir=/etc/prometheus/rules/"
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name : GOOGLE_APPLICATION_CREDENTIALS
value: /etc/secret/thanos-gcs-credentials.json
ports:
- name: http-sidecar
containerPort: 10902
- name: grpc
containerPort: 10901
livenessProbe:
httpGet:
port: 10902
path: /-/healthy
readinessProbe:
httpGet:
port: 10902
path: /-/ready
volumeMounts:
- name: prometheus-storage
mountPath: /prometheus
- name: prometheus-config-shared
mountPath: /etc/prometheus-shared/
- name: prometheus-config
mountPath: /etc/prometheus
- name: prometheus-rules
mountPath: /etc/prometheus/rules
- name: thanos-gcs-credentials
mountPath: /etc/secret
readOnly: false
securityContext:
fsGroup: 2000
runAsNonRoot: true
runAsUser: 1000
volumes:
- name: prometheus-config
configMap:
defaultMode: 420
name: prometheus-server-conf
- name: prometheus-config-shared
emptyDir: {}
- name: prometheus-rules
configMap:
name: prometheus-rules
- name: thanos-gcs-credentials
secret:
secretName: thanos-gcs-credentials
volumeClaimTemplates:
- metadata:
name: prometheus-storage
namespace: monitoring
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: fast
resources:
requests:
storage: 20Gi
apiVersion: v1
kind: Service
metadata:
name: prometheus-0-service
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9090"
namespace: monitoring
labels:
name: prometheus
spec:
selector:
statefulset.kubernetes.io/pod-name: prometheus-0
ports:
- name: prometheus
port: 8080
targetPort: prometheus
---
apiVersion: v1
kind: Service
metadata:
name: prometheus-1-service
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9090"
namespace: monitoring
labels:
name: prometheus
spec:
selector:
statefulset.kubernetes.io/pod-name: prometheus-1
ports:
- name: prometheus
port: 8080
targetPort: prometheus
---
apiVersion: v1
kind: Service
metadata:
name: prometheus-2-service
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9090"
namespace: monitoring
labels:
name: prometheus
spec:
selector:
statefulset.kubernetes.io/pod-name: prometheus-2
ports:
- name: prometheus
port: 8080
targetPort: prometheus
---
# This service creates a srv record for querier to find about store-api's
apiVersion: v1
kind: Service
metadata:
name: thanos-store-gateway
namespace: monitoring
spec:
type: ClusterIP
clusterIP: None
ports:
- name: grpc
port: 10901
targetPort: grpc
selector:
thanos-store-api: "true"
apiVersion: v1
kind: Namespace
metadata:
name: monitoring
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: thanos-querier
namespace: monitoring
labels:
app: thanos-querier
spec:
replicas: 1
selector:
matchLabels:
app: thanos-querier
template:
metadata:
labels:
app: thanos-querier
spec:
containers:
- name: thanos
image: quay.io/thanos/thanos:v0.8.0
args:
- query
- --log.level=debug
- --query.replica-label=replica
- --store=dnssrv+thanos-store-gateway:10901
ports:
- name: http
containerPort: 10902
- name: grpc
containerPort: 10901
livenessProbe:
httpGet:
port: http
path: /-/healthy
readinessProbe:
httpGet:
port: http
path: /-/ready
---
apiVersion: v1
kind: Service
metadata:
labels:
app: thanos-querier
name: thanos-querier
namespace: monitoring
spec:
ports:
- port: 9090
protocol: TCP
targetPort: http
name: http
selector:
app: thanos-querier
apiVersion: v1
kind: Namespace
metadata:
name: monitoring
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
name: thanos-store-gateway
namespace: monitoring
labels:
app: thanos-store-gateway
spec:
replicas: 1
selector:
matchLabels:
app: thanos-store-gateway
serviceName: thanos-store-gateway
template:
metadata:
labels:
app: thanos-store-gateway
thanos-store-api: "true"
spec:
containers:
- name: thanos
image: quay.io/thanos/thanos:v0.8.0
args:
- "store"
- "--log.level=debug"
- "--data-dir=/data"
prometheus-long-term}}"
- "--index-cache-size=500MB"
- "--chunk-pool-size=500MB"
env:
- name : GOOGLE_APPLICATION_CREDENTIALS
value: /etc/secret/thanos-gcs-credentials.json
ports:
- name: http
containerPort: 10902
- name: grpc
containerPort: 10901
livenessProbe:
httpGet:
port: 10902
path: /-/healthy
readinessProbe:
httpGet:
port: 10902
path: /-/ready
volumeMounts:
- name: thanos-gcs-credentials
mountPath: /etc/secret
readOnly: false
volumes:
- name: thanos-gcs-credentials
secret:
secretName: thanos-gcs-credentials
---
apiVersion: v1
kind: Namespace
metadata:
name: monitoring
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
name: thanos-compactor
namespace: monitoring
labels:
app: thanos-compactor
spec:
replicas: 1
selector:
matchLabels:
app: thanos-compactor
serviceName: thanos-compactor
template:
metadata:
labels:
app: thanos-compactor
spec:
containers:
- name: thanos
image: quay.io/thanos/thanos:v0.8.0
args:
- "compact"
- "--log.level=debug"
- "--data-dir=/data"
{bucket: prometheus-long-term}}"
- "--wait"
env:
- name : GOOGLE_APPLICATION_CREDENTIALS
value: /etc/secret/thanos-gcs-credentials.json
ports:
- name: http
containerPort: 10902
livenessProbe:
httpGet:
port: 10902
path: /-/healthy
readinessProbe:
httpGet:
port: 10902
path: /-/ready
volumeMounts:
- name: thanos-gcs-credentials
mountPath: /etc/secret
readOnly: false
volumes:
- name: thanos-gcs-credentials
secret:
secretName: thanos-gcs-credentials
apiVersion: v1
kind: Namespace
metadata:
name: monitoring
---
apiVersion: v1
kind: ConfigMap
metadata:
name: thanos-ruler-rules
namespace: monitoring
data:
alert_down_services.rules.yaml: |
groups:
- name: metamonitoring
rules:
- alert: PrometheusReplicaDown
annotations:
discovery.
expr: |
sum(up{cluster="prometheus-ha",
instance=~".*:9090", job="kubernetes-service-endpoints"})
by (job,cluster) < 3
for: 15s
labels:
severity: critical
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
labels:
app: thanos-ruler
name: thanos-ruler
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: thanos-ruler
serviceName: thanos-ruler
template:
metadata:
labels:
app: thanos-ruler
thanos-store-api: "true"
spec:
containers:
- name: thanos
image: quay.io/thanos/thanos:v0.8.0
args:
- rule
- --log.level=debug
- --data-dir=/data
- --eval-interval=15s
- --rule-file=/etc/thanos-ruler/*.rules.yaml
- --alertmanagers.url=http://alertmanager:9093
        - --query=thanos-querier:9090
        - "--objstore.config={type: GCS, config: {bucket: thanos-ruler}}"
- --label=ruler_cluster="prometheus-ha"
- --label=replica="$(POD_NAME)"
env:
- name : GOOGLE_APPLICATION_CREDENTIALS
value: /etc/secret/thanos-gcs-credentials.json
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
ports:
- name: http
containerPort: 10902
- name: grpc
containerPort: 10901
livenessProbe:
httpGet:
port: http
path: /-/healthy
readinessProbe:
httpGet:
port: http
path: /-/ready
volumeMounts:
- mountPath: /etc/thanos-ruler
name: config
- name: thanos-gcs-credentials
mountPath: /etc/secret
readOnly: false
volumes:
- configMap:
name: thanos-ruler-rules
name: config
- name: thanos-gcs-credentials
secret:
secretName: thanos-gcs-credentials
---
apiVersion: v1
kind: Service
metadata:
labels:
app: thanos-ruler
name: thanos-ruler
namespace: monitoring
spec:
ports:
- port: 9090
protocol: TCP
targetPort: http
name: http
selector:
app: thanos-ruler
root@my-shell-95cb5df57-4q6w8:/# nslookup thanos-store-gateway
Server:    10.63.240.10
Address:   10.63.240.10#53
Name: thanos-store-gateway.monitoring.svc.cluster.local
Address: 10.60.25.2
Name: thanos-store-gateway.monitoring.svc.cluster.local
Address: 10.60.25.4
Name: thanos-store-gateway.monitoring.svc.cluster.local
Address: 10.60.30.2
Name: thanos-store-gateway.monitoring.svc.cluster.local
Address: 10.60.30.8
Name: thanos-store-gateway.monitoring.svc.cluster.local
Address: 10.60.31.2
root@my-shell-95cb5df57-4q6w8:/# exit
apiVersion: v1
kind: Namespace
metadata:
name: monitoring
---
kind: ConfigMap
apiVersion: v1
metadata:
name: alertmanager
namespace: monitoring
data:
config.yml: |-
global:
resolve_timeout: 5m
slack_api_url: "<your_slack_hook>"
victorops_api_url: "<your_victorops_hook>"
templates:
- '/etc/alertmanager-templates/*.tmpl'
route:
group_wait: 10s
group_interval: 1m
repeat_interval: 5m
receiver: default
routes:
- match:
team: devops
receiver: devops
continue: true
- match:
team: dev
receiver: dev
continue: true
receivers:
- name: 'default'
- name: 'devops'
victorops_configs:
- api_key: '<YOUR_API_KEY>'
routing_key: 'devops'
message_type: 'CRITICAL'
}}'
.CommonLabels }}'
slack_configs:
- channel: '#k8-alerts'
send_resolved: true
- name: 'dev'
victorops_configs:
- api_key: '<YOUR_API_KEY>'
routing_key: 'dev'
message_type: 'CRITICAL'
}}'
.CommonLabels }}'
slack_configs:
- channel: '#k8-alerts'
send_resolved: true
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: alertmanager
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: alertmanager
template:
metadata:
name: alertmanager
labels:
app: alertmanager
spec:
containers:
- name: alertmanager
image: prom/alertmanager:v0.15.3
args:
- '--config.file=/etc/alertmanager/config.yml'
- '--storage.path=/alertmanager'
ports:
- name: alertmanager
containerPort: 9093
volumeMounts:
- name: config-volume
mountPath: /etc/alertmanager
- name: alertmanager
mountPath: /alertmanager
volumes:
- name: config-volume
configMap:
name: alertmanager
- name: alertmanager
emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
annotations:
prometheus.io/scrape: 'true'
prometheus.io/path: '/metrics'
labels:
name: alertmanager
name: alertmanager
namespace: monitoring
spec:
selector:
app: alertmanager
ports:
- name: alertmanager
protocol: TCP
port: 9093
targetPort: 9093
apiVersion: v1
kind: Namespace
metadata:
name: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
# rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
name: kube-state-metrics
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: kube-state-metrics
subjects:
- kind: ServiceAccount
name: kube-state-metrics
namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
# rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
name: kube-state-metrics
rules:
- apiGroups: [""]
resources:
- configmaps
- secrets
- nodes
- pods
- services
- resourcequotas
- replicationcontrollers
- limitranges
- persistentvolumeclaims
- persistentvolumes
- namespaces
- endpoints
- apiGroups: ["extensions"]
resources:
- daemonsets
- deployments
- replicasets
- apiGroups: ["apps"]
resources:
- statefulsets
- apiGroups: ["batch"]
resources:
- cronjobs
- jobs
- apiGroups: ["autoscaling"]
resources:
- horizontalpodautoscalers
---
apiVersion: rbac.authorization.k8s.io/v1
# rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
name: kube-state-metrics
namespace: monitoring
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: kube-state-metrics-resizer
subjects:
- kind: ServiceAccount
name: kube-state-metrics
namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
# rbac.authorization.k8s.io/v1beta1
kind: Role
metadata:
namespace: monitoring
name: kube-state-metrics-resizer
rules:
- apiGroups: [""]
resources:
- pods
verbs: ["get"]
- apiGroups: ["extensions"]
resources:
- deployments
resourceNames: ["kube-state-metrics"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: kube-state-metrics
namespace: monitoring
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: kube-state-metrics
namespace: monitoring
spec:
selector:
matchLabels:
k8s-app: kube-state-metrics
replicas: 1
template:
metadata:
labels:
k8s-app: kube-state-metrics
spec:
serviceAccountName: kube-state-metrics
containers:
- name: kube-state-metrics
        image: quay.io/mxinden/kube-state-metrics:v1.4.0-gzip.3
ports:
- name: http-metrics
containerPort: 8080
- name: telemetry
containerPort: 8081
readinessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 5
timeoutSeconds: 5
- name: addon-resizer
image: k8s.gcr.io/addon-resizer:1.8.3
resources:
limits:
cpu: 150m
memory: 50Mi
requests:
cpu: 150m
memory: 50Mi
env:
- name: MY_POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: MY_POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
command:
- /pod_nanny
- --container=kube-state-metrics
- --cpu=100m
- --extra-cpu=1m
- --memory=100Mi
- --extra-memory=2Mi
- --threshold=5
- --deployment=kube-state-metrics
---
apiVersion: v1
kind: Service
metadata:
name: kube-state-metrics
namespace: monitoring
labels:
k8s-app: kube-state-metrics
annotations:
prometheus.io/scrape: 'true'
spec:
ports:
- name: http-metrics
port: 8080
targetPort: http-metrics
protocol: TCP
- name: telemetry
port: 8081
targetPort: telemetry
protocol: TCP
selector:
k8s-app: kube-state-metrics
apiVersion: v1
kind: Namespace
metadata:
name: monitoring
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
name: node-exporter
namespace: monitoring
labels:
name: node-exporter
spec:
template:
metadata:
labels:
name: node-exporter
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9100"
spec:
hostPID: true
hostIPC: true
hostNetwork: true
containers:
- name: node-exporter
image: prom/node-exporter:v0.16.0
securityContext:
privileged: true
args:
- --path.procfs=/host/proc
- --path.sysfs=/host/sys
ports:
- containerPort: 9100
protocol: TCP
resources:
limits:
cpu: 100m
memory: 100Mi
requests:
cpu: 10m
memory: 100Mi
volumeMounts:
- name: dev
mountPath: /host/dev
- name: proc
mountPath: /host/proc
- name: sys
mountPath: /host/sys
- name: rootfs
mountPath: /rootfs
volumes:
- name: proc
hostPath:
path: /proc
- name: dev
hostPath:
path: /dev
- name: sys
hostPath:
path: /sys
- name: rootfs
hostPath:
path: /
apiVersion: v1
kind: Namespace
metadata:
name: monitoring
---
apiVersion: storage.k8s.io/v1beta1
kind: StorageClass
metadata:
name: fast
namespace: monitoring
provisioner: kubernetes.io/gce-pd
allowVolumeExpansion: true
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
name: grafana
namespace: monitoring
spec:
replicas: 1
serviceName: grafana
template:
metadata:
labels:
task: monitoring
k8s-app: grafana
spec:
containers:
- name: grafana
image: k8s.gcr.io/heapster-grafana-amd64:v5.0.4
ports:
- containerPort: 3000
protocol: TCP
volumeMounts:
- mountPath: /etc/ssl/certs
name: ca-certificates
readOnly: true
- mountPath: /var
name: grafana-storage
env:
- name: GF_SERVER_HTTP_PORT
value: "3000"
          # On production clusters, we recommend removing the following env variables and setting up proper auth for Grafana.
- name: GF_AUTH_BASIC_ENABLED
value: "false"
- name: GF_AUTH_ANONYMOUS_ENABLED
value: "true"
- name: GF_AUTH_ANONYMOUS_ORG_ROLE
value: Admin
- name: GF_SERVER_ROOT_URL
# value: /api/v1/namespaces/kube-system/
services/monitoring-grafana/proxy
value: /
volumes:
- name: ca-certificates
hostPath:
path: /etc/ssl/certs
volumeClaimTemplates:
- metadata:
name: grafana-storage
namespace: monitoring
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: fast
resources:
requests:
storage: 5Gi
---
apiVersion: v1
kind: Service
metadata:
labels:
kubernetes.io/cluster-service: 'true'
kubernetes.io/name: grafana
name: grafana
namespace: monitoring
spec:
ports:
- port: 3000
targetPort: 3000
selector:
k8s-app: grafana
Deploying the Ingress Object: This is the final piece of the puzzle. It exposes all of our services outside the Kubernetes cluster so that we can access them.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
name: monitoring-ingress
namespace: monitoring
annotations:
kubernetes.io/ingress.class: "nginx"
spec:
rules:
- host: grafana.<yourdomain>.com
http:
paths:
- path: /
backend:
serviceName: grafana
servicePort: 3000
- host: prometheus-0.<yourdomain>.com
http:
paths:
- path: /
backend:
serviceName: prometheus-0-service
servicePort: 8080
- host: prometheus-1.<yourdomain>.com
http:
paths:
- path: /
backend:
serviceName: prometheus-1-service
servicePort: 8080
- host: prometheus-2.<yourdomain>.com
http:
paths:
- path: /
backend:
serviceName: prometheus-2-service
servicePort: 8080
- host: alertmanager.<yourdomain>.com
http:
paths:
- path: /
backend:
serviceName: alertmanager
servicePort: 9093
- host: thanos-querier.<yourdomain>.com
http:
paths:
- path: /
backend:
serviceName: thanos-querier
servicePort: 9090
- host: thanos-ruler.<yourdomain>.com
http:
paths:
- path: /
backend:
serviceName: thanos-ruler
servicePort: 9090
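For these hostnames to resolve, point DNS records for each <yourdomain> host at the external address of your NGINX ingress controller. A quick way to find that address, assuming the controller runs in the ingress-nginx namespace (adjust the namespace to your installation):

kubectl get svc -n ingress-nginx

The EXTERNAL-IP of the controller's LoadBalancer service is the address the DNS records should point to.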
If you open the Thanos Querier UI (thanos-querier.<yourdomain>.com) and click on Stores, you will be able to see all the active endpoints discovered by thanos-store-gateway.
13.7 Conclusion
Example 3: Monitoring Redis Clusters with Prometheus
Example 3: Monitoring
Redis Clusters with
Prometheus
14.1 Introduction
The screenshot below shows the Prometheus data source settings menu. Change the URL to http://localhost:9090, set Access to Browser, and then click Save & Test.
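If you prefer to configure the data source from a file rather than the UI, Grafana's provisioning mechanism accepts a small YAML file. A minimal sketch is shown below; the file path and data source name are just examples:

# /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    # "direct" corresponds to the Browser access mode selected in the UI;
    # use "proxy" to have Grafana's backend make the requests instead.
    access: direct
    url: http://localhost:9090
    isDefault: true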
This dashboard shows four metrics collected from our Redis DB (example queries for each panel follow the list). They are:
1. Redis Client view - the total number of Redis clients
2. Key view - the total number of keys in each Redis DB instance
3. Commands processed - the number of commands processed
per group of machines
4. Memory - the total memory usage for each aggregation machine
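If you are scraping Redis with the standard redis_exporter, the four panels typically map to PromQL along these lines (the metric and label names are the exporter's defaults and may differ in your setup):

Redis Client view:   sum(redis_connected_clients)
Key view:            sum(redis_db_keys) by (instance)
Commands processed:  sum(rate(redis_commands_processed_total[5m])) by (instance)
Memory:              redis_memory_used_bytes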
This graph shows the total memory usage for the different aggregation machines. These machines are responsible for ingesting data and aggregating it into more manageable formats, so we want to monitor how much memory each one is using. When a resource gets close to its maximum memory consumption, performance will start to degrade, and a spike in memory usage can flag important changes in your application and processes.
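If your Redis instances have a maxmemory limit configured, a single expression gives you the headroom directly; this is a sketch using redis_exporter's default metric names:

redis_memory_used_bytes / redis_memory_max_bytes

A value approaching 1 means the instance is close to its limit. Note that redis_memory_max_bytes is 0 when no maxmemory is set, in which case the ratio is not meaningful.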
This is the zoomed-in Key View graph from the dashboard row above, showing the total number of keys in each Redis DB instance.
Similar to the other graphs, knowing the total number of keys within an instance gives administrators greater insight into each Redis DB. If you are using Redis as a distributed caching store, a graph like this helps ensure each instance is properly balanced and utilized. If an instance shows a significant drop in keys, that is an indicator to investigate further; a simple alerting rule for this case is sketched below.
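As a sketch, an alerting rule along these lines would flag such a drop (again assuming redis_exporter's redis_db_keys metric; tune the window and threshold to your workload):

groups:
  - name: redis-keys
    rules:
      - alert: RedisKeyCountDropped
        # Fires when an instance's key count is more than 20% below its 15-minute high.
        expr: (redis_db_keys - max_over_time(redis_db_keys[15m])) / max_over_time(redis_db_keys[15m]) < -0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Key count dropped sharply on {{ $labels.instance }}"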
There are a lot of metrics that are automatically pushed from Redis
DB. Take a look at a few below, and you can find a full list on the
Redis website.
Example 4: Prometheus
Metrics Based Autoscaling
in Kubernetes
15.1 Introduction
15.2.2 Prerequisites
Let's first deploy a sample app on which we will test our Prometheus-metrics-based autoscaling. We can use the manifest below to do it:
apiVersion: v1
kind: Namespace
metadata:
name: nginx
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
namespace: nginx
name: nginx-deployment
spec:
replicas: 1
template:
metadata:
annotations:
prometheus.io/path: "/status/format/prometheus"
prometheus.io/scrape: "true"
prometheus.io/port: "80"
labels:
app: nginx-server
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- nginx-server
topologyKey: kubernetes.io/hostname
containers:
- name: nginx-demo
image: vaibhavthakur/nginx-vts:1.0
imagePullPolicy: Always
resources:
limits:
cpu: 2500m
requests:
cpu: 2000m
ports:
- containerPort: 80
name: http
---
apiVersion: v1
kind: Service
metadata:
namespace: nginx
name: nginx-service
spec:
ports:
- port: 80
targetPort: 80
name: http
selector:
app: nginx-server
type: LoadBalancer
Once you have applied these manifests, describe the deployment to verify it:

$ kubectl describe deployment nginx-deployment -n nginx
Name: nginx-deployment
Namespace: nginx
Labels: app=nginx-server
Annotations: deployment.kubernetes.io/revision: 1
kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"extensions/v1beta1","kind":"Deployment","metadata":{"annotations":{},"name":"nginx-deployment","namespace":"nginx"},"spec":...
Selector: app=nginx-server
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
Pod Template:
Labels: app=nginx-server
Annotations: prometheus.io/path: /status/format/prometheus
prometheus.io/port: 80
prometheus.io/scrape: true
Containers:
nginx-demo:
Image: vaibhavthakur/nginx-vts:v1.0
Port: 80/TCP
Limits:
cpu: 250m
Requests:
cpu: 200m
Environment: <none>
Mounts: <none>
Volumes: <none>
Conditions:
OldReplicaSets: <none>
NewReplicaSet: nginx-deployment-65d8df7488 (1/1 replicas created)
Events: <none>
Once the nginx-service LoadBalancer is reachable (here via the hostname nginx.gotham.com used below), curling it returns the default nginx page:

$ curl nginx.gotham.com
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
    body {
        width: 35em;
        margin: 0 auto;
        font-family: Tahoma, Verdana, Arial, sans-serif;
    }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>
$ curl nginx.gotham.com/status/format/prometheus
nginx_vts_info{hostname="nginx-deployment-65d8df7488-c578v",version="1.13.12"} 1
nginx_vts_start_time_seconds 1574283147.043
nginx_vts_main_connections{status="accepted"} 215
nginx_vts_main_connections{status="active"} 4
nginx_vts_main_connections{status="handled"} 215
nginx_vts_main_connections{status="reading"} 0
nginx_vts_main_connections{status="requests"} 15577
nginx_vts_main_connections{status="waiting"} 3
nginx_vts_main_connections{status="writing"} 1
# HELP nginx_vts_main_shm_usage_bytes Shared memory [ngx_http_vhost_traffic_status] info
# TYPE nginx_vts_main_shm_usage_bytes gauge
nginx_vts_main_shm_usage_bytes{shared="max_size"} 1048575
nginx_vts_main_shm_usage_bytes{shared="used_size"} 3510
nginx_vts_main_shm_usage_bytes{shared="used_node"} 1
# HELP nginx_vts_server_bytes_total Nginx bytes count
# TYPE nginx_vts_server_bytes_total counter
# HELP nginx_vts_server_requests_total Nginx requests count
# TYPE nginx_vts_server_requests_total counter
nginx_vts_server_bytes_total{host="_",direction="in"}
3303449
nginx_vts_server_bytes_total{host="_",direction="out"}
61641572
nginx_vts_server_requests_total{host="_",code="1xx"} 0
nginx_vts_server_requests_total{host="_",code="2xx"} 15574
nginx_vts_server_requests_total{host="_",code="3xx"} 0
nginx_vts_server_requests_total{host="_",code="4xx"} 2
nginx_vts_server_requests_total{host="_",code="5xx"} 0
nginx_vts_server_requests_total{host="_",code="total"}
15576
nginx_vts_server_request_seconds_total{host="_"} 0.000
nginx_vts_server_request_seconds{host="_"} 0.000
nginx_vts_server_cache_total{host="_",status="miss"} 0
nginx_vts_server_cache_total{host="_",status="bypass"} 0
nginx_vts_server_cache_total{host="_",status="expired"} 0
nginx_vts_server_cache_total{host="_",status="stale"} 0
nginx_vts_server_cache_total{host="_",status="updating"} 0
nginx_vts_server_cache_total{host="_",status="revalidated"} 0
nginx_vts_server_cache_total{host="_",status="hit"} 0
nginx_vts_server_cache_total{host="_",status="scarce"} 0
nginx_vts_server_bytes_total{host="*",direction="in"}
3303449
nginx_vts_server_bytes_total{host="*",direction="out"}
61641572
nginx_vts_server_requests_total{host="*",code="1xx"} 0
nginx_vts_server_requests_total{host="*",code="2xx"} 15574
nginx_vts_server_requests_total{host="*",code="3xx"} 0
nginx_vts_server_requests_total{host="*",code="4xx"} 2
nginx_vts_server_requests_total{host="*",code="5xx"} 0
nginx_vts_server_requests_total{host="*",code="total"}
15576
nginx_vts_server_request_seconds_total{host="*"} 0.000
nginx_vts_server_request_seconds{host="*"} 0.000
nginx_vts_server_cache_total{host="*",status="miss"} 0
nginx_vts_server_cache_total{host="*",status="bypass"} 0
nginx_vts_server_cache_total{host="*",status="expired"} 0
nginx_vts_server_cache_total{host="*",status="stale"} 0
nginx_vts_server_cache_total{host="*",status="updating"} 0
nginx_vts_server_cache_total{host="*",status="revalidated"} 0
nginx_vts_server_cache_total{host="*",status="hit"} 0
nginx_vts_server_cache_total{host="*",status="scarce"} 0
Next, generate the TLS serving certificates for the Prometheus Adapter (the custom-metrics-apiserver), along with the Kubernetes secret that holds them. The Makefile below uses cfssl to do this:

SHELL=bash
PURPOSE:=metrics
SERVICE_NAME:=custom-metrics-apiserver
ALT_NAMES:="custom-metrics-apiserver.monitoring","custom-metrics-apiserver.monitoring.svc"
SECRET_FILE:=./cm-adapter-serving-certs.yaml
.PHONY: gencerts
gencerts:
@mkdir -p output
@touch output/apiserver.pem
@touch output/apiserver-key.pem
@echo '{"signing":{"default":{"expiry":"43800h","usag
"$(PURPOSE)-ca-config.json"
@echo
'{"CN":"'$(SERVICE_NAME)'","hosts":[$(ALT_NAMES)],"
${HOME}:${HOME} -v ${PWD}/metrics-ca.key:/go/src/github.
com/cloudflare/cfssl/metrics-ca.key -v ${PWD}/metrics-ca.
crt:/go/src/github.com/cloudflare/cfssl/metrics-ca.crt -v
${PWD}/metrics-ca-config.json:/go/src/github.com/cloudflare/
${HOME}:${HOME} -v ${PWD}/output:/go/src/github.com/
apiserver
.PHONY: gensecret
gensecret: gencerts
FILE)
endif
endif
.PHONY: rmcerts
rmcerts:
.PHONY: deploy-secret
deploy-secret:
Once you have created the Makefile, just run the following command:
make certs
and it will create the SSL certificates and the corresponding Kubernetes secret for you. Make sure the monitoring namespace exists before you create the secret. This secret will be used by the Prometheus Adapter, which we will deploy next.
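If everything worked, the secret should now exist in the monitoring namespace; you can confirm it with the command below (the secret name matches the one mounted by the adapter deployment later in this chapter):

kubectl get secret cm-adapter-serving-certs -n monitoring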
15.2.5 Create Prometheus Adapter ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: adapter-config
namespace: monitoring
data:
config.yaml: |
rules:
- seriesQuery: 'nginx_vts_server_requests_total'
resources:
overrides:
kubernetes_namespace:
resource: namespace
kubernetes_pod_name:
resource: pod
name:
matches: "^(.*)_total"
as: "${1}_per_second"
metricsQuery: (sum(rate(<<.Series>>{<<.LabelMatchers>>}[1m])) by (<<.GroupBy>>))
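To make the rule concrete: when the custom metric is requested for pods in the nginx namespace, the metricsQuery template above expands to roughly the following PromQL (illustrative; the adapter fills in the actual label matchers and grouping at request time):

sum(rate(nginx_vts_server_requests_total{kubernetes_namespace="nginx",kubernetes_pod_name=~"nginx-deployment-.*"}[1m])) by (kubernetes_pod_name)

The name section also rewrites nginx_vts_server_requests_total to nginx_vts_server_requests_per_second, which is the metric name we will reference from the HPA later in this chapter.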
Next, deploy the Prometheus Adapter itself. Its deployment mounts both the serving certificates and the ConfigMap we just created:

apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: custom-metrics-apiserver
name: custom-metrics-apiserver
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: custom-metrics-apiserver
template:
metadata:
labels:
app: custom-metrics-apiserver
name: custom-metrics-apiserver
spec:
serviceAccountName: monitoring
containers:
- name: custom-metrics-apiserver
image: quay.io/coreos/k8s-prometheus-adapter-amd64:v0.4.1
args:
- /adapter
- --secure-port=6443
- --tls-cert-file=/var/run/serving-cert/serving.crt
- --tls-private-key-file=/var/run/serving-cert/serving.key
- --logtostderr=true
- --prometheus-url=http://thanos-querier.monitoring:9090/
- --metrics-relist-interval=30s
- --v=10
- --config=/etc/adapter/config.yaml
ports:
- containerPort: 6443
volumeMounts:
- mountPath: /var/run/serving-cert
name: volume-serving-cert
readOnly: true
- mountPath: /etc/adapter/
name: config
readOnly: true
volumes:
- name: volume-serving-cert
secret:
secretName: cm-adapter-serving-certs
- name: config
configMap:
name: adapter-config
This creates the deployment, which spawns the Prometheus Adapter pod to pull metrics from Prometheus. Note that we set the argument --prometheus-url=http://thanos-querier.monitoring:9090/ so the adapter queries the Thanos Querier, which has a global view across the Prometheus replicas. If you look at the logs of this container, you can see it fetching the metric defined in the config file:
http://thanos-querier.monitoring:9090/api/v1/series?match%5B%5D=nginx_vts_server_requests_total&start=1574381213.217 200 OK
{"status":"success","data":[
{"__name__":"nginx_vts_server_requests_total","app":"nginx-server","cluster":"prometheus-ha","code":"1xx","host":"*","instance":"10.60.64.39:80","job":"kubernetes-pods","kubernetes_namespace":"nginx","kubernetes_pod_name":"nginx-deployment-65d8df7488-sbp95","pod_template_hash":"65d8df7488"},
{"__name__":"nginx_vts_server_requests_total","app":"nginx-server","cluster":"prometheus-ha","code":"1xx","host":"*","instance":"10.60.64.8:80","job":"kubernetes-pods","kubernetes_namespace":"nginx","kubernetes_pod_name":"nginx-deployment-65d8df7488-mwzxg","pod_template_hash":"65d8df7488"},
{"__name__":"nginx_vts_server_requests_total","app":"nginx-server","cluster":"prometheus-ha","code":"1xx","host":"_","instance":"10.60.64.39:80","job":"kubernetes-pods","kubernetes_namespace":"nginx","kubernetes_pod_name":"nginx-deployment-65d8df7488-sbp95","pod_template_hash":"65d8df7488"},
{"__name__":"nginx_vts_server_requests_total","app":"nginx-server","cluster":"prometheus-ha","code":"1xx","host":"_","instance":"10.60.64.8:80","job":"kubernetes-pods","kubernetes_namespace":"nginx","kubernetes_pod_name":"nginx-deployment-65d8df7488-mwzxg","pod_template_hash":"65d8df7488"},
{"__name__":"nginx_vts_server_requests_total","app":"nginx-server","cluster":"prometheus-ha","code":"2xx","host":"*","instance":"10.60.64.39:80","job":"kubernetes-pods","kubernetes_namespace":"nginx","kubernetes_pod_name":"nginx-deployment-65d8df7488-sbp95","pod_template_hash":"65d8df7488"},
{"__name__":"nginx_vts_server_requests_total","app":"nginx-server","cluster":"prometheus-ha","code":"2xx","host":"*","instance":"10.60.64.8:80","job":"kubernetes-pods","kubernetes_namespace":"nginx","kubernetes_pod_name":"nginx-deployment-65d8df7488-mwzxg","pod_template_hash":"65d8df7488"},
{"__name__":"nginx_vts_server_requests_total","app":"nginx-server","cluster":"prometheus-ha","code":"2xx","host":"_","instance":"10.60.64.39:80","job":"kubernetes-pods","kubernetes_namespace":"nginx","kubernetes_pod_name":"nginx-deployment-65d8df7488-sbp95","pod_template_hash":"65d8df7488"},
{"__name__":"nginx_vts_server_requests_total","app":"nginx-server","cluster":"prometheus-ha","code":"2xx","host":"_","instance":"10.60.64.8:80","job":"kubernetes-pods","kubernetes_namespace":"nginx","kubernetes_pod_name":"nginx-deployment-65d8df7488-mwzxg","pod_template_hash":"65d8df7488"},
{"__name__":"nginx_vts_server_requests_total","app":"nginx-server","cluster":"prometheus-ha","code":"3xx","host":"*","instance":"10.60.64.39:80","job":"kubernetes-pods","kubernetes_namespace":"nginx","kubernetes_pod_name":"nginx-deployment-65d8df7488-sbp95","pod_template_hash":"65d8df7488"},
{"__name__":"nginx_vts_server_requests_total","app":"nginx-server","cluster":"prometheus-ha","code":"3xx","host":"*","instance":"10.60.64.8:80","job":"kubernetes-pods","kubernetes_namespace":"nginx","kubernetes_pod_name":"nginx-deployment-65d8df7488-mwzxg","pod_template_hash":"65d8df7488"},
{"__name__":"nginx_vts_server_requests_total","app":"nginx-server","cluster":"prometheus-ha","code":"3xx","host":"_","instance":"10.60.64.39:80","job":"kubernetes-pods","kubernetes_namespace":"nginx","kubernetes_pod_name":"nginx-deployment-65d8df7488-sbp95","pod_template_hash":"65d8df7488"},
{"__name__":"nginx_vts_server_requests_total","app":"nginx-server","cluster":"prometheus-ha","code":"3xx","host":"_","instance":"10.60.64.8:80","job":"kubernetes-pods","kubernetes_namespace":"nginx","kubernetes_pod_name":"nginx-deployment-65d8df7488-mwzxg","pod_template_hash":"65d8df7488"},
{"__name__":"nginx_vts_server_requests_total","app":"nginx-server","cluster":"prometheus-ha","code":"4xx","host":"*","instance":"10.60.64.39:80","job":"kubernetes-pods","kubernetes_namespace":"nginx","kubernetes_pod_name":"nginx-deployment-65d8df7488-sbp95","pod_template_hash":"65d8df7488"},
{"__name__":"nginx_vts_server_requests_total","app":"nginx-server","cluster":"prometheus-ha","code":"4xx","host":"*","instance":"10.60.64.8:80","job":"kubernetes-pods","kubernetes_namespace":"nginx","kubernetes_pod_name":"nginx-deployment-65d8df7488-mwzxg","pod_template_hash":"65d8df7488"},
{"__name__":"nginx_vts_server_requests_total","app":"nginx-server","cluster":"prometheus-ha","code":"4xx","host":"_","instance":"10.60.64.39:80","job":"kubernetes-pods","kubernetes_namespace":"nginx","kubernetes_pod_name":"nginx-deployment-65d8df7488-sbp95","pod_template_hash":"65d8df7488"},
{"__name__":"nginx_vts_server_requests_total","app":"nginx-server","cluster":"prometheus-ha","code":"4xx","host":"_","instance":"10.60.64.8:80","job":"kubernetes-pods","kubernetes_namespace":"nginx","kubernetes_pod_name":"nginx-deployment-65d8df7488-mwzxg","pod_template_hash":"65d8df7488"},
{"__name__":"nginx_vts_server_requests_total","app":"nginx-server","cluster":"prometheus-ha","code":"5xx","host":"*","instance":"10.60.64.39:80","job":"kubernetes-pods","kubernetes_namespace":"nginx","kubernetes_pod_name":"nginx-deployment-65d8df7488-sbp95","pod_template_hash":"65d8df7488"},
{"__name__":"nginx_vts_server_requests_total","app":"nginx-server","cluster":"prometheus-ha","code":"5xx","host":"*","instance":"10.60.64.8:80","job":"kubernetes-pods","kubernetes_namespace":"nginx","kubernetes_pod_name":"nginx-deployment-65d8df7488-mwzxg","pod_template_hash":"65d8df7488"},
{"__name__":"nginx_vts_server_requests_total","app":"nginx-server","cluster":"prometheus-ha","code":"5xx","host":"_","instance":"10.60.64.39:80","job":"kubernetes-pods","kubernetes_namespace":"nginx","kubernetes_pod_name":"nginx-deployment-65d8df7488-sbp95","pod_template_hash":"65d8df7488"},
{"__name__":"nginx_vts_server_requests_total","app":"nginx-server","cluster":"prometheus-ha","code":"5xx","host":"_","instance":"10.60.64.8:80","job":"kubernetes-pods","kubernetes_namespace":"nginx","kubernetes_pod_name":"nginx-deployment-65d8df7488-mwzxg","pod_template_hash":"65d8df7488"},
{"__name__":"nginx_vts_server_requests_total","app":"nginx-server","cluster":"prometheus-ha","code":"total","host":"*","instance":"10.60.64.39:80","job":"kubernetes-pods","kubernetes_namespace":"nginx","kubernetes_pod_name":"nginx-deployment-65d8df7488-sbp95","pod_template_hash":"65d8df7488"},
{"__name__":"nginx_vts_server_requests_total","app":"nginx-server","cluster":"prometheus-ha","code":"total","host":"*","instance":"10.60.64.8:80","job":"kubernetes-pods","kubernetes_namespace":"nginx","kubernetes_pod_name":"nginx-deployment-65d8df7488-mwzxg","pod_template_hash":"65d8df7488"},
{"__name__":"nginx_vts_server_requests_total","app":"nginx-server","cluster":"prometheus-ha","code":"total","host":"_","instance":"10.60.64.39:80","job":"kubernetes-pods","kubernetes_namespace":"nginx","kubernetes_pod_name":"nginx-deployment-65d8df7488-sbp95","pod_template_hash":"65d8df7488"},
{"__name__":"nginx_vts_server_requests_total","app":"nginx-server","cluster":"prometheus-ha","code":"total","host":"_","instance":"10.60.64.8:80","job":"kubernetes-pods","kubernetes_namespace":"nginx","kubernetes_pod_name":"nginx-deployment-65d8df7488-mwzxg","pod_template_hash":"65d8df7488"}]}
Next, create a Service for the Prometheus Adapter and register it with the Kubernetes API aggregation layer as an APIService, so that the custom.metrics.k8s.io API is served by the adapter:

apiVersion: v1
kind: Service
metadata:
name: custom-metrics-apiserver
namespace: monitoring
spec:
ports:
- port: 443
targetPort: 6443
selector:
app: custom-metrics-apiserver
---
apiVersion: apiregistration.k8s.io/v1beta1
kind: APIService
metadata:
name: v1beta1.custom.metrics.k8s.io
spec:
service:
name: custom-metrics-apiserver
namespace: monitoring
group: custom.metrics.k8s.io
version: v1beta1
insecureSkipTLSVerify: true
groupPriorityMinimum: 100
versionPriority: 100
v1beta1" | jq .
"kind": "APIResourceList",
"apiVersion": "v1",
"groupVersion": "custom.metrics.k8s.io/v1beta1",
"resources": [
"name": "pods/nginx_vts_server_requests_per_second",
"singularName": "",
"namespaced": true,
"kind": "MetricValueList",
"verbs": [
"get"
225
Example 4: Prometheus Metrics Based Autoscaling in Kubernetes
},
"name": "namespaces/nginx_vts_server_requests_per_
second",
"singularName": "",
"namespaced": false,
"kind": "MetricValueList",
"verbs": [
"get"
$ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/nginx/pods/*/nginx_vts_server_requests_per_second" | jq .
{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "metadata": {
    "selfLink": "/apis/custom.metrics.k8s.io/v1beta1/namespaces/nginx/pods/%2A/nginx_vts_server_requests_per_second"
  },
  "items": [
    {
      "describedObject": {
        "kind": "Pod",
        "namespace": "nginx",
        "name": "nginx-deployment-65d8df7488-v575j",
        "apiVersion": "/v1"
      },
      "metricName": "nginx_vts_server_requests_per_second",
      "timestamp": "2019-11-19T18:38:21Z",
      "value": "1236m"
    }
  ]
}
Create an HPA which will utilize these metrics. We can use the
manifest below to do it:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
name: nginx-custom-hpa
namespace: nginx
spec:
scaleTargetRef:
apiVersion: extensions/v1beta1
kind: Deployment
name: nginx-deployment
minReplicas: 2
maxReplicas: 10
metrics:
- type: Pods
pods:
metricName: nginx_vts_server_requests_per_second
targetAverageValue: 4000m
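Here targetAverageValue: 4000m is Kubernetes quantity notation for 4: the HPA adds replicas whenever the average nginx_vts_server_requests_per_second across the deployment's pods rises above 4 requests per second, within the 2-10 replica bounds. You can watch it react with:

kubectl get hpa nginx-custom-hpa -n nginx -w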
Once you have applied this manifest, you can check the current status of the HPA as follows:

$ kubectl describe hpa nginx-custom-hpa -n nginx
Name: nginx-custom-hpa
Namespace: nginx
Labels: <none>
Annotations: autoscaling.alpha.kubernetes.io/metrics:
[{"type":"Pods","pods":{"metricName":"nginx_vts_server_requests_per_second","targetAverageValue":"4"}}]
kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"autoscaling/v2beta1","kind":"HorizontalPodAutoscaler","metadata":{"annotations":{},"name":"nginx-custom-hpa","namespace":"n...
Reference: Deployment/nginx-deployment
Min replicas: 2
Max replicas: 10
Events: <none>
If you now generate load against the service and watch the pods in the nginx namespace, you can see new replicas being created as the HPA scales the deployment out:
NAME AGE
nginx-deployment-65d8df7488-pwjzm 0s
nginx-deployment-65d8df7488-pwjzm 0s
nginx-deployment-65d8df7488-pwjzm 0s
nginx-deployment-65d8df7488-pwjzm 2s
nginx-deployment-65d8df7488-pwjzm 4s
nginx-deployment-65d8df7488-jvbvp 0s
nginx-deployment-65d8df7488-jvbvp 0s
nginx-deployment-65d8df7488-jvbvp 1s
nginx-deployment-65d8df7488-jvbvp 4s
nginx-deployment-65d8df7488-jvbvp 7s
nginx-deployment-65d8df7488-skjkm 0s
nginx-deployment-65d8df7488-skjkm 0s
nginx-deployment-65d8df7488-jh5vw 0s
nginx-deployment-65d8df7488-skjkm 0s
nginx-deployment-65d8df7488-jh5vw 0s
nginx-deployment-65d8df7488-jh5vw 1s
nginx-deployment-65d8df7488-skjkm 2s
nginx-deployment-65d8df7488-jh5vw 2s
nginx-deployment-65d8df7488-skjkm 3s
nginx-deployment-65d8df7488-jh5vw 4s
15.4 Conclusion
This setup demonstrates how we can use the Prometheus Adapter to autoscale deployments based on custom metrics. For the sake of simplicity, we have only fetched one metric from our Prometheus server. However, the Adapter ConfigMap can be extended to fetch some or all of the available metrics and use them for autoscaling.