Prometheus Monitor
This section covers what Prometheus is, the different use cases where Prometheus is used, and why it is such an important tool in modern infrastructure. We're going to go through the Prometheus architecture, so the different components it contains, see an example configuration, and also look at some of its key characteristics: why it became so widely accepted and popular, especially in containerized environments.
https://www.youtube.com/watch?v=h4Sl21AKiDg
Prometheus was created to monitor highly dynamic container environments like Kubernetes, Docker Swarm, etc. However, it can also be used in a traditional, non-container infrastructure, where you have just bare servers with applications deployed directly on them. Over the past years Prometheus has become the mainstream monitoring tool of choice in the container and microservice world.
So typically, you have multiple servers that run containerized applications, there are hundreds of different processes running on that infrastructure, and things are interconnected, so maintaining such a setup so that it runs smoothly and without application downtimes is very challenging. Imagine having such a complex infrastructure, with loads of servers distributed over many locations, and you have no insight into what is happening on the hardware level or on the application level: errors, response latency, hardware down or overloaded, maybe running out of resources, etc. In such a complex infrastructure there are more things that can go wrong. When you have tons of services and applications deployed, any one of them can crash and cause failure of other services; you have so many moving pieces, and suddenly the application becomes unavailable to users. You must quickly identify what exactly, out of those hundred different things, went wrong, and that can be difficult and time-consuming when debugging the system manually.
So let's take a specific example. Say one specific server ran out of memory and kicked off a running container that was responsible for providing database sync between two database pods in a Kubernetes cluster. That in turn caused those two database pods to fail. That database was used by an authentication service, which also stopped working because the database became unavailable, and then the application that depended on that authentication service couldn't authenticate users in the UI anymore. But from a user perspective, all you see is an error in the UI: can't log in. So how do you know what actually went wrong when you don't have any insight into what's going on inside the cluster? You don't see that red line of the chain of events as displayed here; you just see the error.
So now you start working backwards from there to find the cause and fix it. You check: is the application back up and running, does it show an exception? Is the authentication service running, did it crash, why did it crash? And so on, all the way back to the initial container failure. What would make this process of searching for the problem more efficient is a tool that constantly monitors whether services are running and alerts the maintainers as soon as one service crashes, so you know exactly what happened. Or, even better, a tool that identifies problems before they even occur and alerts the system administrators responsible for that infrastructure to prevent the issue.
So, for example, in this case it would regularly check the status of memory usage on each server, and when on one of the servers it spikes over, say, 70 percent for over an hour, or keeps increasing, it would notify about the risk that the memory on that server might soon run out. Or let's consider another scenario where you suddenly stop seeing logs for your application, because Elasticsearch doesn't accept any new logs, because the server ran out of disk space or Elasticsearch reached the storage limit that was allocated for it. Again, the monitoring tool would continuously check the storage space, compare it with Elasticsearch's storage consumption, see the risk, and notify the maintainers of the possible storage issue. And you can tell the monitoring tool what that critical point is, when the alert should be triggered. For example, if you have a very important application that absolutely cannot afford any log data loss, you may be very strict and want to take measures as soon as 50 or 60 percent capacity is reached. Or maybe you know that adding more storage space will take long, because it's a bureaucratic process in your organization where you need approval from some IT department and several other people; then maybe you also want to be notified earlier about the possible storage issue,
so that you have more time to fix it. Or a third scenario, where the application suddenly becomes too slow because one service breaks down and starts sending hundreds of error messages in a loop across the network; that creates high network traffic and slows down other services too. Having a tool that detects such spikes in network load, plus tells you which service is responsible for causing it, can give you a timely alert to fix the issue. And such automated monitoring and alerting is exactly what Prometheus offers as part of a modern DevOps workflow.
So how does Prometheus actually work, and what does its architecture look like? At its core, Prometheus has a main component called the Prometheus server that does the actual monitoring work, and it is made up of three parts:
- a time series database that stores all the metrics data, like current CPU usage or the number of exceptions in an application
- a data retrieval worker that is responsible for getting, or pulling, those metrics from applications, services, servers and other target resources, and storing them, or pushing them, into that database
- a web server, or server API, that accepts queries for that stored data; that web server component, or server API, is used to display the data in a dashboard or UI, either through the Prometheus dashboard or some other data visualization tool like Grafana
So, the Prometheus server monitors a particular thing, and that thing could be anything: it could be an entire Linux server or Windows server, it could be a standalone Apache server, a single application, or a service like a database. Those things that Prometheus monitors are called targets, and each target has units that you monitor. For a Linux server target it could be current CPU status, memory usage, disk space usage, etc.; for an application it could be, for example, the number of exceptions, the number of requests, or request duration. The unit that you would like to monitor for a specific target is called a metric, and metrics are what gets saved into the Prometheus database
component. Prometheus defines a human-readable, text-based format for these metrics entries. Each entry has TYPE and HELP attributes to increase readability: HELP is basically a description that just describes what the metric is about, and TYPE specifies the metric type. For metrics about how many times something happened, like the number of exceptions an application had or the number of requests it has received, there is a counter type. A metric that can go both up and down is represented by a gauge, for example: what is the current value of CPU usage now, what is the current capacity of disk space now, or what is the number of concurrent requests at a given moment. And for tracking how long something took, or how big something was, for example the size of a request, there is a histogram type.
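As a sketch of what such an entry looks like in the Prometheus text format (the metric name, labels and value here are illustrative, not taken from a specific target), HELP, TYPE and the sample itself appear together:

```
# HELP http_requests_total The total number of HTTP requests received.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027
```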
So now the interesting question is: how does Prometheus actually collect those metrics from the targets? Prometheus pulls metrics data from the targets from an HTTP endpoint, which by default is the host address plus /metrics. For that to work, first, the target must expose that /metrics endpoint, and second, the data available at the /metrics endpoint must be in a format that Prometheus understands, like the example metric we saw before. Some services already expose Prometheus endpoints, so you don't need extra work to gather metrics from them, but many services don't have native Prometheus endpoints, so an extra component is required, and this component is an exporter. An exporter is basically a script or service that fetches metrics from your target, converts them into the format Prometheus understands, and exposes this converted data at its own /metrics endpoint, where Prometheus can scrape it. Prometheus has a list of exporters for different services like MySQL, Elasticsearch, Linux servers, build tools, cloud platforms and so on.
So, for example, if you want to monitor a Linux server, you can download a node exporter tar file from the Prometheus repository, untar and execute it, and it will start converting the metrics of the server and making them scrapable at its own /metrics endpoint. Then you can configure Prometheus to scrape that endpoint. These exporters are also available as Docker images.
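As a minimal sketch, assuming the node exporter is running on its default port 9100 on the same host as Prometheus, the scrape job you would add to prometheus.yml could look roughly like this:

```yaml
scrape_configs:
  - job_name: "node_exporter"        # name of this scrape job
    static_configs:
      - targets: ["localhost:9100"]  # node exporter's default port; the /metrics path is implied
```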
If you want to monitor your MySQL container in a Kubernetes cluster, you can deploy a sidecar container with the MySQL exporter; it will run inside the pod with the MySQL container, connect to it, and start translating MySQL metrics for Prometheus, making them available at its own /metrics endpoint. And again, once you add the MySQL exporter endpoint to the Prometheus configuration, Prometheus will start collecting those metrics and saving them in its database.
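A minimal sketch of such a sidecar setup, assuming the community prom/mysqld-exporter image and its default port 9104; the pod name, credentials and connection string are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mysql-with-exporter            # illustrative name
spec:
  containers:
    - name: mysql
      image: mysql:8.0
      env:
        - name: MYSQL_ROOT_PASSWORD
          value: "example-password"    # illustrative; use a Secret in practice
    - name: mysqld-exporter            # sidecar that translates MySQL metrics for Prometheus
      image: prom/mysqld-exporter
      env:
        - name: DATA_SOURCE_NAME       # how the exporter connects to MySQL in the same pod
          value: "root:example-password@(localhost:3306)/"
      ports:
        - containerPort: 9104          # exporter's default /metrics port
```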
What about monitoring your own applications? Let's say you want to see how many requests your application is getting at different times, how many exceptions are occurring, how many server resources your application is using, etc. For this use case there are Prometheus client libraries for different languages, like Node.js, Java, etc. Using these libraries you can expose the /metrics scraping endpoint in your application and provide the different metrics that are relevant for you on that endpoint. This is a pretty convenient way for the infrastructure team to tell developers: emit the metrics that are relevant to you, and we will collect and monitor them in our infrastructure. I will also link the list of client libraries Prometheus supports, where you can see the documentation on how to use them.
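As a small sketch using the official Python client library (the metric names, port and application logic here are illustrative; the same idea applies to the Node.js and Java clients mentioned above):

```python
from prometheus_client import Counter, Histogram, start_http_server
import random, time

# Metrics this hypothetical application chooses to expose
REQUESTS = Counter("app_requests_total", "Total number of requests received")
LATENCY = Histogram("app_request_duration_seconds", "Time spent handling a request")

def handle_request():
    REQUESTS.inc()              # count every request
    with LATENCY.time():        # record how long handling took
        time.sleep(random.random() / 10)

if __name__ == "__main__":
    start_http_server(8000)     # exposes /metrics on port 8000 for Prometheus to scrape
    while True:
        handle_request()
```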
Prometheus pulls
Prometheus pulls this data from the endpoints, and that's actually an important characteristic of Prometheus. Let's see why. Most monitoring systems, like Amazon CloudWatch or New Relic, use a push system, meaning applications and servers are responsible for pushing their metric data to a centralized collection platform of that monitoring tool. When you're working with many microservices and you have each service pushing its metrics to the monitoring system, it creates a high load of traffic within your infrastructure, and your monitoring can actually become your bottleneck. So you have monitoring, which is great, but you pay the price of overloading your infrastructure with constant push requests from all the services, thus flooding the network, and you also have to install a daemon on each of these targets to push the metrics to the monitoring server. Prometheus, on the other hand, requires just a scraping endpoint, and this way metrics can also be pulled by multiple Prometheus instances. Another advantage of pull is that Prometheus can easily detect whether a service is up and running, for example when it doesn't respond to the pull or when the endpoint isn't available. With push, if a service doesn't push any data or send its health status, there might be many reasons other than the service not running: the network isn't working, the package got lost on the way, or some other problem, so you don't really have insight into what happened.
But there are a limited number of cases where a target that needs to be monitored runs only for a short time, so it isn't around long enough to be scraped. An example could be a batch job or scheduled job that, say, cleans up some old data or does backups. For such jobs Prometheus offers the Pushgateway component: these short-lived services push their metrics to the Pushgateway, and Prometheus scrapes them from there. But obviously, using the Pushgateway to gather metrics in Prometheus should be the exception, because of the reasons I mentioned earlier.
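A minimal sketch of what such a short-lived job could do using the Python client's Pushgateway support (the gateway address, job name and metric are illustrative):

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
last_success = Gauge(
    "batch_job_last_success_unixtime",
    "Unix timestamp of the last successful run of this batch job",
    registry=registry,
)
last_success.set_to_current_time()

# Push to the Pushgateway (default port 9091); Prometheus then scrapes it from there.
push_to_gateway("localhost:9091", job="nightly_cleanup", registry=registry)
```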
So how does Prometheus know what to scrape and when? All of that is configured in the prometheus.yml configuration file: you define which targets Prometheus should scrape and at what interval, and Prometheus then uses a service discovery mechanism to find those target endpoints.
Service Discovery
When you first download and install Prometheus, you will see a sample config file with some default values in it. Here is an example. We have a global config that defines the scrape interval, or how often Prometheus will scrape its targets, and you can override this for individual targets. The rule_files block specifies the location of any rules we want the Prometheus server to load, and the rules are basically either for aggregating metric values or for creating alerts when some condition is met, like CPU usage reaching 80 percent, for example. So Prometheus uses rules to create new time series entries and to generate alerts, and the evaluation_interval option in the global config defines how often Prometheus will evaluate these rules. The last block, scrape_configs, controls what resources Prometheus monitors; this is where you define the targets. Since Prometheus has its own metrics endpoint to expose its own data, it can monitor its own health.
In this default configuration there is a single job called prometheus, which scrapes the metrics exposed by the Prometheus server itself. So it has a single target at localhost:9090, and Prometheus expects metrics to be available on a target on the path /metrics, which is the default path configured for that endpoint. Here you can also define other endpoints to scrape through jobs: you can create another job and, for example, override the scrape interval from the global configuration and define the target host address.
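A sketch of such a default prometheus.yml, reconstructed from the description above (the extra job and its values are illustrative):

```yaml
global:
  scrape_interval: 15s       # how often Prometheus scrapes its targets
  evaluation_interval: 15s   # how often Prometheus evaluates the rules

rule_files:
  # - "first.rules"          # rule files for aggregation and alerting rules
  # - "second.rules"

scrape_configs:
  - job_name: "prometheus"               # Prometheus scraping its own /metrics endpoint
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "my-app"                   # illustrative extra job
    scrape_interval: 30s                 # overrides the global scrape interval
    static_configs:
      - targets: ["my-app-host:8000"]    # illustrative target host address
```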
A couple of important points here. The first one: how does Prometheus actually trigger the alerts that are defined by the rules, and who receives them? Prometheus has a component called Alertmanager that is responsible for firing alerts via different channels; it could be email, it could be a Slack channel, or some other notification client.
Alert Manager
So the Prometheus server will read the alert rules, and if the condition in a rule is met, an alert gets fired through the configured channel.
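As a sketch of what such an alerting rule could look like in one of the rule files (the rule name, expression and threshold are illustrative, assuming node exporter CPU metrics are available):

```yaml
groups:
  - name: host-alerts
    rules:
      - alert: HostHighCpuLoad
        # average CPU usage across all cores is above 80% for 10 minutes
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load on {{ $labels.instance }}"
```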
The second point is Prometheus data storage: where does Prometheus store all this data that it collects and then aggregates, and how can other systems access it? Prometheus stores the metrics data on disk: it includes a local on-disk time series database, but it also optionally integrates with remote storage systems. The data is stored in a custom time series format, and because of that you can't write Prometheus data directly into a relational database, for example. Once you've collected the metrics, Prometheus also lets you query the metrics data on targets through its server API using the PromQL query language. You can use the Prometheus dashboard UI to ask the Prometheus server, via PromQL, to show for example the status of a particular target right now, or you can use a more powerful data visualization tool like Grafana to display the data, which under the hood also uses PromQL to get the data out of Prometheus. As an example of PromQL, one query here basically selects all HTTP status codes except the ones in the 400 range, and another one does a subquery on that for a period of 30 minutes; this is just to give you an idea of what the query language looks like.
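A hedged reconstruction of what those two queries might look like (the exact expressions shown in the video aren't reproduced here, so the metric name and intervals are representative):

```
# all HTTP requests whose status code is not in the 400 range
http_requests_total{status!~"4.."}

# subquery: the 5-minute rate of those requests, evaluated over the last 30 minutes at 1m resolution
rate(http_requests_total{status!~"4.."}[5m])[30m:1m]
```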
But with Grafana, instead of writing PromQL queries directly in the Prometheus server UI, you basically have the Grafana UI where you can create dashboards, which then in the background use PromQL to query the data you want to display. Now, concerning PromQL and the Prometheus configuration: configuring the prometheus.yml file to scrape different targets and then creating all those dashboards to display meaningful data out of the scraped metrics can actually be pretty complex, and it's also not very well documented, so there is a steep learning curve to learning how to correctly configure Prometheus and how to then query the collected metrics data to create dashboards.
The final point is an important characteristic of Prometheus: it is designed to be reliable even when other systems have an outage, so that you can diagnose the problems and fix them. Each Prometheus server is standalone and self-contained, meaning it doesn't depend on network storage or other remote services; it is meant to work when other parts of the infrastructure are broken, and you don't need to set up extensive infrastructure to use it, which of course is a great thing. However, it also has the disadvantage that Prometheus can be difficult to scale. When you have hundreds of servers, you might want to have multiple Prometheus servers that somewhere aggregate all this metrics data, and configuring that, and scaling Prometheus in that way, can actually be very difficult because of this characteristic. So while using a single node is less complex and you can get started very easily, it puts a limit on the number of metrics that can be monitored by Prometheus. To work around that, you either increase the capacity of the Prometheus server so it can store more metrics data, or you limit the number of metrics that Prometheus collects from the applications, keeping it down to only the relevant ones.
And finally, in terms of Prometheus with Docker and Kubernetes: as mentioned throughout the video with different examples, Prometheus is fully compatible with both. Prometheus components are available as Docker images and can therefore easily be deployed in Kubernetes or other container environments, and it integrates well with Kubernetes infrastructure, providing cluster node resource monitoring out of the box. That means once it's deployed on Kubernetes, it starts gathering metrics data on each Kubernetes node server without any extra configuration.