0% found this document useful (0 votes)
289 views12 pages

Prometheus Promql For Humans

This document provides an introduction and cheatsheet for PromQL, the query language for Prometheus. It explains the basics of instant and range vectors, important functions for visualizing range vectors as instant vectors like rate() and increase(), and how to filter and aggregate metrics using labels. It also provides examples for common queries like HTTP request rates, CPU and memory usage, and alert firing counts.

Uploaded by

Shirouit
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
289 views12 pages

Prometheus Promql For Humans

This document provides an introduction and cheatsheet for PromQL, the query language for Prometheus. It explains the basics of instant and range vectors, important functions for visualizing range vectors as instant vectors like rate() and increase(), and how to filter and aggregate metrics using labels. It also provides examples for common queries like HTTP request rates, CPU and memory usage, and alert firing counts.

Uploaded by

Shirouit
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 12

PromQL For Humans

PromQL is a built in query-language made for Prometheus. Here at Timber we've


found Prometheus to be awesome, but PromQL difficult to wrap our heads around.
This is our attempt to change that.

PromQL Cheatsheet

Basics
Instant Vectors

Only Instant Vectors can be graphed.

http_requests_total
This gives us all the http requests, but we've got 2 issues.

1. There are too many data points to decipher what's going on.
2. You'll notice that http_requests_total only goes up, because it's a counter.
These are common in Prometheus, but not useful to graph.

I'll show you how to approach both.

It's Easy To Filter By Label.

http_requests_total{job="prometheus", code="200"}
You Can Check A Substring Using Regex Matching.

http_requests_total{status_code=~"2.*"}

If you're interested in learning more, here are the docs on Regex.

Range Vectors

Contain data going back in time.


Recall: Only Instant Vectors can be graphed. You'll soon be able to see how to
visualize Range Vectors using functions.

http_requests_total[5m]

You can also use (s, m, h, d, w, y) to represent (seconds, minutes, hours, ...)
respectively.

Important Functions
For Range Vectors

You'll notice that we're able to graph all these functions. Since only Instant Vectors
can be graphed, they take a Range Vector as a parameter and return a Instant
Vector.

Increase Of Http_requests_total Averaged Over The Last 5 Minutes.

rate(http_requests_total[5m])

Irate

Looks at the 2 most recent samples (up to 5 minutes in the past), rather than
averaging like rate
irate(http_requests_total[5m])

It's best to use rate when alerting, because it creates a smooth graph since the
data is averaged over a period of time. Spikey graphs can cause alert overload,
fatigue, and bad times for all due to repeatedly triggering thresholds.

HTTP Requests In The Last Hour.

This is equal to the rate * # of seconds

increase(http_requests_total[1h])
These are a small fraction of the functions, just what we found most popular. You
can find the rest here.

For Instant Vectors


Broken By Status Code

sum(rate(http_requests_total[5m]))

You'll notice that rate(http_requests_total[5m]) above provides a large


amount of data. You can filter that data using your labels, but you can also look at
your system as a whole using sum (or do both).

You can also use min , max , avg , count , and quantile similarly.
This query tells you how many total HTTP requests there are, but isn't directly useful
in deciphering issues in your system. I'll show you some functions that allow you to
gain insight into your system.

Sum By Status Code

sum by (status_code) (rate(http_requests_total[5m]))

You can also use without rather than by to sum on everything not passed as a
parameter to without.
Now, you can see the difference between each status code.

Offset

You can use offset to change the time for Instant and Range Vectors. This can
be helpful for comparing current usage to past usage when determining the
conditions of an alert.

sum(rate(http_requests_total[5m] offset 5m))

Remember to put offset directly after the selector.

Operators
Operators can be used between scalars, vectors, or a mix of the two. Operations
between vectors expect to find matching elements for each side (also known as
one-to-one matching), unless otherwise specified.

There are Arithmetic (+, -, *, /, %, ^), Comparison (==, !=, >, <, >=, <=) and Logical
(and, or, unless) operators.

Vector Matching

One-to-One

Vectors are equal i.f.f. the labels are equal.

API 5xxs Are 10% Of HTTP Requests

rate(http_requests_total{status_code=~"5.*"}[5m]) > .1 * rate(http


_requests_total[5m])
We're looking to graph whenever more than 10% of an instance's HTTP requests
are errors. Before comparing rates, PromQL first checks to make sure that the
vector's labels are equal.

You can use on to compare using certain labels or ignoring to compare on all
labels except.

Many-to-One

It's possible to use comparison and arithmetic operations where an element on one
side can be matched with many elements on the other side. You must explicitly tell
Prometheus what to do with the extra dimensions.

You can use group_left if the left side has a higher cardinality, else use group
_right .

Examples
Disclaimer: We've hidden some of the information in the pictures using the Legend
Format for privacy reasons.

CPU Usage By Instance


100 * (1 - avg by(instance)(irate(node_cpu{mode='idle'}[5m])))

Average CPU Usage per instance for a 5 minute window.

Memory Usage

node_memory_Active / on (instance) node_memory_MemTotal

Percentage of memory being used by instance.


Disk Space

node_filesystem_avail{fstype!~"tmpfs|fuse.lxcfs|squashfs"} / node_
filesystem_size{fstype!~"tmpfs|fuse.lxcfs|squashfs"}

Percentage of disk space being used by instance. We're looking for the available
space, ignoring instances that have tmpfs , fuse.lxcfs , or squashfs in their
fstype and dividing that by their total size.

HTTP Error Rates As A % Of Traffic

rate(http_requests_total{status_code=~"5.*"}[5m]) / rate(http_requ
ests_total[5m])

Alerts Firing In The Last 24 Hours


sum(sort_desc(sum_over_time(ALERTS{alertstate="firing"}[24h]))) by
(alertname)

You can find more useful examples here.

3 Pillars Of Observability
It's important to understand where metrics fit in when it comes to observing your
application. I recommend you take a look at the 3 pillars of observability principle.
Metrics are an important part of your observability stack, but logs and tracing are
equally so.

We're a cloud-based logging company at Timber that seamlessly augments your


logs with context. We've got a great product built, and you can check it out for free!

You might also like