
Sovos Grafana Overview Kickoff Intro

This document discusses Sovos' goals in using Grafana to create a centralized observability platform. It outlines their current state with multiple monitoring tools and lack of cohesion. Their future state vision is to use Grafana's unified and scalable approach to provide a "first pane of glass" dashboard that empowers teams with correlated metrics and reduces mean time to resolution. The document then demonstrates Grafana's approach to observability including its open source roots, pillars of observability for metrics, logs and traces, and how the Grafana agent can collect telemetry data.

Uploaded by

kakopah863

Agenda
● Shared Understanding
○ Sovos’ goals with Grafana
● Grafana Labs Overview
● Approach to Observability
● Three Pillars
● Demo
Shared Understanding
Project Overview

Sovos is a software company that creates tax & compliance solutions and has experienced significant growth, both organically and via acquisition.
The Cloud, SRE, and Development Engineering teams are focused on building and delivering great software platforms that drive high margins
for the business. Our mission is to provide a unified, centralized, and scalable observability platform to support all lines of business and
products.

Current State
● Lots of acquisitions, multiple different toolsets, leads to lack of cohesion and focus on metrics that matter
● Limits with existing entitlements for tooling creates a barrier to adopting a performance-based mindset, increasing MTTR incidents.
● Overspending drives up operational costs & reduces SaaS margins

Future State
● Centralized ‘first pane of glass’ that will enable a Global Product Strategy by visualizing persona-based KPIs.
● Empower dev/technology teams with data that tells the story of performance → enable confident growth at scale with new products.
● Create economies of scale → establish an observability model that is flexible enough to align to Sovos specific needs.

Required Capabilities
● Plug-in architecture that unifies data from legacy monitoring (AppD, Splunk) and can report on APM, Log, Infrastructure, DB.
● Custom data retention that is horizontally scalable and able to natively correlate metric, log, and distributed tracing data.
● A unified, scalable, and cost-effective logging solution that improves approachability and adoption of tools.

Success Metrics
● Improved user experience with correlated observability strategy and platform stability.
● Improve FTE productivity and performance of troubleshooting system with reduction in MTTR.
● Reduced overhead costs of expensive O11y solutions.
Open and composable observability
Meeting Sovos on the journey to reliably deliver your mission critical systems.

Unify Observability
❏ Visualize, alert and correlate data, no matter where the data lives
❏ Choose your unique best-of-breed stack
❏ Eliminate alert fatigue, accelerate incident resolution with SLO-driven observability

Scale Observability-as-a-service
❏ Standardized approach to your global observability stack
❏ Onboard teams faster with API-driven automation
❏ Build persona-based workflows tailored to multiple dev teams

Optimize costs at scale
❏ Consolidate tools to reduce spend while eliminating vendor lock-in
❏ Regulate telemetry growth with intelligent data controls
❏ Leverage a transparent and predictable pricing model with no surprise overages
Open source is at the heart of what we do

● Employ 91% of the Loki team members, including project founders
● Employ 89% of Grafana team members, including project founders
● Employ 100% of the Tempo team members, including project founders
● Employ 100% of the Mimir team members, including project founders
● Employ 100% of Pyroscope team members, including the project founders
● Employ 100% of k6 team members, including the project founders
● Employ 44% of the Prometheus team members
● The leading contributors to the Graphite project
● Employ contributors, including a Governance Committee member
● Employ 100% of OnCall team members, including the project founders
● Employ 100% of Faro team members, including the project founders

1,200+ Employees across 40+ countries
1M+ Instances across Grafana Cloud and OSS
20M+ Users across OSS and Cloud Free tier
Your open and composable observability stack

[Diagram: layers of the stack, from top to bottom]
● “Big Tent” data source integrations: API without data consolidation
● Performance Testing: Load testing, Browser testing
● Application Observability: Frontend, Service Maps
● Infrastructure Observability: Kubernetes, Server/VM, Database, Cloud providers, Serverless, eBPF
● Incident Response Management: SLO, Alerting, OnCall, Incident
● Platform: ML Insights | Security and governance | Configuration (as code)
● Your environment (Applications and Infrastructure) → Grafana Agent: native OTel and Prometheus, no lock-in, open standard
● OSS or commercial backends: Loki (Logs), Grafana (Visualizations), Tempo (Traces), Mimir (Metrics), Pyroscope (Profiles)
The Pillars of Observability

● WHAT went wrong
● WHERE it went wrong
● WHY it went wrong
How to collect your telemetry data
The community way Or The Grafana way
Anatomy of the Grafana Agent

● Metrics - Shares the same codebase as the Prometheus Agent.
● Logs - Embeds Promtail, the log forwarder built by Grafana, for Loki.
● Traces - Based on the OpenTelemetry Collector.
Metrics

Grafana Metrics is the simple and scalable solution for unifying your Prometheus metrics across multiple
systems, enabling both real-time and historical analysis in Grafana Cloud or self-hosted.

● Unified Metrics: Bring together the raw, unsampled metrics for all your applications and infrastructure, spread around the globe, in one place.
● Fast Global query performance: Query high-cardinality data with blazing fast PromQL and Graphite queries.
● Explore all your metrics: Centralize the analysis, visualization, and alerting on all of your metrics.
A bit of history

Prometheus and Grafana have become the de facto standard solutions for monitoring modern,
cloud-native applications and infrastructures.
Running Prometheus at scale

Prometheus is great but…

● Limited horizontal scalability: Out of the box, Prometheus scales only vertically.
● No robust federation: A centralised view of metrics can only be achieved using hierarchical federation or cross-service federation.
● Not designed for long term retention: Prometheus uses local storage. Traditionally retention is set to 15 days and rarely above 30 days.
● No security model: No authentication mechanism or role-based access controls for protecting your data.
Prometheus on steroids: Mimir
Mimir = Prometheus +
● Durable storage
● Blazing fast query performance
● Production-proven dashboards, alerts, and playbooks
● High availability
● Horizontal scalability
● Real multi-tenancy
Simplified architecture
Running Prometheus at scale

[Diagram: Application 1 to Application N in Region A and in Region B, each region connected by "Remote write" arrows, with a "Query" arrow for reading the metrics back]

For Prometheus users
● Leverage your existing investment by using Prometheus as a Metrics forwarder.
● 100% compatible with your existing queries, alerts and recording rules.

For all
● Get started in a few clicks using the Grafana agent (embeds the Prometheus agent).
● Query your Mimir metrics using Grafana.
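Because the queries stay PromQL, an existing query can be pointed at Mimir without being rewritten. The sketch below only builds an instant-query request and does not send it. The host name, the tenant name `team-a`, and the metric are hypothetical; the `/prometheus` path prefix and the `X-Scope-OrgID` tenant header are assumed to be at Mimir's defaults and may differ in a given deployment.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_instant_query(base_url: str, promql: str, tenant: str) -> Request:
    """Build (but do not send) a Prometheus-style instant query request.

    Assumes the Prometheus-compatible API is served under the /prometheus
    prefix and that the tenant is passed in the X-Scope-OrgID header.
    """
    query_string = urlencode({"query": promql})
    url = f"{base_url.rstrip('/')}/prometheus/api/v1/query?{query_string}"
    return Request(url, headers={"X-Scope-OrgID": tenant})


# Hypothetical endpoint, metric and tenant, for illustration only.
req = build_instant_query(
    "http://mimir.example.internal:8080",
    "sum(rate(http_requests_total[5m]))",
    "team-a",
)
print(req.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) is left out on purpose, since authentication and TLS settings depend on the environment.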
Logs
Grafana Logs brings together logs from all your applications and infrastructure in a single place. By using the
exact same service discovery as Prometheus, Loki can systematically guarantee your logs have consistent labels
with your metrics.

● Cost effective: By doing minimal indexing and relying only on object storage, users can store far more logs than other solutions for the same price.
● Easy to learn & use: Loki’s query language, LogQL, is like PromQL. This means one less thing for developers to learn once they know PromQL.
● Turn-key solution for metrics and logs correlation: Loki uses the same service discovery and label-based architecture as Prometheus, making it easy to jump between metrics and logs, preserving context and saving time.
Who did we make Loki for?

● DevOps: Effective debugging and troubleshooting of applications
● SRE: Visualise and alert on services/apps performance metrics
● DataEng: Build actionable insights from log data and other supported data sources
Why do they like Loki?

● Efficient at scale
● Built for correlation
● Logs as metrics
● Format agnostic

A NEW APPROACH TO LOGGING

Loki does not index the text of logs; instead, entries are grouped into streams and indexed with labels.

Under the hood

2019-12-11T10:01:02.123456789Z {app="nginx", env="dev"} GET /about 1034 Debug "page not found"

● Timestamp (indexed): with nanosecond precision
● Labels/Selectors (indexed): key-value pairs
● Log content (unindexed): JSON, logfmt, custom, etc.
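To make the stream model concrete, here is a minimal sketch of the idea in plain Python. It is not how Loki is implemented; `parse_entry` and `StreamStore` are hypothetical names, and the input is assumed to be written as timestamp, label set in braces with straight quotes, then content. Only the label set is used as a lookup key, and the log content is stored without being inspected.

```python
import re
from collections import defaultdict

# Assumed textual layout: <timestamp> {k="v", ...} <content>
LINE_RE = re.compile(r'^(?P<ts>\S+)\s+\{(?P<labels>[^}]*)\}\s+(?P<content>.*)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_entry(raw):
    """Split a raw entry into (timestamp, label set, content)."""
    m = LINE_RE.match(raw)
    if m is None:
        raise ValueError(f"unrecognised entry: {raw!r}")
    labels = frozenset(LABEL_RE.findall(m["labels"]))
    return m["ts"], labels, m["content"]


class StreamStore:
    """Groups entries into streams keyed by their label set."""

    def __init__(self):
        self.streams = defaultdict(list)  # label set -> [(ts, content)]

    def push(self, raw):
        ts, labels, content = parse_entry(raw)
        self.streams[labels].append((ts, content))

    def select(self, **matchers):
        """Yield entries from every stream whose labels include the matchers."""
        wanted = set(matchers.items())
        for labels, entries in self.streams.items():
            if wanted <= labels:
                yield from entries


store = StreamStore()
store.push('2019-12-11T10:01:02.123456789Z {app="nginx", env="dev"} GET /about 1034 Debug "page not found"')
store.push('2019-12-11T10:01:03.000000000Z {app="nginx", env="dev"} GET /home 200')
store.push('2019-12-11T10:01:04.000000000Z {app="nginx", env="prod"} GET /home 200')

dev_lines = [content for _, content in store.select(app="nginx", env="dev")]
```

Three entries produce only two streams here, which is why the index stays small: it grows with the number of distinct label sets, not with the volume of log text.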
The better tradeoff
Grafana Loki (query processing)
● Log any and all formats
● Smaller indexes
● Cheaper to run
● Fast queries
● Cut and slice your logs in dynamic ways

VS

Content indexing (upfront processing)
● Decide on log formats
● Larger indexes
● More expensive to run
● Faster queries
● Restricted to format chosen at ingestion time
Get the most out of your logs with LogQL

● Inspired by PromQL syntax for effortless correlations between Metrics and Logs.
● Build Metrics from Logs and unlock new use cases.
● Use your LogQL queries for creating advanced alerting rules.

Example query*:
    {app="nginx",instance="1.1.1.1"}            (Label matchers)
    != "Googlebot/"                             (Line filters)
    | json                                      (Parser)
    | request_time >= 100 and status == 200     (Label filters)

*Successful requests with a latency of 100 ms or more
(Googlebot requests excluded)
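The stages of that query can be read as a pipeline applied to each log line after the stream has been selected by its labels. The sketch below mimics the three later stages in plain Python; it is an illustration of the logic, not Loki's implementation, and it assumes each line is a JSON object with `request_time` and `status` fields. The function name `matches` and the sample lines are made up for the example.

```python
import json


def matches(line: str) -> bool:
    """Apply the line filter, parser and label filters to one log line."""
    # Line filter: != "Googlebot/"
    if "Googlebot/" in line:
        return False

    # Parser: | json  (lines that are not a JSON object are dropped here)
    try:
        fields = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(fields, dict):
        return False

    # Label filters: | request_time >= 100 and status == 200
    try:
        return float(fields["request_time"]) >= 100 and int(fields["status"]) == 200
    except (KeyError, TypeError, ValueError):
        return False


sample_lines = [
    '{"request_time": 250, "status": 200, "agent": "curl/8.0"}',
    '{"request_time": 250, "status": 200, "agent": "Googlebot/2.1"}',
    '{"request_time": 40, "status": 200, "agent": "curl/8.0"}',
    '{"request_time": 300, "status": 500, "agent": "curl/8.0"}',
    'plain text line',
]
kept = [line for line in sample_lines if matches(line)]
```

The cheap substring check runs before any parsing, which is the same ordering the query uses: filter on raw text first, then extract fields only from the lines that survive.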

Traces
Grafana Traces (powered by Tempo) provides an easy-to-use, highly scalable, and cost-effective distributed
tracing back-end. Not indexing the traces makes it possible to store orders of magnitude more trace data for
the same cost and removes the need for sampling.

● Scalable: Uses object storage to provide affordable long term storage of traces.
● Simple: Uses similar label format to Prometheus & Loki, making it easy to jump from a metric or log line into a trace.
● Integrated: Native integration with Grafana makes it easy to visualize traces. Accepts trace spans in multiple formats including Jaeger, OpenTelemetry, and Zipkin.
How to get started with distributed tracing?

● Instrument: Instrument your code using agents and libraries to generate spans for your services.
● Collect: Use tracing pipelines to collect, transform and enrich spans.
● Store: Store all the traces for querying and building more insights.
● Visualize: Use Grafana to detect and investigate service issues. Correlate your traces with metrics and log data.
Instrumentation - What are your options?
● A large collection of libraries and agents is available
○ SDKs provided by OpenTelemetry, Zipkin, Jaeger (deprecated), etc.
○ OpenTelemetry is becoming the de facto standard for distributed tracing.

● Manual instrumentation for maximum flexibility
○ Intrusive process - requires code changes.
○ Gives developers the opportunity to control the amount of generated data.
○ Available for all programming languages.

● Automatic instrumentation for ease of use
○ The easiest way to start with distributed tracing.
○ Available via runtime or interpreter for some programming languages or frameworks (e.g. .NET, Java, Ruby) or via eBPF for compiled languages (e.g. Go, C++).
○ Also supported by popular service meshes and proxies (e.g. Istio or Traefik).
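What "requires code changes" means in practice can be sketched as follows. This is a hand-rolled, standard-library-only illustration of the data a manually instrumented function produces; it is not the OpenTelemetry API, and `Span`, `start_span`, `FINISHED` and `handle_request` are hypothetical names. A real service would use an SDK and export the spans instead of keeping them in a list.

```python
import secrets
import time
from contextlib import contextmanager
from contextvars import ContextVar
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    start_ns: int
    end_ns: Optional[int] = None
    attributes: dict = field(default_factory=dict)


FINISHED = []  # stands in for an exporter
_current = ContextVar("current_span", default=None)


@contextmanager
def start_span(name, **attributes):
    """Open a span; nested calls become children within the same trace."""
    parent = _current.get()
    span = Span(
        name=name,
        trace_id=parent.trace_id if parent else secrets.token_hex(16),
        span_id=secrets.token_hex(8),
        parent_id=parent.span_id if parent else None,
        start_ns=time.time_ns(),
        attributes=dict(attributes),
    )
    token = _current.set(span)
    try:
        yield span
    finally:
        span.end_ns = time.time_ns()
        _current.reset(token)
        FINISHED.append(span)


def handle_request(path):
    # The developer chooses what to wrap and which attributes to record.
    with start_span("handle_request", path=path):
        with start_span("load_page"):
            time.sleep(0.001)


handle_request("/about")
```

Every `with start_span(...)` line is a deliberate code change, which is both the cost of manual instrumentation and the source of its control over how much data is generated.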
Collecting your traces - Set up your tracing pipeline
● Spans can be emitted to the backend directly from your code or via a collector
○ Multiple protocols are available, such as Jaeger, Zipkin or OTLP (OpenTelemetry).
○ Head-based sampling only.

● Collectors are optional but highly recommended
○ Buffer spans for efficient data transmission.
○ Ability to sample the flow of the traces (e.g. tail-based sampling).
○ Can be used to transform, remove and add tags.
○ Can generate ad-hoc metrics from spans.

● OTel collector
○ OpenTelemetry has become the de facto option for tracing.
○ The Grafana Agent embeds the OpenTelemetry Collector.
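The difference between the two sampling styles mentioned above can be sketched in a few lines. Head-based sampling decides from the trace ID alone, before anything is known about the outcome; tail-based sampling buffers the spans and decides once the trace is complete. The code is a simplified illustration, not a collector implementation: `head_sample`, `TailSampler`, the span dictionaries and the thresholds are all assumptions made for the example.

```python
import random
from collections import defaultdict


def head_sample(trace_id: str, ratio: float) -> bool:
    """Head-based: a deterministic decision from the hex trace ID alone."""
    return int(trace_id[:8], 16) / 0x100000000 < ratio


class TailSampler:
    """Tail-based: buffer spans per trace, decide when the trace is finished."""

    def __init__(self, latency_threshold_ms, baseline_ratio, rng=None):
        self.latency_threshold_ms = latency_threshold_ms
        self.baseline_ratio = baseline_ratio
        self.rng = rng or random.Random()
        self._buffer = defaultdict(list)

    def add_span(self, span):
        self._buffer[span["trace_id"]].append(span)

    def finish_trace(self, trace_id):
        """Return the spans to keep for this trace (empty list if dropped)."""
        spans = self._buffer.pop(trace_id, [])
        if not spans:
            return []
        has_error = any(s.get("error") for s in spans)
        is_slow = max(s["duration_ms"] for s in spans) >= self.latency_threshold_ms
        if has_error or is_slow or self.rng.random() < self.baseline_ratio:
            return spans
        return []


# baseline_ratio=0.0 keeps only the "interesting" traces in this demo.
sampler = TailSampler(latency_threshold_ms=500, baseline_ratio=0.0)
sampler.add_span({"trace_id": "a", "duration_ms": 20, "error": True})
sampler.add_span({"trace_id": "b", "duration_ms": 20})
sampler.add_span({"trace_id": "c", "duration_ms": 900})

kept = {tid: len(sampler.finish_trace(tid)) for tid in ("a", "b", "c")}
```

Tail-based sampling needs the buffering that a collector provides, which is one reason direct emission from code is limited to head-based decisions.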
How we do it
● Inexpensive to run
○ Leverage object-based storage.
○ No costly index store to maintain

● Designed for Grafana
○ Seamless navigation between observability pillars using Exemplars.

● Powerful
○ Fully distributed architecture.
○ Scale to support 100% of your traces.
○ Compatible with all standards (OTel, Zipkin, Jaeger, etc.)
Demo
