Mastering Distributed Tracing
Yuri Shkuro
BIRMINGHAM - MUMBAI
Mastering Distributed Tracing
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded
in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy
of the information presented. However, the information contained in this book is
sold without warranty, either express or implied. Neither the author, nor Packt
Publishing or its dealers and distributors, will be held liable for any damages
caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
ISBN 978-1-78862-846-4
www.packtpub.com
I dedicate this book to my family and my loving partner Yelena.
- Yuri Shkuro
mapt.io
Mapt is an online digital library that gives you full access to over 5,000 books
and videos, as well as industry leading tools to help you plan your personal
development and advance your career. For more information, please visit
our website.
Why subscribe?
• Spend less time learning and more time coding with practical eBooks
and Videos from over 4,000 industry professionals
• Learn better with Skill Plans built especially for you
• Get a free eBook or video every month
• Mapt is fully searchable
• Copy and paste, print, and bookmark content
Packt.com
Did you know that Packt offers eBook versions of every book published, with
PDF and ePub files available? You can upgrade to the eBook version at
www.Packt.com and as a print book customer, you are entitled to a discount on the eBook copy.
Get in touch with us at [email protected] for more details.
At www.Packt.com, you can also read a collection of free technical articles, sign
up for a range of free newsletters, and receive exclusive discounts and offers
on Packt books and eBooks.
Contributors
Yuri's open source credentials include being a co-founder of the OpenTracing project,
and the creator and the tech lead of Jaeger, a distributed tracing platform developed
at Uber. Both projects are incubating at the Cloud Native Computing Foundation.
Yuri serves as an invited expert on the W3C Distributed Tracing working group.
Dr. Yuri Shkuro holds a Ph.D. in Computer Science from University of Maryland,
College Park, and a Master's degree in Computer Engineering from MEPhI
(Moscow Engineering & Physics Institute), one of Russia's top three universities.
He is the author of many academic papers in the area of machine learning and
neural networks; his papers have been cited in over 130 other publications.
Outside of his academic and professional career, Yuri helped edit and produce
several animated shorts directed by Lev Polyakov, including Only Love (2008), which
screened at over 30 film festivals and won several awards, Piper the Goat and the Peace
Pipe (2005), a winner at the Ottawa International Animation Festival, and others.
I'd like to say thank you to many people who made this book possible:
my producer Andrew, who reached out and convinced me to pursue this
book; my editors Tom and Joanne, who reviewed and edited my drafts;
my technical reviewer Pavol, who provided many great suggestions
on improving the book; Ben Sigelman, who helped me to structure the
content and from whom I learned a lot about tracing in general; Lev
Polyakov, the author of the Jaeger project's adorable logo, who made
brilliant illustrations for this book; and most of all, my family and my
partner Yelena, who supported me and put up with me working on the
book on weekends over many months.
About the reviewer
Pavol Loffay is a software engineer at Red Hat working on Observability tools for
microservice architectures. He is an active maintainer of the Jaeger and OpenTracing
projects. He is also a member of the OpenTracing Specification Council (OTSC) and
a lead for the MicroProfile OpenTracing specification. In his free time, Pavol likes to
travel and he is a passionate skier and rock climber.
Lev Polyakov has been active in the animation world since 2004, starting as
an intern for Signe Baumane, one of New York's most prominent independent
animators, and proceeding to write and direct his own animated films. His
first short, Piper the Goat and the Peace Pipe, won first place at the 2005 Ottawa
Animation Festival. For his next film, Morning, Day, Evening, Night… and Morning
Again, Lev was awarded a grant and an honorary membership from the National
Board of Review of Motion Pictures. During his junior year at School of Visual Arts,
Lev directed and produced Only Love, a 15-minute animated short that premiered
at the prestigious Woodstock Film Festival, and has been shown at more than 30 film
festivals around the world, winning several first place awards.
Lev is currently the Chair of the Art and Technology Committee at the National
Arts Club in New York City.
Packt is searching for authors like you
If you're interested in becoming an author for Packt, please visit
authors.packtpub.com and apply today. We have worked with thousands of developers and tech
professionals, just like you, to help them share their insight with the global tech
community. You can make a general application, apply for a specific hot topic
that we are recruiting an author for, or submit your own idea.
Table of Contents
Preface
Part I: Introduction
Chapter 1: Why Distributed Tracing?
  Microservices and cloud-native applications
  What is observability?
  The observability challenge of microservices
  Traditional monitoring tools
    Metrics
    Logs
  Distributed tracing
  My experience with tracing
  Why this book?
  Summary
  References
Chapter 2: Take Tracing for a HotROD Ride
  Prerequisites
    Running from prepackaged binaries
    Running from Docker images
    Running from the source code
      Go language development environment
      Jaeger source code
  Start Jaeger
  Meet the HotROD
  The architecture
  The data flow
  Contextualized logs
  Span tags versus logs
  Identifying sources of latency
  Resource usage attribution
  Summary
  References
Chapter 3: Distributed Tracing Fundamentals
  The idea
  Request correlation
    Black-box inference
    Schema-based
    Metadata propagation
  Anatomy of distributed tracing
  Sampling
  Preserving causality
  Inter-request causality
  Trace models
    Event model
    Span model
  Clock skew adjustment
  Trace analysis
  Summary
  References
Preface
Distributed tracing, also known as end-to-end tracing, while not a new idea,
has recently begun receiving a lot of attention as a must-have observability tool for
complex distributed systems. Unlike most other tools that only monitor individual
components of the architecture, like a process or a server, tracing plays a rather
unique role by being able to observe end-to-end execution of individual requests,
or transactions, following them across process and network boundaries. With
the rise of such architectural patterns as microservices and functions-as-a-service
(or FaaS, or serverless), distributed tracing is becoming the only practical way
of managing the complexity of modern architectures.
The book you are about to read is based on my personal experiences of being
a technical lead for the tracing team at Uber Technologies. During that time, I have seen
the engineering organization grow from a few hundred to several thousand engineers,
and the complexity of Uber's microservices-based architecture increase from a
few hundred microservices when we first rolled out Jaeger, our distributed tracing
platform, to several thousand microservices we have today. As most practitioners of
distributed tracing would tell you, building a tracing system is "the easy part"; getting
it widely adopted in a large organization is a completely different challenge altogether,
one that unfortunately does not have easy-to-follow recipes. This book is my attempt
to provide an end-to-end overview of the problem space, including the history and
theoretical underpinning of the technology, the ways to address instrumentation
and organizational adoption challenges, the standards emerging in the industry
for instrumentation and data formats, and practical suggestions for deploying
and operating a tracing infrastructure in real world scenarios.
The book is not intended as a reference material or a tutorial for any particular
technology. Instead, I want you to gain an understanding of the underlying principles
and trade-offs of distributed tracing and its applications. Equipped with these
fundamentals, you should be able to navigate this fairly complex area of technology
and find effective ways to apply it to your own use cases and your systems.
• Application developers, SREs, and DevOps, who are the end users of distributed
tracing. This group is generally less interested in how tracing infrastructure
and instrumentation work; they are more interested in what the technology
can do for their day-to-day work. The book provides many examples of the
benefits of distributed tracing, from the simplest use cases of "let's look at
one trace and see what performance problems it can help us discover" to
advanced data mining scenarios of "how do we process the vast amounts
of tracing data we are collecting and gain insights into the behaviors of our
distributed system that cannot be inferred from individual transactions."
• Framework and infrastructure developers, who are building libraries and
tools for other developers and want to make those tools observable through
integration with distributed tracing. This group would benefit from the
thorough review of the instrumentation techniques and patterns, and the
discussion of the emerging standards for tracing.
• Engineering managers and executives, who have the "power of the purse"
and need to understand and be convinced of the value that tracing provides
to an organization.
• Finally, the tracing teams, that is, engineers tasked with building, deploying,
and operating tracing infrastructure in an organization. This group must deal
with many challenges, both technical and organizational, if it wants to scale
its technology and its own efforts to amplify the impact of tracing on the
organization at large.
Chapter 1, Why Distributed Tracing?, frames the observability problem that distributed
tracing aims to solve and explains why other monitoring tools fall short when it
comes to troubleshooting pathological behavior in complex distributed systems.
The chapter includes a brief history of my personal experience with tracing and
an explanation of why I felt that writing this book would be a useful contribution
to the industry.
Chapter 2, Take Tracing for a HotROD Ride, dives in with an easy-to-run, hands-on
example used to illustrate the core features, benefits, and capabilities of distributed
tracing, using Jaeger, an open source tracing platform, the OpenTracing
instrumentation, and a demo application HotROD (Rides on Demand).
Part II, Data Gathering Problem, is dedicated to discussions about the different
ways of getting tracing data out of the applications, through manual and automatic
(agent-based) instrumentation, for both RPC-style and asynchronous (for example,
using message queues) applications.
Chapter 7, Tracing with Service Mesh, uses the service mesh Istio, running on
Kubernetes, to trace an application and compare the results with tracing an
application that is natively instrumented for tracing via the OpenTracing API.
It reviews the pros and cons of each approach.
Chapter 8, All About Sampling, explains why tracing platforms are often required
to sample transactions and provides an in-depth review of different sampling
techniques, from consistent head-based sampling strategies (probabilistic, rate
limiting, adaptive, and so on) to the emerging favorite, tail-based sampling.
Part III, Getting Value from Tracing, talks about the different ways engineers
and organizations can benefit from adopting a distributed tracing solution.
Chapter 9, Turning the Lights On, gives examples of the core value proposition
of tracing, covering features that are commonly available in most tracing solutions,
such as service graphs, critical path analysis, performance analysis with trace
patterns, latency histograms and exemplars, and long-term profiling techniques.
Chapter 10, Distributed Context Propagation, steps back to discuss context propagation,
a technology that underpins most existing tracing infrastructures. It covers Tracing
Plane from Brown University, which implements a general-purpose, tool-agnostic
framework for context propagation, or "baggage," and reviews a number of useful
techniques and tools for observability and chaos engineering that have been built
on top of context propagation and tracing.
Chapter 11, Integration with Metrics and Logs, shows how all is not lost for traditional
monitoring tools, and how combining them with tracing infrastructure gives them
new capabilities and makes them more useful in microservices environments.
Chapter 12, Gathering Insights with Data Mining, begins with the basics of data mining
and feature extraction from tracing data, followed by a practical example involving
the Jaeger backend, Apache Kafka, Elasticsearch, Kibana, an Apache Flink data
mining job, and a microservices simulator, microsim. It ends with a discussion of
further evolution of data mining techniques, such as inferring and observing trends,
and historical and ad hoc data analysis.
Part IV, Deploying and Operating Tracing Infrastructure, completes the book with
an assortment of practical advice to the tracing teams about implementing and
operating tracing platforms in large organizations.
Chapter 14, Under the Hood of a Distributed Tracing System, starts with a brief
discussion of build versus buy considerations, then goes deep into many technical
details of the architecture and deployment modes of a tracing platform, such
as multi-tenancy, security, operation in multiple data centers, monitoring, and
resiliency. The Jaeger project is used to illustrate many architectural decisions,
yet overall the content is applicable to most tracing infrastructures.
The included exercises make heavy use of Docker and docker-compose to bring
up various third-party dependencies, such as MySQL and Elasticsearch databases,
Kafka and Zookeeper, and various observability tools like Jaeger, Kibana, Grafana,
and Prometheus. A working installation of Docker is required to run most of the
examples.
I strongly advise you to not only try running and playing with the provided
examples, but also to try adapting them to your own applications and use cases.
I have seen time and again how engineers find silly mistakes and inefficiencies
simply by looking at a sample trace from their application. It is often surprising
how much more visibility into the system behavior is provided by tracing.
If this is your first time dealing with this technology, then instrumenting your
own application, instead of running the provided abstract examples, is the most
effective way to learn and appreciate tracing.
Once the file is downloaded, please make sure that you unzip or extract the folder
using the latest version of your archive utility.
We also have other code bundles from our rich catalog of books and videos available
at https://github.com/PacktPublishing/. Check them out!
Conventions used
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names,
filenames, file extensions, pathnames, dummy URLs, user input, and Twitter
handles. For example: "Mount the downloaded WebStorm-10*.dmg disk image file as
another disk in your system."
A block of code is set as follows:
type SpanContext struct {
    traceID  TraceID
    spanID   SpanID
    parentID SpanID
    flags    byte
    baggage  map[string]string
    debugID  string
}
Any command-line input or output is written as follows:
$ go run ./exercise1/hello.go
Listening on http://localhost:8080/
Bold: Indicates a new term, an important word, or words that you see on the screen,
for example, in menus or dialog boxes, also appear in the text like this. For example:
"Select System info from the Administration panel."
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention
the book title in the subject of your message and email us at
[email protected].
Errata: Although we have taken every care to ensure the accuracy of our content,
mistakes do happen. If you have found a mistake in this book, we would be grateful
if you would report this to us. Please visit http://www.packt.com/submit-errata,
select your book, click on the Errata Submission Form link, and enter the
details.
Piracy: If you come across any illegal copies of our works in any form on the
Internet, we would be grateful if you would provide us with the location address or
website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have
expertise in and you are interested in either writing or contributing to a book,
please visit http://authors.packtpub.com.
Reviews
Please leave a review. Once you have read and used this book, why not leave a
review on the site that you purchased it from? Potential readers can then see and use
your unbiased opinion to make purchase decisions, we at Packt can understand what
you think about our products, and our authors can see your feedback on their book.
Thank you!
Part I: Introduction
Why Distributed Tracing?
In this chapter, I will talk about the challenges of monitoring and troubleshooting
distributed systems, including those built with microservices, and discuss how
and why distributed tracing is in a unique position among the observability tools
to address this problem. I will also describe my personal history with distributed
tracing and why I decided to write this book.
• Design for failure: The services are always expected to tolerate failures of
their dependencies and either retry the requests or gracefully degrade their
own functionality.
• Evolutionary design: Individual components of a microservices architecture
are expected to evolve independently, without forcing upgrades on the
components that depend on them.
These techniques enable loosely coupled systems that are resilient, manageable,
and observable. Combined with robust automation, they allow engineers to make
high-impact changes frequently and predictably with minimal toil."
At the time of writing, the list of graduated and incubating projects at CNCF [3]
contained 20 projects (Figure 1.1). They all have a single common theme: providing
a platform for efficient deployment and operation of cloud-native applications. The
observability tools occupy an arguably disproportionate (20 percent) number of slots:
CNCF sandbox projects, the third category not shown in Figure 1.1, include two more
monitoring-related projects: OpenMetrics and Cortex. Why is observability in such
high demand for cloud-native applications?
What is observability?
The term "observability" in control theory states that the system is observable if the
internal states of the system and, accordingly, its behavior, can be determined
by only looking at its inputs and outputs. At the 2018 Observability Practitioners
Summit [4], Bryan Cantrill, the CTO of Joyent and one of the creators of the tool
dtrace, argued that this definition is not practical to apply to software systems
because they are so complex that we can never know their complete internal
state, and therefore the control theory's binary measure of observability is always
zero (I highly recommend watching his talk on YouTube: https://youtu.be/
U4E0QxzswQc). Instead, a more useful definition of observability for a software
system is its "capability to allow a human to ask and answer questions". The
more questions we can ask and answer about the system, the more observable it is.
There are also many debates and Twitter zingers about the difference between
monitoring and observability. Traditionally, the term monitoring was used to
describe metrics collection and alerting. Sometimes it is used more generally
to include other tools, such as "using distributed tracing to monitor distributed
transactions." The definition by Oxford dictionaries of the verb "monitor" is "to
observe and check the progress or quality of (something) over a period of time;
keep under systematic review." However, it is better thought of as the process of
observing certain a priori defined performance indicators of our software system,
such as those measuring an impact on the end user experience, like latency or error
counts, and using their values to alert us when these signals indicate an abnormal
behavior of the system. Metrics, logs, and traces can all be used as a means to extract
those signals from the application. We can then reserve the term "observability"
for situations when we have a human operator proactively asking questions that
were not predefined. As Bryan Cantrill put it in his talk, this process is debugging,
and we need to "use our brains when debugging." Monitoring does not require
a human operator; it can and should be fully automated.
"If you want to talk about (metrics, logs, and traces) as pillars of observability–great.
The human is the foundation of observability!"
-- Bryan Cantrill
In the end, the so-called "three pillars of observability" (metrics, logs, and traces)
are just tools, or more precisely, different ways of extracting sensor data from the
applications. Even with metrics, the modern time series solutions like Prometheus,
InfluxDB, or Uber's M3 are capable of capturing the time series with many labels,
such as which host emitted a particular value of a counter. Not all labels may be
useful for monitoring, since a single misbehaving service instance in a cluster of
thousands does not warrant an alert that wakes up an engineer. But when we are
investigating an outage and trying to narrow down the scope of the problem, the
labels can be very useful as observability signals.
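To make the idea of labeled time series concrete, here is a minimal sketch using the Prometheus Go client library; the metric name, label names, and label values are made up for this example and are not part of any real system:

package main

import "github.com/prometheus/client_golang/prometheus"

// requestCounter is a counter partitioned by labels, so each unique
// combination of host and endpoint produces its own time series.
var requestCounter = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total number of HTTP requests handled.",
    },
    []string{"host", "endpoint"},
)

func init() {
    prometheus.MustRegister(requestCounter)
}

func recordRequest(host, endpoint string) {
    // Alerting may only look at the aggregate across hosts, but the
    // per-host label remains available when investigating an outage.
    requestCounter.WithLabelValues(host, endpoint).Inc()
}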
The picture is already so complex that we don't even have space to include the names
of the services (in the real Jaeger UI you can see them by moving the mouse over
nodes). Every time a user takes an action on the mobile app, a request is executed by
the architecture that may require dozens of different services to participate in order
to produce a response. Let's call the path of this request a distributed transaction.
So, what are the challenges of this design? There are quite a few:
When we see that some requests to our system are failing or slow, we want
our observability tools to tell us the story about what happens to that request.
We want to be able to ask questions like these:
Metrics
It goes like this: "Once upon a time…something bad happened. The end." How
do you like this story? This is what the chart in Figure 1.5 tells us. It's not completely
useless; we do see a spike and we could define an alert to fire when this happens.
But can we explain or troubleshoot the problem?
Figure 1.5: A graph of two time series representing (hypothetically) the volume of traffic to a service
Yet the same capacity for aggregation is what makes metrics ill-suited for explaining
the pathological behavior of the application. By aggregating data, we are throwing
away all the context we had about the individual transactions.
In Chapter 11, Integration with Metrics and Logs, we will talk about how integration
with tracing and context propagation can make metrics more useful by providing
them with the lost context. Out of the box, however, metrics are a poor tool to
troubleshoot problems within microservices-based applications.
Logs
Logging is an even more basic observability tool than metrics. Every programmer
learns their first programming language by writing a program that prints (that is,
logs) "Hello, World!" Similar to metrics, logs struggle with microservices because
each log stream only tells us about a single instance of a service. However, the
evolving programming paradigm creates other problems for logs as a debugging
tool. Ben Sigelman, who built Google's distributed tracing system Dapper [7],
explained it in his KubeCon 2016 keynote talk [8] as four types of concurrency
(Figure 1.6):
Years ago, applications like early versions of Apache HTTP Server handled
concurrency by forking child processes and having each process handle a single
request at a time. Logs collected from that single process could do a good job
of describing what happened inside the application.
In order to reconstruct the flight of the request from the many log streams, we
need powerful logs aggregation technology and a distributed context propagation
capability to tag all those logs in different processes with a unique request id that
we can use to stitch those requests together. We might as well be using the real
distributed tracing infrastructure at this point! Yet even after tagging the logs
with a unique request id, we still cannot assemble them into an accurate sequence,
because the timestamps from different servers are generally not comparable due to
clock skews. In Chapter 11, Integration with Metrics and Logs, we will see how tracing
infrastructure can be used to provide the missing context to the logs.
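As an illustration of the idea only (this is not how any particular log aggregation product works), here is a minimal Go sketch that assigns a request ID and attaches it to every log line via the context; the X-Request-Id header name and the github.com/google/uuid package are arbitrary choices for the example:

package main

import (
    "context"
    "log"
    "net/http"

    "github.com/google/uuid"
)

type requestIDKey struct{}

// withRequestID is an HTTP middleware that reuses an incoming request ID,
// or generates a new one, and stores it in the request context.
func withRequestID(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        id := r.Header.Get("X-Request-Id")
        if id == "" {
            id = uuid.New().String()
        }
        ctx := context.WithValue(r.Context(), requestIDKey{}, id)
        next.ServeHTTP(w, r.WithContext(ctx))
    })
}

// logf tags every log statement with the request ID from the context,
// so that a log aggregator can later group lines by request.
func logf(ctx context.Context, format string, args ...interface{}) {
    id, _ := ctx.Value(requestIDKey{}).(string)
    log.Printf("request_id=%s "+format, append([]interface{}{id}, args...)...)
}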
Distributed tracing
As soon as we start building a distributed system, traditional monitoring tools begin
struggling with providing observability for the whole system, because they were
designed to observe a single component, such as a program, a server, or a network
switch. The story of a single component may no doubt be very interesting, but it tells
us very little about the story of a request that touches many of those components. We
need to know what happens to that request in all of them, end-to-end, if we want to
understand why a system is behaving pathologically. In other words, we first want
a macro view.
At the same time, once we get that macro view and zoom in to a particular
component that seems to be at fault for the failure or performance problems with
our request, we want a micro view of what exactly happened to that request in that
component. Most other tools cannot tell that to us either because they only observe
what "generally" happens in the component as a whole, for example, how many
requests per second it handles (metrics), what events occurred on a given thread
(logs), or which threads are on and off CPU at a given point in time (profilers).
They don't have the granularity or context to observe a specific request.
Figure 1.7: Jaeger UI view of a single request to the HotROD application, further discussed
in chapter 2. In the bottom half, one of the spans (named GetDriver from service redis,
with a warning icon) is expanded to show additional information, such as tags and span logs.
Figure 1.8: Jaeger UI view of two traces A and B being compared structurally in the graph form
(best viewed in color). Light/dark green colors indicate services that were encountered more/only
in trace B, and light/dark red colors indicate services encountered more/only in trace A.
One of the observability challenges with the system was that each trade had to go
through a complicated sequence of additional changes, matching, and confirmation
flows, implemented by the different components of the system.
To give us visibility into the various state transitions of the individual trades, we
used an APM vendor (now defunct) that was essentially implementing a distributed
tracing platform. Unfortunately, our experience with that technology was not
particularly stellar, with the main challenge being the difficulty of instrumenting
our applications for tracing, which involved creating aspect-oriented programming
(AOP)-style instructions in the XML files and trying to match on the signatures of the
internal APIs. The approach was very fragile, as changes to the internal APIs would
cause the instrumentation to become ineffective, without good facilities to enforce it
via unit testing. Getting instrumentation into existing applications is one of the main
difficulties in adopting distributed tracing, as we will discuss in this book.
When I joined Uber in mid-2015, the engineering team in New York had only
a handful of engineers, and many of them were working on the metrics system,
which later became known as M3. At the time, Uber was just starting its journey
towards breaking the existing monolith and replacing it with microservices. The
Python monolith, appropriately called "API", was already instrumented with another
home-grown tracing-like system called Merckx.
The major shortcoming of Merckx was that its design dated back to the days of the
monolithic application. It lacked any concept of distributed context propagation. It recorded
SQL queries, Redis calls, and even calls to other services, but there was no way to
go more than one level deep. It also stored the existing in-process context in a global,
thread-local storage, and when many new Python microservices at Uber began
adopting the event-loop-based framework Tornado, the propagation mechanism
in Merckx was unable to represent the state of many concurrent requests running
on the same thread. By the time I joined Uber, Merckx was in maintenance mode,
with hardly anyone working on it, even though it had active users. Given the
new observability theme of the New York engineering team, I, along with another
engineer, Onwukike Ibe, took the mantle of building a fully-fledged distributed
tracing platform.
I had no experience with building such systems in the past, but after reading the
Dapper paper from Google, it seemed straightforward enough. Plus, there was
already an open source clone of Dapper, the Zipkin project, originally built by
Twitter. Unfortunately, Zipkin did not work for us out of the box.
In 2014, Uber started building its own RPC framework called TChannel. It did not
really become popular in the open source world, but when I was just getting started
with tracing, many services at Uber were already using that framework for inter-
process communications. The framework came with tracing instrumentation built-in,
even natively supported in the binary protocol format. So, we already had traces
being generated in production, only nothing was gathering and storing them.
Having a working tracing backend, however, was only half of the battle.
Although TChannel was actively used by some of the newer services, many more
existing services were using plain JSON over HTTP, utilizing many different HTTP
frameworks in different programming languages. In some of the languages, for
example, Java, TChannel wasn't even available or mature enough. So, we needed
to solve the same problem that made our tracing experiment at Morgan Stanley
fizzle out: how to get tracing instrumentation into hundreds of existing services,
implemented with different technology stacks.
As luck would have it, I was attending one of the Zipkin Practitioners workshops
organized by Adrian Cole from Pivotal, the lead maintainer of the Zipkin project,
and that same exact problem was on everyone's mind. Ben Sigelman, who founded
his own observability company Lightstep earlier that year, was at the workshop
too, and he proposed to create a project for a standardized tracing API that could
be implemented by different tracing vendors independently, and could be used to
create completely vendor-neutral, open source, reusable tracing instrumentation
for many existing frameworks and drivers. We brainstormed the initial design of
the API, which later became the OpenTracing project [10] (more on that in Chapter 6,
Tracing Standards and Ecosystem). All examples in this book use the OpenTracing
APIs for instrumentation.
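To give a flavor of what such instrumentation looks like, here is a minimal sketch using the OpenTracing Go API; the operation name and tag values are illustrative, and a concrete tracer implementation (for example, a Jaeger tracer) still has to be registered as the global tracer for the spans to go anywhere:

package main

import (
    "github.com/opentracing/opentracing-go"
    otlog "github.com/opentracing/opentracing-go/log"
)

// handleRequest starts a span as a child of the caller's span context,
// annotates it, and finishes it when the work is done.
func handleRequest(parent opentracing.SpanContext) {
    span := opentracing.GlobalTracer().StartSpan(
        "handle-request",
        opentracing.ChildOf(parent),
    )
    defer span.Finish()

    span.SetTag("component", "example-service")
    span.LogFields(otlog.String("event", "request handled"))
}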
The evolution of the OpenTracing APIs, which is still ongoing, is a topic for another
story. Yet even the initial versions of OpenTracing gave us the peace of mind
that if we started adopting it on a large scale at Uber, we were not going to lock
ourselves into a single implementation. Having different vendors and open source
projects participating in the development of OpenTracing was very encouraging.
We implemented Jaeger-specific, fully OpenTracing-compatible tracing libraries
in several languages (Java, Go, Python, and Node.js), and started rolling them
out to Uber microservices. Last time I checked, we had close to 2,400 microservices
instrumented with Jaeger.
I have been working in the area of distributed tracing ever since. The Jaeger project
has grown and matured. Eventually, we replaced the Zipkin UI with Jaeger's own,
more modern UI built with React, and in April 2017, we open sourced all of Jaeger,
from client libraries to the backend components.
In the fall of 2017, Jaeger was accepted as an incubating project to CNCF, following
in the footsteps of the OpenTracing project. Both projects are very active, with
hundreds of contributors, and are used by many organizations around the
world. The Chinese giant Alibaba even offers hosted Jaeger as part of its Alibaba
Cloud services [12]. I probably spend 30-50% of my time at work collaborating
with contributors to both projects, including code reviews for pull requests and
new feature designs.
In early 2018, I realized that I had pretty good answers to these questions,
while most people who were just starting to look into tracing still didn't,
and no comprehensive guide has been published anywhere. Even the basic
instrumentation steps are often confusing to people if they do not understand
the underlying concepts, as evidenced by the many questions posted in the Jaeger
and OpenTracing chat rooms.
When I gave the OpenTracing tutorial at the Velocity NYC conference in 2017,
I created a GitHub repository that contained step-by-step walkthroughs for
instrumentation, from a basic "Hello, World!" program to a small microservices-
based application. The tutorials were repeated in several programming languages
(I originally created ones for Java, Go, and Python, and later other people created
more, for Node.js and C#). I have seen time and again how these simple
tutorials help people to learn the ropes.
So, I was thinking, maybe I should write a book that would cover not just the
instrumentation tutorials, but give a comprehensive overview of the field, from its
history and fundamentals to practical advice about where to start and how to get the
most benefits from tracing. To my surprise, Andrew Waldron from Packt Publishing
reached out to me offering to do exactly that. The rest is history, or rather, this book.
One aspect that made me reluctant to start writing was the fact that the boom of
microservices and serverless created a big gap in the observability solutions that can
address the challenges posed by these architectural styles, and tracing is receiving
a lot of renewed interest, even though the basic idea of distributed tracing systems
is not new. Accordingly, there are a lot of changes happening in this area, and there
was a risk that anything I wrote would quickly become obsolete. It is possible that in
the future, OpenTracing might be replaced by some more advanced API. However,
the thought that made me push through was that this book is not about OpenTracing
or Jaeger. I use them as examples because they are the projects that are most familiar
to me. The ideas and concepts introduced throughout the book are not tied to these
projects. If you decide to instrument your applications with Zipkin's Brave library,
or with OpenCensus, or even with some vendor's proprietary API, the fundamentals
of instrumentation and distributed tracing mechanics are going to be the same, and
the advice I give in the later chapters about practical applications and the adoption
of tracing will still apply equally.
Summary
In this chapter, we took a high-level look at observability problems created by
the new popular architectural styles, microservices and FaaS, and discussed why
traditional monitoring tools are failing to fill this gap, whereas distributed tracing
provides a unique way of getting both a macro and micro view of the system
behavior when it executes individual requests.
I have also talked about my own experience and history with tracing, and why
I wrote this book as a comprehensive guide for the many engineers coming to the field
of tracing.
In the next chapter, we are going to take a hands-on deep dive into tracing, by
running a tracing backend and a microservices-based demo application. It will
complement the claims made in this introduction with concrete examples of the
capabilities of end-to-end tracing.
References
1. Martin Fowler, James Lewis. Microservices: a definition of this new architectural
term: https://www.martinfowler.com/articles/microservices.html.
2. Cloud Native Computing Foundation (CNCF) Charter:
https://github.com/cncf/foundation/blob/master/charter.md.
3. CNCF projects: https://www.cncf.io/projects/.
4. Bryan Cantrill. Visualizing Distributed Systems with Statemaps. Observability
Practitioners Summit at KubeCon/CloudNativeCon NA 2018, December 10:
https://sched.co/HfG2.
5. Vijay Gill. The Only Good Reason to Adopt Microservices:
https://lightstep.com/blog/the-only-good-reason-to-adopt-microservices/.
6. Global Microservices Trends Report:
https://go.lightstep.com/global-microservices-trends-report-2018.
7. Benjamin H. Sigelman, Luiz A. Barroso, Michael Burrows, Pat Stephenson,
Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag.
Dapper, a large-scale distributed system tracing infrastructure. Technical Report
dapper-2010-1, Google, April 2010.
8. Ben Sigelman. Keynote: OpenTracing and Containers: Depth, Breadth, and
the Future of Tracing. KubeCon/CloudNativeCon North America, 2016,
Seattle: https://sched.co/8fRU.
9. Yuri Shkuro. Evolving Distributed Tracing at Uber Engineering. Uber Eng
Blog, February 2, 2017: https://eng.uber.com/distributed-tracing/.
Take Tracing for a HotROD Ride
Prerequisites
All relevant screenshots and code snippets are included in this chapter, but you
are strongly encouraged to try running the example and explore the features of
the web UIs, in order to better understand the capabilities of distributed tracing
solutions like Jaeger.
Both the Jaeger backend and the demo application can be run as downloadable
binaries for macOS, Linux, and Windows, as Docker containers, or directly from the
source code. Since Jaeger is an actively developed project, by the time you read this
book, some of the code organization or distributions may have changed. To ensure
you are following the same steps as described in this chapter, we are going to use
Jaeger version 1.6.0, released in July 2018.
x jaeger-1.6.0-darwin-amd64/jaeger-standalone
x jaeger-1.6.0-darwin-amd64/jaeger-agent
x jaeger-1.6.0-darwin-amd64/jaeger-collector
This archive includes the production-grade binaries for the Jaeger backend, namely
jaeger-query, jaeger-agent, and jaeger-collector, which we will not use in
this chapter. We only need the all-in-one packaging of the Jaeger backend jaeger-
standalone, which combines all the backend components into one executable,
with no additional dependencies.
The Jaeger backend listens on half a dozen different ports, so if you run into port
conflicts, you may need to find out which other software listens on the same ports
and shut it down temporarily. The risk of port conflicts is greatly reduced if you
run Jaeger all-in-one as a Docker container.
You may first see some Docker output as it downloads the container images,
followed by the program's output:
{"gitCommit":"77a057313273700b8a1c768173a4c663ca351907","GitVersion":"v1.
6.0","BuildDate":"2018-07-10T16:23:52Z"}
Alternatively, you can download the source code bundle from the Jaeger release
page (https://github.com/jaegertracing/jaeger/releases/tag/v1.6.0/),
and make sure that the code is extracted into the
$GOPATH/src/github.com/jaegertracing/jaeger/ directory.
After installing glide, run it to download the libraries that Jaeger depends on:
$ cd $GOPATH/src/github.com/jaegertracing/jaeger/
$ glide install
Now you should be able to build and run the HotROD binary:
$ go run ./examples/hotrod/main.go help
HotR.O.D. - A tracing demo application.
Usage:
jaeger-demo [command]
Available Commands:
all Starts all services
It is also possible to run the Jaeger backend from the source code. However,
it requires an additional setup of Node.js in order to compile the static assets for
the UI, which may not even work on an OS like Windows, so I do not recommend
it for this chapter's examples.
Start Jaeger
Before we run the demo application, let's make sure we can run the Jaeger backend
to collect the traces, as otherwise we might get a lot of error logs. A production
installation of the Jaeger backend would consist of many different components,
including some highly scalable databases like Cassandra or Elasticsearch. For
our experiments, we do not need that complexity or even the persistence layer.
Fortunately, the Jaeger distribution includes a special component called all-in-one
just for this purpose. It runs a single process that embeds all other components of
a normal Jaeger installation, including the web user interface. Instead of a persistent
storage, it keeps all traces in memory.
If you are using Docker, you can run Jaeger all-in-one with the following command:
$ docker run -d --name jaeger \
-p 6831:6831/udp \
-p 16686:16686 \
-p 14268:14268 \
jaegertracing/all-in-one:1.6
The -d flag makes the process run in the background, detached from the terminal.
The --name flag sets a name by which this process can be located by other Docker
containers. We also use the -p flag to expose three ports on the host network that
the Jaeger backend is listening to.
The first port, 6831/udp, is used to receive tracing data from applications
instrumented with Jaeger tracers, and the second port, 16686, is where we can
find the web UI. We also map the third port, 14268, in case we have issues with
UDP packet limits and need to use HTTP transport for sending traces (discussed
as follows).
The process listens to other ports as well, for example, to accept traces in other
formats, but they are not relevant for our exercise. Once the container starts,
open http://127.0.0.1:16686/ in the browser to access the UI.
If you chose to download the binaries instead of Docker images, you can run the
executable named jaeger-standalone, without any arguments, which will listen on
the same ports. jaeger-standalone is the binary used to build the jaegertracing/
all-in-one Docker image (in the later versions of Jaeger, it has been renamed
jaeger-all-in-one).
$ cd jaeger-1.6.0-darwin-amd64/
$ ./jaeger-standalone
[... skipped ...]
{"msg":"Starting agent"}
{"msg":"Starting jaeger-collector TChannel server","port":14267}
{"msg":"Starting jaeger-collector HTTP server","http-port":14268}
[... skipped ...]
{"msg":"Starting jaeger-query HTTP server","port":16686}
{"msg":"Health Check state change","status":"ready"}
[... skipped ...]
We removed some fields of the log statements (level, timestamp, and caller)
to improve readability.
Since the all-in-one binary runs the Jaeger backend with an in-memory database,
which is initially empty, there is not much to see in the UI right away. However,
the Jaeger backend has self-tracing enabled, so if we reload the home page a few
times, we will see the Services dropdown in the top-left corner display jaeger-query,
which is the name of the microservice running the UI component. We can now hit
the Search button to find some traces, but let's first run the demo application to get
more interesting traces.
If we are running both the Jaeger all-in-one and the HotROD application from the
binaries, they bind their ports directly to the host network and are able to find each
other without any additional configuration, due to the default values of the flags.
Sometimes users experience issues with getting traces from the HotROD
application due to the default UDP settings in the OS. Jaeger client libraries batch
up to 65,000 bytes per UDP packet, which is still a safe number to send via the
loopback interface (that is, localhost) without packet fragmentation. However,
macOS, for example, has a much lower default for the maximum datagram size.
Rather than adjusting the OS settings, another alternative is to use the HTTP
protocol between Jaeger clients and the Jaeger backend. This can be done by
passing the following flag to the HotROD application:
--jaeger-agent.host-port=http://localhost:14268/api/traces
Once the HotROD process starts, the logs written to the standard output will show
the microservices starting several servers on different ports (for better readability,
we removed the timestamps and references to the source files):
INFO Starting all services
INFO Starting {"service": "route", "address":
"http://127.0.0.1:8083"}
INFO Starting {"service": "frontend", "address":
"http://127.0.0.1:8080"}
INFO Starting {"service": "customer", "address":
"http://127.0.0.1:8081"}
INFO TChannel listening {"service": "driver", "hostPort":
"127.0.0.1:8082"}
We have four customers, and by clicking one of the four buttons, we summon
a car to arrive at the customer's location, perhaps to pick up a product and deliver
it elsewhere. Once a request for a car is sent to the backend, it responds with the
car's license plate number, T757183C, and the expected time of arrival of two
minutes:
1. In the top-left corner, there is a web client id: 6480. This is a random session
ID assigned by the JavaScript UI. If we reload the page, we get a different
session ID.
2. In the brackets after the car information, we see a request ID, req: 6480-1.
This is a unique ID assigned by the JavaScript UI to each request it makes
to the backend, composed of the session ID and a sequence number.
3. The last bit of debugging data, latency: 772ms, is measured by the
JavaScript UI and shows how long the backend took to respond.
The architecture
Now that we have seen what the HotROD application does, we may want to know how
it is architected. After all, maybe all those servers we saw in the logs are just for show,
and the whole application is simply a JavaScript frontend. Rather than asking someone
for a design document, wouldn't it be great if our monitoring tools could build the
architecture diagram automatically, by observing the interactions between the services?
That's exactly what distributed tracing systems like Jaeger can do. That request for
a car we executed earlier has provided Jaeger with enough data to connect the dots.
Let's go to the Dependencies page in the Jaeger UI. At first, we will see a tiny diagram
titled Force Directed Graph, but we can ignore it, as that particular view is really
designed for showing architectures that contain hundreds or even thousands of
microservices. Instead, click on the DAG tab (Directed Acyclic Graph), which
shows an easier-to-read graph. The graph layout is non-deterministic, so your view
may have the second-level nodes in a different order than in the following screenshot:
As it turns out, the single HotROD binary is actually running four microservices
and, apparently, two storage backends: Redis and MySQL. The storage nodes are
not actually real: they are simulated by the application as internal components, but
the top four microservices are indeed real. We saw each of them logging a network
address of the servers they run. The frontend microservice serves the JavaScript
UI and makes RPC calls to the other three microservices.
The graph also shows the number of calls that were made to handle the single
request for a car, for example, the route service was called 10 times, and there were
14 calls to Redis.
Figure 2.5: Results of searching for all traces in the last hour from the service frontend
The system found two traces and displayed some metadata about them, such as the
names of different services that participated in the traces, and the number of spans
each service emitted to Jaeger. We will ignore the second trace that represents the
request to load the JavaScript UI and focus on the first trace, named frontend: HTTP
GET /dispatch. This name is a concatenation of the service name frontend and the
operation name of the top-level span, in this case HTTP GET /dispatch.
On the right side, we see that the total duration of the trace was 757.73ms. This
is shorter than the 772ms we saw in the HotROD UI, which is not surprising because
the latter was measured from the HTTP client side by JavaScript, while the former
was reported by the Go backend. The 14.27ms difference between these numbers
can be attributed to the network latency. Let's click on the trace title bar.
Figure 2.6: Trace timeline view. At the top is the name of the trace, which is combined from the
service name and the operation name of the root span. On the left is the hierarchy of calls between
microservices, as well as within microservices (internal operations can also be represented as spans).
The calls from the frontend service to the route service are collapsed to save space. Some of the calls
from the driver service to Redis have red circles with white exclamation points in them, indicating an
error in the operation. On the right side is the Gantt chart showing spans on the horizontal timeline.
The Gantt chart is interactive and clicking on a span can reveal additional information.
The timeline view shows a typical view of a trace as a time sequence of nested
spans, where a span represents a unit of work within a single service. The top-level
span, also called the root span, represents the handling of the main HTTP request
from the JavaScript UI by the frontend service (server span), which in turn called
the customer service, which in turn called a MySQL database. The width of the
spans is proportional to the time a given operation took. This may represent
a service doing some work or waiting on a downstream call.
From this view, we can see how the application handles a request:
1. The frontend service receives the external HTTP GET request at its
/dispatch endpoint.
2. The frontend service makes an HTTP GET request to the /customer
endpoint of the customer service.
3. The customer service executes a SELECT SQL statement in MySQL. The results
are returned back to the frontend service.
4. Then the frontend service makes an RPC request, Driver::findNearest,
to the driver service. Without drilling more into the trace details, we cannot
tell which RPC framework is used to make this request, but we can guess it is
not HTTP (it is actually made over TChannel [1]).
5. The driver service makes a series of calls to Redis. Some of those calls show
a red circle with an exclamation point, indicating failures.
6. After that, the frontend service executes a series of HTTP GET requests
to the /route endpoint of the route service.
7. Finally, the frontend service returns the result to the external caller
(for example, the UI).
We can tell all of this pretty much just by looking at the high-level Gantt chart
presented by the end-to-end tracing tool.
Contextualized logs
We now have a pretty good idea about what the HotROD application does, if
not exactly how it does it. For example, why does the frontend service call the
/customer endpoint of the customer service? Of course, we can look at the source
code, but we are trying to approach this from the point of view of application
monitoring. One direction we could take is to look at the logs the application
writes to its standard output (Figure 2.7).
It is quite difficult to follow the application logic from these logs, even though
we are only looking at the logs produced while the application executed a single request.
We are also lucky that the logs from four different microservices are combined in
a more-or-less consistent stream. Imagine many concurrent requests going through
the system and through microservices running in different processes! The logs would
become nearly useless in that case. So, let's take a different approach. Let's view the
logs collected by the tracing system. For example, click on the root span to expand
it and then click on the Logs (18) section to expand and see the logs (18 refers to the
number of log statements captured in this span). These logs give us more insight into
what the /dispatch endpoint was doing (Figure 2.8):
Figure 2.8: Logs recorded by the tracing system in the root span.
The hostname is masked in all screenshots for privacy.
Let's close the root span and open another one; specifically, one of the failed calls to
Redis (Figure 2.9). The span has a tag error=true, which is why the UI highlighted
it as failed. The log statement explains the nature of the error as "Redis timeout." The
log also includes the driver_id that the driver service was attempting to retrieve
from Redis. All these details may provide very useful information during debugging.
Figure 2.9: Expanded span details after clicking on a failed GetDriver span in the redis service,
which is marked with a white exclamation point in a red circle. The log entry explains that this
was a Redis timeout and indicates which driver ID was queried from the database.
The distinct feature of a tracing system is that it only shows the logs that happened
during the execution of a given request. We call these logs contextualized because
they are captured not only in the context of a specific request, but also in the context
of a specific span within the trace for that request.
In the traditional log output, these log statements would have been mixed with a lot
of other statements from parallel requests, but in the tracing system, they are neatly
isolated to the service and span where they are relevant. Contextualized logs allow
us to focus on the behavior of the application, without worrying about logs from
other parts of the program or from other concurrent requests.
As we can see, using a combination of a Gantt chart, span tags, and span logs, the
end-to-end tracing tool lets us easily understand the architecture and data flow of
the application, and enables us to zoom in on the details of individual operations.
Figure 2.10: Two more spans expanded to show a variety of tags and logs.
Each span also contains a section called "Process" that also looks like a collection of tags.
The process tags describe the application that was producing the tracing record, rather than an individual span.
In the customer span, we can see a tag http.url that shows that the request at the
/customer endpoint had a parameter customer=123, as well as two logs narrating
the execution during that span. In the mysql span, we see an sql.query tag
showing the exact SQL query that was executed: SELECT * FROM customer
WHERE customer_id=123, and a log about acquiring some lock.
What is the difference between a span tag and a span log? They are both annotating
the span with some contextual information. Tags typically apply to the whole span,
while logs represent some events that happened during the span execution. A log
always has a timestamp that falls within the span's start-end time interval. The
tracing system does not explicitly track causality between logged events the way
it keeps track of causality relationships between spans, because it can be inferred
from the timestamps.
An astute reader will notice that the /customer span records the URL of the request
twice, in the http.url tag and in the first log. The latter is actually redundant but
was captured in the span because the code logged this information using the normal
logging facility, which we will discuss later in this chapter.
The OpenTracing Specification [3] defines semantic data conventions that prescribe
certain well-known tag names and log fields for common scenarios. Instrumentation
is encouraged to use those names to ensure that the data reported to the tracing
system is well defined and portable across different tracing backends.
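For example, the opentracing-go library ships an ext package with helpers for these well-known tag names. The following sketch is illustrative and is not taken from the HotROD code:

package main

import (
	"net/http"

	"github.com/opentracing/opentracing-go"
	"github.com/opentracing/opentracing-go/ext"
)

// tagOutboundCall sets tags using the standard names from the semantic
// conventions via the helpers in the ext package.
func tagOutboundCall(span opentracing.Span, req *http.Request, err error) {
	ext.SpanKindRPCClient.Set(span)
	ext.HTTPMethod.Set(span, req.Method)
	ext.HTTPUrl.Set(span, req.URL.String())
	if err != nil {
		ext.Error.Set(span, true) // the same error=true tag we saw on the failed Redis span
	}
}

func main() {
	span := opentracing.NoopTracer{}.StartSpan("HTTP GET /customer")
	defer span.Finish()

	req, _ := http.NewRequest("GET", "http://localhost:8080/customer?customer=123", nil)
	tagOutboundCall(span, req, nil)
}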
Figure 2.11: Recognizing the sources of latency. The call to mysql appears to be on the critical path and takes
almost 40% of the trace time, so clearly it is a good target for some optimization. The calls from the driver
service to Redis look like a staircase, hinting at a strictly sequential execution that perhaps can be done
in parallel to expedite the middle part of the trace.
Figure 2.12: Recognizing sources of latency (continued). This screenshot is taken after we used
a zoom-in feature in the mini-map to only look at the last 200ms of the trace (by dragging the mouse
horizontally across the area of interest). It is easy to see that the requests from the frontend service to
the route service are done in parallel, but no more than three requests at a time. Red arrows point out
how as soon as one request ends, another one starts. This pattern indicates some sort of contention,
most likely a worker pool that only has three workers.
Figure 2.13: Executing many requests simultaneously shows increasing latency of the responses
As we can see, the more requests that are being processed concurrently, the
longer it takes for the backend to respond. Let's take a look at the trace of the
longest request. We could do it in two ways. We can simply search for all traces
and pick the one with the highest latency, represented by the longest cyan-colored
title bar (Figure 2.14):
Figure 2.14: Multiple traces returned in the search results, sorted by most recent first
Another way is to search by tags or logs on the span. The root span emits a final
log, where it records the license plate number of the closest car as one of the log fields:
Figure 2.15: License plate number T796774C recorded in one of the log events as the field driver.
Each log entry can be individually expanded to show fields in a table, as opposed to a single row.
The Jaeger backend indexes all spans by both tags and log fields, and we
can find that trace by specifying driver=T796774C in the Tags search box:
Figure 2.16: Searching for a single trace by a field in a log entry: driver=T796774C
This trace took 1.43 seconds, about 90% longer than our first trace, which took
only 757ms (measured from the server side). Let's open it and investigate what is
different:
Figure 2.17: Higher-latency trace. The database query (mysql span) took 1s,
significantly longer than the 300ms or so that it took when only a single request
was processed by the application.
The most apparent difference is that the database query (the mysql span) takes
a lot longer than before: 1s instead of 323ms. Let's expand that span and try to
find out why:
In the log entries of the span, we see that execution was blocked waiting for a lock for
more than 700ms. This is clearly a bottleneck in the application, but before we dive
into that, let's look at the first log record, evidently emitted before getting blocked on
the lock: Waiting for lock behind 3 transactions. Blockers=[6480-4 6480-
5 6480-6]. It tells us how many other requests were already queued for this lock,
and even gives us the identity of those requests. It is not too hard to imagine a lock
implementation that keeps track of how many goroutines are blocked, but where
would it get the identity of the requests?
If we expand the previous span for the customer service, we can see that the only
data passed to it via an HTTP request was the customer ID 392. In fact, if we inspect
every span in the trace, we will not find any remote call where the request ID, like
6480-5, was passed as a parameter.
Figure 2.19: The parent span of the database call expanded. It represents an HTTP call from the
frontend service to the customer service. The Tags section of the span is expanded to show tags
in a tabular format. The http.url tag shows that customer=392 was the only parameter passed by the
caller to the HTTP endpoint.
This magic appearance of blocking request IDs in the logs is due to a custom
instrumentation in HotROD that makes use of a distributed context propagation
mechanism, which is called baggage in the OpenTracing API.
In our example, knowing the identities of the requests stuck in the queue ahead
of our slow request allows us to find traces for those requests, and analyze them
as well. In real production systems, this could lead to unexpected discoveries, such
as a long-running request spoiling a lot of other requests that are normally very fast.
Later in this chapter, we will see another example of using baggage.
Now that we know that the mysql call gets stuck on a lock, we can easily fix it.
As we mentioned earlier, the application does not actually use the MySQL database,
just a simulation of it, and the lock is meant to represent a single database connection
shared between multiple goroutines. We can find the code in the file examples/
hotrod/services/customer/database.go:
if !config.MySQLMutexDisabled {
    // simulate misconfigured connection pool that only gives
    // one connection at a time
    d.lock.Lock(ctx)
    defer d.lock.Unlock()
}
If the locking behavior is not disabled via configuration, we acquire the lock before
simulating the SQL query delay. The statement defer d.lock.Unlock() releases
the lock when the surrounding function returns.
Notice how we pass the ctx parameter to the lock object. context.Context is
a standard way in Go to pass request-scoped data throughout an application.
The OpenTracing span is stored in the Context, which allows the lock to inspect it
and retrieve the request ID, assigned by the JavaScript UI, from the baggage.
The code for this custom implementation of a mutex can be found in the source file
examples/hotrod/pkg/tracing/mutex.go.
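The following is only a simplified sketch of the idea behind such an instrumented mutex, not the actual HotROD implementation; the baggage key "request" is an assumption made for this example:

package tracing

import (
	"context"
	"sync"

	"github.com/opentracing/opentracing-go"
)

// Mutex is a sketch of a lock that knows which requests are waiting for it.
// The request identifier is read from the baggage of the span stored in the
// context.
type Mutex struct {
	realLock sync.Mutex
	mapLock  sync.Mutex
	waiters  []string
}

func (m *Mutex) Lock(ctx context.Context) {
	requestID := "unknown"
	if span := opentracing.SpanFromContext(ctx); span != nil {
		if id := span.BaggageItem("request"); id != "" {
			requestID = id
		}
	}

	m.mapLock.Lock()
	blockers := append([]string(nil), m.waiters...) // requests queued ahead of us
	m.waiters = append(m.waiters, requestID)
	m.mapLock.Unlock()

	if len(blockers) > 0 {
		// The real implementation logs something like
		// "Waiting for lock behind N transactions. Blockers=[...]" to the span.
		_ = blockers
	}

	m.realLock.Lock()

	// We acquired the lock; remove ourselves from the wait list.
	m.mapLock.Lock()
	for i, id := range m.waiters {
		if id == requestID {
			m.waiters = append(m.waiters[:i], m.waiters[i+1:]...)
			break
		}
	}
	m.mapLock.Unlock()
}

func (m *Mutex) Unlock() {
	m.realLock.Unlock()
}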
Fortunately, the HotROD application exposes command-line flags to change these
configuration parameters. We can find the flags by running the HotROD binary
with the help command:
$ ./example-hotrod help
HotR.O.D. - A tracing demo application.
[... skipped ...]
Flags:
  -D, --fix-db-query-delay duration      Average latency of MySQL DB query (default 300ms)
  -M, --fix-disable-db-conn-mutex        Disables the mutex guarding db connection
  -W, --fix-route-worker-pool-size int   Default worker pool size (default 3)
[... skipped ...]
The flags that control parameters for latency-affecting logic all start with the
--fix prefix. In this case, we want the flag --fix-disable-db-conn-mutex, or -M
as a short form, to disable the blocking behavior. We also want to reduce the default
300ms latency of simulated database queries, controlled by flag -D, to make it easier
to see the results of this optimization.
Let's restart the HotROD application using these flags, to pretend that we fixed the code
to use a connection pool with enough capacity that our concurrent requests do not have
to compete for connections (the logs are again trimmed for better readability):
$ ./example-hotrod -M -D 100ms all
INFO Using expvar as metrics backend
INFO fix: overriding MySQL query delay {"old": "300ms", "new": "100ms"}
INFO fix: disabling db connection mutex
INFO Starting all services
We can see in the logs that the changes are taking effect. To see how it works
out, reload the HotROD web page and repeat the experiment of issuing many
simultaneous requests by clicking one of the buttons many times in quick succession.
Figure 2.20: Partially improved request latency after "fixing" the database query bottleneck. No requests run
longer than a second, but some are still pretty slow; for example, request #5 is still 50% slower than request #1.
The latency still increases as we add more requests to the system, but it no longer
grows as dramatically as with the single database bottleneck from before. Let's
look at one of the longer traces again.
Figure 2.21: Trace of another pretty slow request, after removing the database query bottleneck
As expected, the mysql span stays at around 100ms, regardless of the load. The
driver span is not expanded, but it takes the same time as before. The interesting
change is in the route calls, which now take more than 50% of the total request time.
Previously, we saw these requests executing in parallel three at a time, but now we
often see only one at a time, and even a gap right after the frontend to driver call
when no requests to the route service are running. Clearly, we have contention with
other goroutines over some limited resource. We can also see that the gaps happen
between the spans of the frontend service, which means the bottleneck is not in the
route service, but in how the frontend service calls it.
This function in the frontend service receives a customer record (with an address) and a list of drivers
(with their current locations), then calculates the expected time of arrival (ETA) for
each driver. It calls the route service for each driver inside an anonymous function
executed via a pool of goroutines, by passing the function to eta.pool.Execute().
Since all these functions are executed asynchronously, we track their completion with
the wait group, wg, which implements a countdown latch: for every new function,
we increment its count with wg.Add(1), and then we block on wg.Wait() until
each of the spawned functions calls wg.Done().
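The actual HotROD code is not reproduced here; the following self-contained sketch illustrates the same pattern, with the worker pool, the sleep standing in for the route call, and all names invented for this example:

package main

import (
	"fmt"
	"sync"
	"time"
)

// workerPool is a minimal stand-in for the goroutine pool used by the
// frontend service: it runs submitted functions on a fixed number of workers.
type workerPool struct{ jobs chan func() }

func newWorkerPool(size int) *workerPool {
	p := &workerPool{jobs: make(chan func())}
	for i := 0; i < size; i++ {
		go func() {
			for job := range p.jobs {
				job()
			}
		}()
	}
	return p
}

func (p *workerPool) Execute(job func()) { p.jobs <- job }

func main() {
	pool := newWorkerPool(3) // only 3 workers => at most 3 route calls in flight
	etas := make([]time.Duration, 10)

	var wg sync.WaitGroup // acts as a countdown latch
	for i := range etas {
		i := i    // capture the loop variable for the closure
		wg.Add(1) // one more task to wait for
		pool.Execute(func() {
			defer wg.Done()                   // count down when the task completes
			time.Sleep(50 * time.Millisecond) // stands in for the call to the route service
			etas[i] = 50 * time.Millisecond
		})
	}
	wg.Wait() // block until every submitted function has called wg.Done()
	fmt.Println("computed", len(etas), "ETAs")
}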
The size of that goroutine pool is defined by a configuration parameter:
RouteWorkerPoolSize = 3
The default value of three explains why we saw, at most, three parallel executions
in the very first trace we inspected. Let's change it to 100 (goroutines in Go are cheap)
using the -W command line flag, and restart HotROD:
$ ./example-hotrod -M -D 100ms -W 100 all
INFO Using expvar as metrics backend
One more time, reload the HotROD web UI and repeat the experiment. We have to
click the buttons really quickly now because the requests come back in less than
half a second.
Figure 2.22: Latency results after fixing the worker pool bottleneck. All requests return in less than half a second.
If we look at one of these new traces, we will see that, as expected, the calls from
frontend to the route service are all done in parallel now, thus minimizing the
overall request latency. We leave the final optimization of the driver service as an
exercise for you.
Figure 2.23: Trace after fixing the worker pool bottleneck. The frontend is able to fire 10 requests to the route
service almost simultaneously.
Resource usage attribution
Suppose a company runs two distinct lines of business on shared infrastructure
that currently consumes 1,000 CPU cores. Let's assume the company projects that
one line of business will grow 10% next year, while the other will grow 100%.
Let's also assume, for simplicity, that the hardware needs are proportional to the
size of each business. We are still not able to predict how much extra capacity the
company will need, because we do not know how those current 1,000 CPU cores
are attributed to each business line.
If the first business line is actually responsible for consuming 90% of the hardware, then
its needs will increase from 900 to 990 cores, and the second business line's needs will
increase from 100 to 200 CPU cores, for a total of 190 extra cores across both business
lines. On the other hand, if the current needs of the business lines are split 50/50, then
the total capacity requirement for next year will be 500 * 1.1 + 500 * 2.0 = 1,550 cores,
that is, 550 extra cores.
The main difficulty in resource usage attribution stems from the fact that most
technology companies use shared resources for running their business. Consider
such products as Gmail and Google Docs. Somewhere, at the top level of the
architecture, they may have dedicated pools of resources, for example, load
balancers and web servers, but the lower we go down the architecture, the
more shared resources we usually find.
At some point, the dedicated resource pools, like web servers, start accessing shared
resources like Bigtable, Chubby, Google File System, and so on. It is often inefficient
to partition those lower layers of architecture into distinct subsets in order to support
multi-tenancy. If we require all requests to explicitly carry the tenant information
as a parameter, for example, tenant="gmail" or tenant="docs", then we can
accurately report resource usage by the business line. However, such a model is
very rigid and hard to extend if we want to break down the attribution by a different
dimension, as we need to change the APIs of every single infrastructure layer to
pass that extra dimension. We will now discuss an alternative solution that relies
on metadata propagation.
We have seen in the HotROD demo that the calculation of the shortest route
performed by the route service is a relatively expensive operation (probably
CPU intensive). It would be nice if we could calculate how much CPU time
we spend per customer. However, the route service is an example of a shared
infrastructure resource that is further down the architectural layers from the point
where we know about the customer. It does not need to know about the customer
in order to calculate the shortest route between two points.
Passing the customer ID to the route service just to measure the CPU usage would
be poor API design. Instead, we can use the distributed metadata propagation
built into the tracing instrumentation. In the context of a trace, we know for which
customer the system is executing the request, and we can use metadata (baggage)
to transparently pass that information throughout the architecture layers, without
changing all the services to accept it explicitly.
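For example, the service that knows the customer (the frontend in HotROD) could attach the ID as a baggage item via the OpenTracing API. The helper function below is a hypothetical sketch, but the baggage key "customer" matches what the route service reads in the code that follows:

package main

import (
	"context"

	"github.com/opentracing/opentracing-go"
)

// attachCustomerBaggage attaches the customer ID to the current trace as
// baggage. Every downstream service that propagates the span context will
// carry this item implicitly, without any change to its own API.
func attachCustomerBaggage(ctx context.Context, customerID string) {
	if span := opentracing.SpanFromContext(ctx); span != nil {
		span.SetBaggageItem("customer", customerID)
	}
}

func main() {
	span := opentracing.NoopTracer{}.StartSpan("dispatch")
	ctx := opentracing.ContextWithSpan(context.Background(), span)
	attachCustomerBaggage(ctx, "123")
	span.Finish()
}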
To demonstrate this approach, the route service contains the code to attribute
the CPU time of calculations to the customer and session IDs, which it reads
from the baggage. In the services/route/server.go file, we can see this code:
func computeRoute(
    ctx context.Context,
    pickup, dropoff string,
) *Route {
    start := time.Now()
    defer func() {
        updateCalcStats(ctx, time.Since(start))
    }()
    // actual calculation ...
}
As with the instrumented mutex we saw earlier, we don't pass any customer/session
IDs because they can be retrieved from baggage via the context. The code actually
uses some static configuration to know which baggage items to extract and how
to report the metrics.
var routeCalcByCustomer = expvar.NewMap(
    "route.calc.by.customer.sec",
)
var routeCalcBySession = expvar.NewMap(
    "route.calc.by.session.sec",
)
var stats = []struct {
    expvar  *expvar.Map
    baggage string
}{
    {routeCalcByCustomer, "customer"},
    {routeCalcBySession, "session"},
}
This code uses the expvar package ("exposed variables") from Go's standard library.
It provides a standard interface to global variables that can be used to accumulate
statistics about the application, such as operation counters, and it exposes these
variables in JSON format via an HTTP endpoint, /debug/vars.
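As a minimal, standalone illustration of the expvar mechanics (the map name follows the code above; the port number is arbitrary for this sketch):

package main

import (
	"expvar"
	"net/http"
)

// Importing expvar registers a handler for /debug/vars on the default mux.
var routeCalcByCustomer = expvar.NewMap("route.calc.by.customer.sec")

func main() {
	routeCalcByCustomer.AddFloat("123", 0.25) // accumulate 250ms for customer 123

	// curl http://localhost:8083/debug/vars now includes (among other variables):
	//   "route.calc.by.customer.sec": {"123": 0.25}
	http.ListenAndServe(":8083", nil)
}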
The expvar variables can be standalone primitives, like float and string, or
they can be grouped into named maps for more dynamic statistics. In the preceding
code, we define two maps: one keyed by customer ID and another by session ID,
and combine them in the stats structure (an array of anonymous structs) with
the names of metadata attributes that contain the corresponding ID.
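The aggregation function itself is not reproduced above; a minimal sketch of what it might look like, reusing the stats table from the preceding snippet (and assuming imports of context, time, and opentracing-go), is:

// updateCalcStats is a sketch of the aggregation step: for each configured
// dimension, read the corresponding baggage item from the span stored in the
// context and add the elapsed time to the matching expvar map.
func updateCalcStats(ctx context.Context, delay time.Duration) {
	span := opentracing.SpanFromContext(ctx)
	if span == nil {
		return
	}
	for _, s := range stats {
		key := span.BaggageItem(s.baggage) // e.g., the customer or session ID
		if key != "" {
			s.expvar.AddFloat(key, delay.Seconds())
		}
	}
}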
This approach is very flexible. If necessary, this static definition of the stats array
can be easily moved to a configuration file to make the reporting mechanism even
more flexible. For example, if we wanted to aggregate data by another dimension,
say the type of web browser (not that it would make a lot of sense), we would need
to add one more entry to the configuration and make sure that the frontend service
captures the browser type as a baggage item.
The key point is that we do not need to change anything in the rest of the services.
In the HotROD demo, the frontend and route services are very close to each other, so
if we had to change the API it would not be a major undertaking. However, in real-life
situations, the service where we may want to calculate resource usage can be many
layers down the stack, and changing the APIs of all the intermediate services, just to
pass an extra resource usage aggregation dimension, is simply not feasible. By using
distributed context propagation, we vastly minimize the number of changes needed.
In Chapter 10, Distributed Context Propagation, we will discuss other uses of metadata
propagation.
In a production environment, using the expvar module is not the best approach,
since the data is stored individually in each service instance. However, our example
has no hard dependency on the expvar mechanism. We could have easily used
a real metrics API and had our resource usage statistics aggregated in a central
metrics system like Prometheus.
Summary
This chapter introduced a demo application, HotROD, that is instrumented for
distributed tracing, and by tracing that application with Jaeger, an open source
distributed tracing system, demonstrated the following features common to
most end-to-end tracing systems:
• monitoring of a distributed transaction as it propagates through multiple microservices
• root cause and latency analysis using Gantt charts, span tags, and contextualized span logs
• understanding the architecture and data flow of an application from its traces
• distributed context propagation (baggage) and its use for purposes such as resource usage attribution
References
1. Yuri Shkuro, Evolving Distributed Tracing at Uber Engineering, Uber Engineering blog, February 2017: https://eng.uber.com/distributed-tracing/.
2. Natasha Woods, CNCF Hosts Jaeger, Cloud Native Computing Foundation blog, September 2017: https://www.cncf.io/blog/2017/09/13/cncf-hosts-jaeger/.
3. The OpenTracing Authors, Semantic Conventions, The OpenTracing Specification: https://github.com/opentracing/specification/blob/master/semantic_conventions.md.
Distributed Tracing Fundamentals
In the previous chapter, we saw a tracing system in action from the end user
perspective. In this chapter, we will discuss the basic ideas underlying distributed
tracing; the various approaches to implementing end-to-end tracing that have been
proposed in industry and academia; and the impact and trade-offs of the architectural
decisions taken by different tracing systems on their capabilities and the types of
problems they can address.
The idea
Consider the following, vastly simplified architectural diagram of a hypothetical
e-commerce website. Each node in the diagram represents numerous instances
of the respective microservices, handling many concurrent requests. To help with
understanding the behavior of this distributed system and its performance or user-
visible latency, end-to-end tracing records information about all the work performed
by the system on behalf of a given client or request initiator. We will refer to this
work as execution or request throughout this book.
The data is collected by means of instrumentation trace points. For example, when
the client is making a request to the web server, the client's code can be instrumented
with two trace points: one for sending the request and another for receiving the
response. The collected data for a given execution is collectively referred to as a trace.
One simple way to visualize a trace is via a Gantt chart, as shown on the right in
Figure 3.1:
Request correlation
The basic concept of distributed tracing appears to be very straightforward: instrument
the application with trace points, correlate the data they produce for each request,
and reconstruct the execution as a trace.
Of course, things are rarely as simple as they appear. There are multiple design
decisions taken by the existing tracing systems, affecting how these systems perform,
how difficult they are to integrate into existing distributed applications, and even
what kinds of problems they can or cannot help to solve.
The ability to collect and correlate profiling data for a given execution or request
initiator, and identify causally-related activities, is arguably the most distinctive
feature of distributed tracing, setting it apart from all other profiling and
observability tools. Different classes of solutions have been proposed in the industry
and academia to address the correlation problem. Here, we will discuss the three
most common approaches: black-box inference, domain-specific schemas, and
metadata propagation.
Black-box inference
Techniques that do not require modifying the monitored system are known as
black-box monitoring. Several tracing infrastructures have been proposed that use
statistical analysis or machine learning (for example, the Mystery Machine [2]) to
infer causality and request correlation by consuming only the records of the events
occurring in the programs, most often by reading their logs. These techniques are
attractive because they do not require modifications to the traced applications, but
they have difficulties attributing causality in the general case of highly concurrent
and asynchronous executions, such as those observed in event-driven systems.
Their reliance on "big data" processing also makes them more expensive and higher
latency compared to the other methods.
Schema-based
Magpie [3] proposed a technique that relied on manually-written, application-
specific event schemas that allowed it to extract causality relationships from the
event logs of production systems. Similar to the black-box approach, this technique
does not require the applications to be instrumented explicitly; however, it is less
general, as each application requires its own schemas.
This approach is not particularly suitable for modern distributed systems that
consist of hundreds of microservices because it would be difficult to scale the
manual creation of event schemas. The schema-based technique requires all events
to be collected before the causality inference can be applied, so it is less scalable than
other methods that allow sampling.
Metadata propagation
What if the instrumentation trace points could annotate the data they produce
with a global identifier – let's call it an execution identifier – that is unique for each
traced request? Then the tracing infrastructure receiving the annotated profiling data
could easily reconstruct the full execution of the request, by grouping the records
by the execution identifier. So, how do the trace points know which request is being
executed when they are invoked, especially trace points in different components of
a distributed application? The global execution identifier needs to be passed along
the execution flow. This is achieved via a process known as metadata propagation
or distributed context propagation.
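A bare-bones sketch of this idea over HTTP, without any tracing library, might look as follows; the header name X-Request-Id is just an example for this sketch, not a standard:

package main

import (
	"math/rand"
	"net/http"
	"strconv"
)

// Client side: forward the execution identifier with every downstream call.
func callDownstream(url, executionID string) (*http.Response, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-Request-Id", executionID) // metadata travels with the request
	return http.DefaultClient.Do(req)
}

// Server side: extract the identifier (or create one at the edge of the system)
// and annotate every profiling record with it.
func handler(w http.ResponseWriter, r *http.Request) {
	executionID := r.Header.Get("X-Request-Id")
	if executionID == "" {
		executionID = strconv.FormatUint(rand.Uint64(), 16) // the first hop creates it
	}
	// ... record trace events tagged with executionID,
	// and pass it along to any further downstream calls.
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}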
Figure 3.2: Propagating the execution identifier as request metadata. The first service in the
architecture (client) creates a unique execution identifier (Request ID) and passes it to the next
service via metadata/context. The remaining services keep passing it along in the same way.
Figure 3.3: Metadata propagation in a single service. (1) The Handler that processes the inbound request
is wrapped into instrumentation that extracts metadata from the request and stores it in a Context object
in memory. (2) Some in-process propagation mechanism, for example, based on thread-local variables.
(3) Instrumentation wraps an RPC client and injects metadata into outbound (downstream) requests.
An astute reader may have noticed that the notion of propagating metadata alongside
request execution is not limited to only passing the execution identifier for tracing
purposes. Metadata propagation can be thought of as a prerequisite for distributed
tracing, or distributed tracing can be thought of as an application built on top of
distributed context propagation. In Chapter 10, Distributed Context Propagation we
will discuss a variety of other possible applications.
Special trace points at the edges of the microservice, which we can call inject and
extract trace points, are also responsible for encoding and decoding metadata for
passing it across process boundaries. In certain cases, the inject/extract trace points
are used even between libraries and components, for example, when a Python code
is making a call to an extension written in C, which may not have direct access to the
metadata represented in a Python data structure.
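Expressed with the OpenTracing Go API (which we cover in detail in the next chapter), such inject and extract trace points might look roughly like this sketch:

package main

import (
	"net/http"

	"github.com/opentracing/opentracing-go"
	"github.com/opentracing/opentracing-go/ext"
)

// injectMetadata is the "inject" trace point: it encodes the span context
// into the outbound request headers.
func injectMetadata(tracer opentracing.Tracer, span opentracing.Span, req *http.Request) error {
	return tracer.Inject(
		span.Context(),
		opentracing.HTTPHeaders,
		opentracing.HTTPHeadersCarrier(req.Header),
	)
}

// extractMetadata is the "extract" trace point: it decodes the metadata from
// the inbound request and starts a server-side span linked to the caller.
func extractMetadata(tracer opentracing.Tracer, r *http.Request) opentracing.Span {
	spanCtx, _ := tracer.Extract(
		opentracing.HTTPHeaders,
		opentracing.HTTPHeadersCarrier(r.Header),
	)
	return tracer.StartSpan("handle-request", ext.RPCServerOption(spanCtx))
}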
The Tracing API is implemented by a concrete tracing library that reports the
collected data to the tracing backend, usually with some in-memory batching to
reduce the communications overhead. Reporting is always done asynchronously
in the background, off the critical path of the business requests. The tracing backend
receives the tracing data, normalizes it to a common trace model representation,
and puts it in a persistent trace storage. Because tracing data for a single request
usually arrives from many different hosts, the trace storage is often organized to
store individual pieces incrementally, indexed by the execution identifier. This
allows for later reconstruction of the whole trace for the purpose of visualization,
or additional processing through aggregations and data mining.
Sampling
Sampling affects which records produced by the trace points are captured by the
tracing infrastructure. It is used to control the volume of data the tracing backend
needs to store, as well as the performance overhead and impact on the applications
from executing tracing instrumentation. We discuss sampling in detail in Chapter 8,
All About Sampling.
Preserving causality
If we only pass the execution identifier as request metadata and tag tracing records
with it, it is sufficient to reassemble that data into a single collection, but it is not
sufficient to reconstruct the execution graph of causally-related activities. Tracing
systems need to capture causality that allows assembling the data captured by the
trace points in the correct sequence. Unfortunately, knowing which activities are
truly causally-related is very difficult, even with very invasive instrumentation. Most
tracing systems elect to preserve Lamport's happens-before relation [4], denoted as
→ and formally defined as the least strict partial order on events, such that:
• if events a and b occur in the same process and a occurs before b, then a → b
• if a is the sending of a message by one process and b is the receipt of that message by another process, then a → b
• if a → b and b → c, then a → c
The happens-before relation can be too indiscriminate if applied liberally: may have
influenced is not the same as has influenced. The tracing infrastructures rely on the
additional domain knowledge about the systems being traced, and the execution
environment, to avoid capturing irrelevant causality. By threading the metadata along
the individual executions, they establish the relationships between items with the
same or related metadata (that is, metadata containing different trace point IDs but the
same execution ID). The metadata can be static or dynamic throughout the execution.
Tracing infrastructures that use static metadata, such as a single unique execution
identifier, throughout the life cycle of a request, must capture additional clues via
trace points, in order to establish the happens-before relationships between the
events. For example, if part of an execution is performed on a single thread, then
using the local timestamps allows correct ordering of the events. Alternatively, in
a client-server communication, the tracing system may infer that the sending of
a network message by the client happens before the server receiving that message.
Similar to black-box inference systems, this approach cannot always identify
causality between events when additional clues are lost or not available from the
instrumentation. It can, however, guarantee that all events for a given execution
will be correctly identified.
In the following diagram, we see five trace points causally linked to a single
execution. The metadata propagated after each trace point is a three-part tuple
(execution ID, event ID, and parent ID). Each trace point stores the parent event
ID from inbound metadata as part of its captured trace record. The fork at trace
point b and join at trace point e illustrate how causal relationships forming
a directed acyclic graph can be captured using this scheme.
Using fixed-width dynamic metadata, the tracing infrastructure can explicitly record
happens-before relationships between trace events, which gives it an edge over the
static metadata approach. However, it is also somewhat brittle: if some of the trace
records are lost, the infrastructure may no longer be able to order the remaining
events according to causality.
When using end-to-end tracing on distributed systems, where some loss of profiling
data is a fact of life, some tracing infrastructures, for example, Azure Application
Insights, use variable-width dynamic metadata, which grows as the execution
travels further down the call graph from the request origin.
The following diagram illustrates this approach, where each next event ID is
generated by appending a sequence number to the previous event ID. When a fork
happens at event 1, two distinct sequence numbers are used to represent parallel
events 1.1 and 1.2. The benefit of this scheme is higher tolerance to data loss; for
example, if the record for event 1.2 is lost, it is still possible to infer the happens-
before relationship 1 → 1.2.1.
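A minimal sketch of generating such hierarchical identifiers; the type and method names are invented for this example:

package main

import (
	"fmt"
	"strconv"
	"sync/atomic"
)

// eventID is a sketch of variable-width dynamic metadata: every child event
// appends its own sequence number to the parent's identifier.
type eventID struct {
	id      string
	nextSeq uint64
}

// child allocates the next sequence number and produces IDs like "1.1", "1.2", "1.2.1".
func (e *eventID) child() *eventID {
	seq := atomic.AddUint64(&e.nextSeq, 1)
	return &eventID{id: e.id + "." + strconv.FormatUint(seq, 10)}
}

func main() {
	root := &eventID{id: "1"}
	a := root.child()                     // "1.1"
	b := root.child()                     // "1.2" - a fork: two parallel branches
	fmt.Println(a.id, b.id, b.child().id) // 1.1 1.2 1.2.1
	// Even if the record for "1.2" is lost, the relationship 1 → 1.2.1 can still
	// be inferred from the structure of the identifier alone.
}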
Inter-request causality
Sambasivan and others [10] argue that another critical architectural decision that
significantly affects the types of problems an end-to-end tracing infrastructure is
able to address is the question of how it attributes latent work. For example, a request
may write data to a memory buffer that is flushed to the disk at a later time, after the
originating request has been completed. Such buffers are commonly implemented
for performance reasons and, at the time of writing, the buffer may contain data
produced by many different requests. The question is: who is responsible for
the use of resources and the time spent by the system on writing the buffer out?
The work can be attributed to the last request that made the buffer full and caused
the write (trigger-preserving attribution), or it can be attributed proportionally to all
requests that produced the data into the buffer before the flush (submitter-preserving
attribution). Trigger-preserving attribution is easier to implement because it does not
require access to the instrumentation data about the earlier executions that affected
the latent work.
Trace models
In Figure 3.4, we saw a component called "Collection/Normalization." The purpose
of this component is to receive tracing data from the trace points in the applications
and convert it to some normalized trace model, before saving it in the trace storage.
Aside from the usual architectural advantages of having a façade on top of the
trace storage, the normalization is especially important when we are faced with the
diversity of instrumentations. It is quite common for many production environments
to be using numerous versions of instrumentation libraries, from very recent ones
to some that are several years old. It is also common for those versions to capture
trace data in very different formats and models, both physical and conceptual. The
normalization layer acts as an equalizer and translates all those varieties into a single
logical trace model, which can later be uniformly processed by the trace visualization
and analysis tools. In this section, we will focus on two of the most popular
conceptual trace models: event model and span model.
Event model
So far, we have discussed tracing instrumentation taking the form of trace points
that record events when the request execution passes through them. An event
represents a single point in time in the end-to-end execution. Assuming that we
also record the happens-before relationships between these events, we intuitively
arrive at the model of a trace as a directed acyclic graph, with nodes representing
the events and edges representing the causality.
Some tracing systems (for example, X-Trace [5]) use such an event model as the
final form of the traces they surface to the user. The diagram in Figure 3.7 illustrates
an event graph observed from the execution of an RPC request/response by a client-
server application. It includes events collected at different layers of the stack, from
application-level events (for example, "client send" and "server receive") to events
in the TCP/IP stack.
The graph contains multiple forks used to model request execution at different
layers, and multiple joins where these logical parallel executions converge to higher-
level layers. Many developers find the event model difficult to work with because it
is too low level and obscures useful higher-level primitives. For example, it is natural
for the developer of the client application to think of the RPC request as a single
operation that has start (client sent) and end (client receive) events. However, in the
event graph these two nodes are far apart.
Figure 3.7: Trace representation of an RPC request between client and server in the event model,
with trace events recorded at application and TCP/IP layers
The next diagram (Figure 3.8) shows an even more extreme example, where
a fairly simple workflow becomes hard to decipher when represented as an
event graph. A frontend Spring application running on Tomcat is calling another
application called remotesrv, which is running on JBoss. The remotesrv application
is making two calls to a PostgreSQL database.
It is easy to notice that aside from the "info" events shown in boxes with rounded
corners, all other records come in pairs of entry and exit events. The info events
are interesting in that they look almost like noise: they most likely contain useful
information if we had to troubleshoot this particular workflow, but they do not add
much to our understanding of the shape of the workflow itself. We can think of them
as info logs, only captured via trace points. We also see an example of fork and join
because the info event from tomcat-jbossclient happens in parallel with the
execution happening in the remotesrv application.
Figure 3.8: Event model-based graph of an RPC request between a Spring application running
on Tomcat and a remotesrv application running on JBoss, and talking to a PostgreSQL database.
The boxes with rounded corners represent simple point-in-time "info" events.
Span model
Having observed that, as in the preceding example, most execution graphs include
well-defined pairs of entry/exit events representing certain operations performed
by the application, Sigelman and others [6] proposed a simplified trace model,
which made the trace graphs much easier to understand. In Dapper [6], which
was designed for Google's RPC-heavy architecture, the traces are represented as
trees, where tree nodes are basic units of work referred to as spans. The edges in
the tree, as usual, indicate causal relationships between a span and its parent span.
Each span is a simple log of timestamped records, including its start and end time,
a human-readable operation name, and zero or more intermediary application-
specific annotations in the form of (timestamp, description) pairs, which are
equivalent to the info events in the previous example.
Figure 3.9: Using the span model to represent the same RPC execution as in Figure 3.8.
Left: the resulting trace as a tree of spans. Right: the same trace shown as a Gantt chart. The info events
are no longer included as separate nodes in the graph; instead they are modeled as timestamped annotations
in the spans, shown as pills in the Gantt chart.
Each span is assigned a unique ID (for example, a random 64-bit value), which is
propagated via metadata along with the execution ID. When a new span is started,
it records the ID of the previous span as its parent ID, thus capturing the causality.
In the preceding example, the remote server represents its main operation in the
span with ID=6. When it makes a call to the database, it starts another span with
ID=7 and parent ID=6.
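Expressed with the OpenTracing Go API, the parent-child relationship from this example might look like the following sketch (the operation names are illustrative):

package main

import "github.com/opentracing/opentracing-go"

func main() {
	tracer := opentracing.NoopTracer{} // any OpenTracing-compatible tracer

	// The server's main operation (ID=6 in the example above).
	serverSpan := tracer.StartSpan("remotesrv: handle request")

	// The database call starts a new span and records the parent reference
	// (ID=7 with parent ID=6 in the example above).
	dbSpan := tracer.StartSpan(
		"postgresql: query",
		opentracing.ChildOf(serverSpan.Context()),
	)

	dbSpan.Finish()
	serverSpan.Finish()
}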
Dapper originally advocated for the model of multi-server spans, where a client
application that makes an RPC call creates a new span ID, and passes it as part of
the call, and the server that receives the RPC logs its events using the same span ID.
Unlike the preceding figure, the multi-server span model resulted in fewer spans
in the tree because each RPC call is represented by only one span, even though two
services are involved in doing the work as part of that RPC. This multi-server span
model was used by other tracing systems, such as Zipkin [7] (where spans were
often called shared spans). It was later discovered that this model unnecessarily
complicates the post-collection trace processing and analysis, so newer tracing
systems like Jaeger [8] opted for a single-host span model, in which an RPC call
is represented by two separate spans: one on the client and another on the server,
with the client span being the parent.
The tree-like span model is easy to understand for the programmers, whether they
are instrumenting their applications or retrieving the traces from the tracing system
for analysis. Because each span has only one parent, the causality is represented with
a simple call-stack-style view of the computation, which is easy to implement and to
reason about.
Effectively, traces in this model look like distributed stack traces, a concept
very intuitive to all developers. This makes the span model for traces the most
popular in the industry, supported by the majority of tracing infrastructures. Even
tracing systems that collect instrumentations in the form of single point-in-time
events (for example, Canopy [9]) go to the extra effort to convert trace events into
something very similar to the span model. Canopy authors claim that "events
are an inappropriate abstraction to expose to engineers adding instrumentation
to systems," and propose another representation they call modeled trace, which
describes the requests in terms of execution units, blocks, points, and edges.
The original span model introduced in Dapper was only able to represent
executions as trees. It struggled to represent other execution models, such as queues,
asynchronous executions, and multi-parent causality (forks and joins). Canopy
works around that by allowing instrumentation to record edges for non-obvious
causal relationships between points. The OpenTracing API, on the other hand,
sticks with the classic, simpler span model but allows spans to contain multiple
"references" to other spans, in order to support joins and asynchronous execution.
Clearly, we cannot trust the timestamps to be actually correct, but this is not what
we often look for when we analyze distributed traces. It is more important that
timestamps in the trace are correctly aligned relative to each other. When the
timestamps are from the same process, such as the start of the server span and the
extra info annotations in the following diagram, we can assume that their relative
positions are correct. The timestamps from different processes on the same host are
generally incomparable because even though they are not subject to the hardware
clock skew, the accuracy of the timestamps depends on many other factors, such as
what programming language is used for a given process and what time libraries it is
using and how. The timestamps from different servers are definitely incomparable
due to hardware clock drifts, but we can do something about that.
Figure 3.10: Clock skew adjustment. When we know the causality relationships between the events,
such as "client-send must happen before server-receive", we can consistently adjust the timestamps for
one of the two services, to make sure that the causality constraints are satisfied. The annotations within
the span do not need to be adjusted, since we can assume their timestamps to be accurate relative to the
beginning and end timestamps of the span.
Consider the client and server spans at the top diagram in Figure 3.10. Let's
assume that we know from instrumentation that this was a blocking RPC request,
that is, the server could not have received the request before the client sent it,
and the client could not have received the response before the server finished the
execution (this reasoning only works if the client span is longer than the server
span, which is not always the case). These basic causality rules allow us to detect
if the server span is misaligned on the timeline based on its reported timestamps,
as we can see in the example. However, we don't know how much it is misaligned.
We can adjust the timestamps for all events originating from the server process
by shifting it to the left until its start and end events fall within the time range
of the larger client span, as shown at the bottom of the diagram. After this
adjustment, we end up with two variables, t1 and t2 (the network delays of the request
and of the response), that are still unknown to us. If there are no more occurrences of
client and server interaction in the given trace, and no additional causality information,
we can make an arbitrary decision on how to set the variables, for example, by
positioning the server span exactly in the middle of the client span:

t1 = t2 = (duration of client span - duration of server span) / 2
The values of t1 and t2 calculated this way provide us with an estimate of the time
spent by the RPC in network communication. We are making an arbitrary assumption
that both request and response took roughly the same time to be transmitted over
the network. In other cases, we may have additional causality information from the
trace, for example the server may have called a database and then another node in
the trace graph called the same database server. That gives us two sets of constraints
on the possible clock skew adjustment of the database spans. For example, from
the first parent we want to adjust the database span by -2.5 ms and from the second
parent by -5.5 ms. Since it's the same database server, we only need one adjustment
to its clock skew, and we can try to find the one that works for both calling nodes
(maybe it's -3.5 ms), even though the child spans may not be exactly in the middle
of the parent spans, as we have arbitrarily done in the preceding formula.
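A sketch of the single-pair adjustment described above; the variable names and example timestamps are made up:

package main

import (
	"fmt"
	"time"
)

// adjustServerSpan computes the clock-skew correction for a blocking RPC by
// centering the server span inside the client span, as described above.
// It returns the offset to add to all timestamps from the server process.
func adjustServerSpan(clientStart, clientEnd, serverStart, serverEnd time.Time) time.Duration {
	clientDur := clientEnd.Sub(clientStart)
	serverDur := serverEnd.Sub(serverStart)
	if serverDur > clientDur {
		return 0 // the heuristic only works when the client span is longer
	}
	network := (clientDur - serverDur) / 2   // assume t1 == t2
	desiredStart := clientStart.Add(network) // where the server span should begin
	return desiredStart.Sub(serverStart)     // skew to apply to the server's clock
}

func main() {
	base := time.Now()
	offset := adjustServerSpan(
		base, base.Add(100*time.Millisecond), // client span: 100ms
		base.Add(-20*time.Millisecond), base.Add(60*time.Millisecond), // server span: 80ms, skewed
	)
	fmt.Println("apply skew adjustment:", offset) // 30ms in this example
}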
In general, we can walk the trace and aggregate a large number of constraints
using this approach. Then we can solve them as a set of linear equations to obtain a
full set of clock skew adjustments, which we can then apply to the trace to align the spans.
In the end, the clock skew adjustment process is always heuristic, since we typically
don't have other reliable signals to calculate it precisely. There are scenarios when
this heuristic technique goes wrong and the resulting trace views make little sense
to the users. Therefore, the tracing systems are advised to provide both adjusted and
unadjusted views of the traces, as well as to clearly indicate when the adjustments
are applied.
Trace analysis
Once the trace records are collected and normalized by the tracing infrastructure,
they can be used for analysis, using visualizations or data mining algorithms.
We will cover some of the data mining techniques in Chapter 12, Gathering Insights
with Data Mining.
Tracing system implementers are always looking for new creative visualizations
of the data, and end users often build their own views based on specific features
they are looking for. Some of the most popular and easy-to-implement views
include Gantt charts, service graphs, and request flow graphs.
We have seen examples of Gantt charts in this chapter. Gantt charts are mostly
used to visualize individual traces. The x axis shows relative time, usually from the
beginning of the request, and the y axis represents different layers and components
of the architecture participating in the execution of the request. Gantt charts are
good for analyzing the latency of the requests, as they easily show which spans in
the trace take the longest time, and combined with critical path analysis can zoom
in on problematic areas. The overall shape of the chart can reveal other performance
problems at a glance, like the lack of parallelism among sub-requests or unexpected
synchronization/blocking.
Service graphs are constructed from a large corpus of traces. Fan-outs from a node
indicate calls to other components. This visualization can be used for analysis of
service dependencies in large microservices-based applications. The edges can be
decorated with additional information, such as the frequency of calls between two
given components in the corpus of traces.
Summary
This chapter introduced the fundamental principles underlying most open source,
commercial, and academic distributed tracing systems, and the anatomy of a
typical implementation. Metadata propagation is the most popular and most frequently
implemented approach to correlating tracing records with a particular execution and
capturing causal relationships. The event model and the span model are the two competing
trace representations, trading expressiveness for ease of use.
References
1. Bryan M. Cantrill, Michael W. Shapiro, and Adam H. Leventhal. Dynamic
Instrumentation of Production Systems. Proceedings of the 2004 USENIX
Annual Technical Conference, June 27-July 2, 2004.
2. Michael Chow, David Meisner, Jason Flinn, Daniel Peek, Thomas F. Wenisch.
The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet
Services. Proceedings of the 11th USENIX Symposium on Operating Systems
Design and Implementation. October 6–8, 2014.
3. Paul Barham, Austin Donnelly, Rebecca Isaacs, and Richard Mortier.
Using Magpie for request extraction and workload modelling. OSDI '04:
Proceedings of the 6th USENIX Symposium on Operating Systems
Design and Implementation, 2004.
4. Leslie Lamport. Time, clocks, and the ordering of events in a distributed system.
Communications of the ACM, 21 (7), July 1978.
5. Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, and
Ion Stoica. X-Trace: a pervasive network tracing framework. In NSDI '07:
Proceedings of the 4th USENIX Symposium on Networked Systems
Design and Implementation, 2007.
6. Benjamin H. Sigelman, Luiz A. Barroso, Michael Burrows, Pat Stephenson,
Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag.
Dapper, a large-scale distributed system tracing infrastructure. Technical Report
dapper-2010-1, Google, April 2010.
7. Chris Aniszczyk. Distributed Systems Tracing with Zipkin. Twitter Engineering blog, June 2012: https://blog.twitter.com/engineering/en_us/a/2012/distributed-systems-tracing-with-zipkin.html.
8. Yuri Shkuro. Evolving Distributed Tracing at Uber Engineering. Uber Engineering blog, February 2017: https://eng.uber.com/distributed-tracing/.
9. Jonathan Kaldor, Jonathan Mace, Michał Bejda, Edison Gao, Wiktor
Kuropatwa, Joe O'Neill, Kian Win Ong, Bill Schaller, Pingjia Shan,
Brendan Viscomi, Vinod Venkataraman, Kaushik Veeraraghavan, and
Yee Jiun Song. Canopy: An End-to-End Performance Tracing and Analysis System.
Symposium on Operating Systems Principles, October 2017.
10. Raja R. Sambasivan, Rodrigo Fonseca, Ilari Shafer, and Gregory R. Ganger.
So, You Want To Trace Your Distributed System? Key Design Insights from Years
of Practical Experience. Carnegie Mellon University Parallel Data Lab
Technical Report CMU-PDL-14-102, April 2014.
11. The OpenTracing Project: http://opentracing.io/.
Part II: Data Gathering Problem
Instrumentation Basics with OpenTracing
In the previous chapter, we looked into the theory behind end-to-end tracing, and
various architectural decisions one must make when building a distributed tracing
infrastructure, including which data formats can be used for propagating metadata
between processes and for exporting tracing data to a tracing backend. Fortunately,
as we will see in this chapter, an end user of a tracing infrastructure, someone who
wants to instrument their business application, or their open source framework,
or library, typically does not need to worry about those decisions.
We only briefly touched upon the notion of instrumentation and trace points before,
so in this chapter, we will dive deep into the question of instrumentation, using three
canonical "Hello, World!" applications in Go, Java, and Python. You may be having
Jules Winnfield's reflex right now: "Say Hello, World! again," but I promise to make
it interesting.
The application will be built with microservices, using a database, and occasionally
will spit out "politically incorrect" responses. We will use the OpenTracing APIs
from the OpenTracing project [1] to make our instrumentation portable across
many tracing vendors, and we will cover such topics as creating entry/exit trace
points; annotating spans with tags and timestamped events; encoding and decoding
metadata for transferring it over the wire; and the mechanisms for in-process context
propagation provided by the OpenTracing APIs.
• Exercise 6: Auto-instrumentation:
° Use existing open source instrumentation
° Use zero-touch instrumentation
After completing this chapter, you will have the knowledge and understanding
of how to apply instrumentation to your own applications or frameworks to hit
the ground running with distributed tracing.
Prerequisites
In order to run the examples in this chapter, we need to prepare the development
environment for each of the three programming languages and run the tracing
backend. This section provides instructions on setting up the required dependencies.
The last argument to the git clone command is to ensure that the directory is not
created with the .git suffix, otherwise it will confuse the Go compiler. If you are
not planning on running Go examples, then you can clone the source code in any
directory of your choosing, as Python and Java won't care.
To make it easier to refer to the main directory throughout this chapter, let's define
an environment variable for convenience:
$ cd Mastering-Distributed-Tracing/Chapter04
$ export CH04=`pwd`
$ echo $CH04
/Users/yurishkuro/gopath/src/github.com/PacktPublishing/Mastering-Distributed-Tracing/Chapter04
All examples are grouped by language first. The main directory for the language
contains project files, such as pom.xml or requirements.txt, and a list of exercise#
directories with the final code for each exercise. You may also find lib directories,
which are used for code that is shared across exercises.
All code examples, except for exercise1 and exercise4a, are built upon the
previous exercises. You can take the code in {lang}/exercise1 modules as the
starting point and keep improving it in the first half of the chapter, and then move
on to {lang}/exercise4a to do the same.
Go development environment
Please refer to the documentation (https://golang.org/doc/install) for Go
development environment installation instructions. The examples have been tested
with Go version 1.10.x. In addition to the standard toolchain, you will need to have
dep, a dependency management tool. Please see https://github.com/golang/
dep for installation instructions. Once installed, run dep ensure to download all
the necessary dependencies:
$ cd $CH04/go
$ dep ensure
In order to guarantee repeatable builds, dep uses a Gopkg.lock file, where the
dependencies are resolved to specific versions or Git commits. When we run dep
ensure, it downloads all the dependencies and stores them in the vendor folder.
Python development environment
$ cd $CH04/python
$ virtualenv env
$ source env/bin/activate
$ pip install -r requirements.txt
This will create a subdirectory env containing the virtual environment, which
we then activate, and we run pip to install the dependencies. If you are more
comfortable with other Python environment tools, like pipenv or pyenv, feel
free to use those.
MySQL database
The application we are going to use will be making calls to MySQL database.
We do not have exotic requirements, so any version of MySQL should work,
but I specifically tested with the Community Server edition version 5.6. You can
download and install it locally from https://dev.mysql.com/downloads/mysql/,
but I recommend running it as a Docker container:
$ docker run -d --name mysql56 -p 3306:3306 \
-e MYSQL_ROOT_PASSWORD=mysqlpwd mysql:5.6
cae5461f5354c9efd4a3a997a2786494a405c7b7e5b8159912f691a5b3071cf6
$ docker logs mysql56 | tail -2
2018-xx-xx 20:01:17 1 [Note] mysqld: ready for connections.
Version: '5.6.42' socket: '/var/run/mysqld/mysqld.sock' port: 3306
MySQL Community Server (GPL)
You may need to define a user and permissions to create the database, which
is outside of the scope of this book. For simplicity, we used the default user root
to access the database (which you should never do in production) with password
mysqlpwd.
The source code contains a file called database.sql with SQL instructions to create
the database chapter04 and the table people, and populate it with some data:
$ docker exec -i mysql56 mysql -uroot -pmysqlpwd < $CH04/database.sql
Warning: Using a password on the command line interface can be insecure.
If you are using a local installation, you can run mysql directly:
$ mysql -u root -p < $CH04/database.sql
Enter password:
OpenTracing
Before we jump into the exercises, let's talk about the OpenTracing project.
In October 2015, Adrian Cole, the lead maintainer of Zipkin, organized and hosted
a "Distributed Tracing and Zipkin Workshop" at the Pivotal office in San Francisco.
The attendees were a mix of commercial tracing vendors, open source developers,
and engineers from a number of companies who were in charge of building or
deploying tracing infrastructure in their organizations.
A common theme in the hallway conversations was that the single largest obstacle
to the wide adoption of tracing in large organizations was the lack of reusable
instrumentation for a vast number of open source frameworks and libraries, due
to the absence of standard APIs. It was forcing all vendors; open source tracing
systems, like Zipkin; and the end users to implement instrumentation over and
over for the same popular software and frameworks.
One common mistake people make about the OpenTracing project is thinking that
it provides the actual end-to-end tracing infrastructure. We will see in Chapter 6,
Tracing Standards and Ecosystem, that there are five different problems that someone
deploying a tracing system in an organization needs to address. The OpenTracing
project solves one and only one of those problems: providing vendor-neutral APIs
for the instrumentation, as well as the reusable instrumentation libraries for popular
frameworks. This is very likely the problem that has the largest audience because if
we have an organization of several thousand software engineers, only a few of them
will be actually dealing with deploying the tracing infrastructure. The rest of the
engineers are going to develop their own applications and will expect either to have
the tracing instrumentation included with their infrastructure libraries, or to have
a narrow, well-defined API that they can use to instrument their own code.
The OpenTracing APIs allow developers to focus on what they know best: describing
the semantic behavior of the distributed transactions executed by their software. All
other concerns of tracing, such as the exact wire format of metadata and the format
of span data, are delegated to the implementations of the OpenTracing APIs that
the end users can swap without changing their code.
The OpenTracing APIs define two primary entities: a tracer and a span. A tracer
is a singleton that is responsible for creating spans and exposing methods for
transferring the context across process and component boundaries. For example,
in Go, the Tracer interface has just three methods:
type Tracer interface {
    StartSpan(operationName string, opts ...StartSpanOption) Span
    Inject(sc SpanContext, format interface{}, carrier interface{}) error
    Extract(format interface{}, carrier interface{}) (SpanContext, error)
}

The second entity, the span, represents a unit of work. An abbreviated version of the Span interface looks like this:

type Span interface {
    Finish()
    Context() SpanContext
    Tracer() Tracer
    // ... plus methods for setting the operation name, tags, logs, and baggage
}
As we can see, span is mostly a write-only API. With the exception of the baggage
API, which we will discuss later, all other methods are used to write data to
the span, without being able to read it back. This is intentional, since requiring
implementations to provide a read API for the recorded data would impose
additional restrictions on how the implementation can process that data internally.
Span context is another important concept in the OpenTracing APIs. In the previous
chapter, we discussed that tracing systems are able to track distributed executions
by propagating metadata along the execution path of the request. Span context is the
in-memory representation of that metadata. It is an interface that does not actually
have any methods, except for the baggage iterator, because the actual representation
of the metadata is implementation-specific. Instead, the Tracer interface provides
Inject() and Extract() methods that allow the instrumentation to encode
metadata represented by the span context to and from some wire representation.
The causal relationship between two spans is represented with a span reference, which is a combination of two values: a reference type, which describes the nature of the relationship, and a span context, which identifies the referenced span. Span
references can only be recorded in the span when it is started, which prevents loops
in the causality graph. We will come back to span references later in this chapter.
Figure 4.1: A representation of the OpenTracing conceptual model in the physical Thrift data model in Jaeger
The references list contains the links to ancestor spans in the causality DAG.
The tags are represented with the KeyValue struct, which contains a key and
a value of one of the five types.
The Log struct is a combination of a timestamp and a nested list of key-value pairs
(called "log fields" in the OpenTracing API). There is no data type for span context
because this is Jaeger's backend data model, and span context is only needed when
the execution is in progress, in order to pass the metadata, and establish causality
references between spans. If we look at the actual Jaeger implementation of the
OpenTracing API for Go, we will find this implementation of SpanContext interface:
type SpanContext struct {
    traceID TraceID
    spanID  SpanID
    flags   byte
    baggage map[string]string
    debugID string
}
Now that we have reviewed the basics of the OpenTracing API, let's see it in action.
Our first exercise uses a simple "Hello" application that takes a person's name as a URL path parameter and responds with a greeting:
$ curl http://localhost:8080/sayHello/Margo
Hello, Margo!
The application has some creepy big brother tendencies, however, by occasionally
volunteering additional knowledge about the person:
$ curl http://localhost:8080/sayHello/Vector
Hello, Vector! Committing crimes with both direction and magnitude!
$ curl http://localhost:8080/sayHello/Nefario
Hello, Dr. Nefario! Why ... why are you so old?
It looks up the information in the MySQL database that we created and seeded
earlier. In the later exercises, we will extend this application to run several
microservices.
Hello application in Go
The working directory for all Go exercises is $CH04/go.
Let's run the application (I removed the date and timestamps from the log
messages for brevity):
$ go run ./exercise1/hello.go
Listening on http://localhost:8080/
Now that the HTTP server is running, let's query it from another Terminal window:
$ curl http://localhost:8080/sayHello/Gru
Hello, Felonius Gru! Where are the minions?%
exercise1/people:
repository.go
We can see that the application consists of the main file, hello.go, and a data
repository module called people. The data repository uses a self-explanatory Person
type defined in a shared location, lib/model/person.go:
package model
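The rest of person.go is not reproduced in this excerpt. Based on how model.Person is constructed later in the chapter, with Name, Title, and Description fields, a minimal sketch of the type would look roughly like this:

// Person is the data model for a row in the people table
// (a sketch; the field names are taken from how the struct is used later).
type Person struct {
    Name        string
    Title       string
    Description string
}

The repository is implemented in exercise1/people/repository.go, which starts with the following imports: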
import (
    "database/sql"
    "log"

    _ "github.com/go-sql-driver/mysql"

    "github.com/PacktPublishing/Mastering-Distributed-Tracing/Chapter04/go/lib/model"
)
We can see that it imports a driver for MySQL and the model package that defines
the Person struct. Then we have a couple of declarations: the connection URL for
the MySQL database (where root:mysqlpwd refers to the username and password)
and the Repository type:
const dburl = "root:mysqlpwd@tcp(127.0.0.1:3306)/chapter04"
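The Repository type and its constructor are not shown in this excerpt. A minimal sketch that matches how the repository is used elsewhere in the chapter (people.NewRepository(), repo.Close(), repo.GetPerson()), assuming it simply wraps a *sql.DB handle, could be:

// Repository retrieves people from the MySQL database (sketch).
type Repository struct {
    db *sql.DB
}

// NewRepository opens a connection pool to the chapter04 database.
func NewRepository() *Repository {
    db, err := sql.Open("mysql", dburl)
    if err != nil {
        log.Fatalf("Cannot open the database: %v", err)
    }
    return &Repository{db: db}
}

// Close releases the underlying database connections.
func (r *Repository) Close() {
    r.db.Close()
}

The GetPerson() method, whose opening lines are also omitted here, runs the query select title, description from people where name = ? against the database and then iterates over the result set: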
    for rows.Next() {
        var title, descr string
        err := rows.Scan(&title, &descr)
        if err != nil {
            return model.Person{}, err
        }
        return model.Person{
            Name:        name,
            Title:       title,
            Description: descr,
        }, nil
    }
    return model.Person{
        Name: name,
    }, nil
}
This is all a pretty standard use of Go's database/sql module. With this out of the
way, let's take a look at the main application code in hello.go. We need to import
the people package to get access to the repository:
package main
import (
    "log"
    "net/http"
    "strings"

    "github.com/PacktPublishing/Mastering-Distributed-Tracing/Chapter04/go/exercise1/people"
)
The main function creates the repository and starts an HTTP server listening
on port 8080, and serving a single endpoint, sayHello:
var repo *people.Repository

func main() {
    repo = people.NewRepository()
    defer repo.Close()
    http.HandleFunc("/sayHello/", handleSayHello)
    log.Print("Listening on http://localhost:8080/")
    log.Fatal(http.ListenAndServe(":8080", nil))
}
The function SayHello() uses the repository to load the Person object by name,
and formats the greeting using the information it may have found:
// SayHello creates a greeting for the named person.
func SayHello(name string) (string, error) {
    person, err := repo.GetPerson(name)
    if err != nil {
        return "", err
    }
    return FormatGreeting(
        person.Name,
        person.Title,
        person.Description,
    ), nil
}
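The FormatGreeting() function itself is not reproduced here. Judging from the sample outputs above ("Hello, Felonius Gru! Where are the minions?") and from its signature, which appears again later in the chapter, a plausible sketch is:

// FormatGreeting builds the greeting from the name and the optional
// title and description (a sketch; the exact formatting is an assumption).
func FormatGreeting(name, title, description string) string {
    greeting := "Hello, "
    if title != "" {
        greeting += title + " "
    }
    greeting += name + "!"
    if description != "" {
        greeting += " " + description
    }
    return greeting
}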
Hello application in Java
The Java version of the application is built with Maven and Spring Boot. Part of its project configuration is commented out; please do not uncomment it until Exercise 6. Since all exercises are defined
in the same module, we have multiple classes that define the main() function,
and therefore we must tell Spring which main class we want to run, like this:
$ ./mvnw spring-boot:run -Dmain.class=exercise1.HelloApp
[... a lot of logs ...]
INFO 57474 --- [main] exercise1.HelloApp: Started HelloApp in 3.844
seconds
Compared to Go and Python apps, both Maven and Spring generate a lot of logs.
The last log should be stating that the application has started. We can test it the
same way as the others:
$ curl http://localhost:8080/sayHello/Gru
Hello, Felonius Gru! Where are the minions?%
The source code of the application consists of two packages. One of them (lib.people) is shared across all exercises and defines the Person class as the data model, along with the data access interface PersonRepository:
@Entity
@Table(name = "people")
public class Person {
    @Id
    private String name;

    @Column(nullable = false)
    private String title;

    @Column(nullable = false)
    private String description;

    public Person() {}
The Person class also includes getters for its members, which are omitted here. Together, the Person class and the PersonRepository interface allow us to use Spring Data to access the database.
The database connection details are defined in src/main/resources/application.
properties:
spring.jpa.hibernate.ddl-auto=none
spring.datasource.url=jdbc:mysql://localhost:3306/chapter04
spring.datasource.username=root
spring.datasource.password=mysqlpwd
The main application code can be found in the exercise1 package. It contains
a very simple main class, HelloApp, where we point Spring to the lib.people
package to auto-discover the data model and the repository interface:
@EnableJpaRepositories("lib.people")
@EntityScan("lib.people")
@SpringBootApplication
public class HelloApp {
public static void main(String[] args) {
SpringApplication.run(HelloApp.class, args);
}
}
The HTTP endpoint itself is implemented in the HelloController class in the same package, which asks Spring to auto-inject the repository and defines a handler method:

@Autowired
private PersonRepository personRepository;

@GetMapping("/sayHello/{name}")
public String sayHello(@PathVariable String name) {
    Person person = getPerson(name);
    String response = formatGreeting(person);
    return response;
}
It defines a single endpoint sayHello and calls two functions to get the person
by name and to format the greeting for that person:
private Person getPerson(String name) {
    Optional<Person> personOpt = personRepository.findById(name);
    if (personOpt.isPresent()) {
        return personOpt.get();
    }
    return new Person(name);
}
Hello application in Python
The application consists of two files at this point. The database.py module contains the basic code to read from the database using the SQLAlchemy ORM framework:
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from sqlalchemy.schema import Column
from sqlalchemy.types import String
db_url = 'mysql+pymysql://root:mysqlpwd@localhost:3306/chapter04'
engine = create_engine(db_url, echo=False)
Session = sessionmaker(bind=engine)
session = Session()
Base = declarative_base()
class Person(Base):
    __tablename__ = 'people'
    name = Column(String, primary_key=True)
    title = Column(String)
    description = Column(String)

    @staticmethod
    def get(name):
        return session.query(Person).get(name)
The second file, hello.py, defines the Flask application and the HTTP handler:

app = Flask('py-1-hello')

@app.route("/sayHello/<name>")
def say_hello(name):
    person = get_person(name)
    resp = format_greeting(
        name=person.name,
        title=person.title,
        description=person.description,
    )
    return resp
def get_person(name):
    person = Person.get(name)
    if person is None:
        person = Person()
        person.name = name
    return person

if __name__ == "__main__":
    app.run(port=8080)
Exercise summary
In this first exercise, we familiarized ourselves with the source code of the
"Hello" application and learned how to run it. In the following exercises,
we will instrument this program with the OpenTracing API, and also refactor
it into multiple microservices.
Tracers are expected to be used as singletons: one tracer per application. There are
rare scenarios when an application needs more than one tracer. For example, as we
will see in Chapter 7, Tracing with Service Mesh, service meshes can create spans on
behalf of different applications, which may require multiple instances of the tracer.
The exact mechanism for ensuring a singleton instance of a tracer is language- and framework-specific. The OpenTracing API libraries usually provide a mechanism for defining a global tracer using a global variable; however, applications are not required to use that and can instead rely on dependency injection.
Jaeger libraries in different languages have a convention that they provide a
Configuration class that can act as a builder for the Tracer. By default, the builder
creates a production-ready tracer, including a sampling strategy that only samples
approximately one in 1,000 traces. For our purposes, we would rather have all traces
sampled, so we override the sampling strategy by instructing the Configuration
class to use a "const" strategy, meaning that it always makes the same decision,
with a parameter param=1, which translates to that decision always being "yes"
(this sampler treats the parameter as a Boolean value). Another small tweak we
make to the defaults is to instruct the reporter to write a log entry for all finished
spans. The reporter is an internal component of the Jaeger tracer that is responsible
for exporting finished spans out of the process to the tracing backend.
Configuration classes expect us to provide a service name, which is used by the
tracing backend to identify the service instance in the distributed call graph. In this
exercise, we only have a single service, but in the later exercises, we will split it into
multiple microservices, and by giving them different names, we will be able to more
clearly see the shape of the distributed execution call graph. We will be using
a simple notation to name the services:
{language}-{exercise number}-{microservice name}
For example, the Go service in this exercise will be called "go-2-hello." This naming
scheme will allow us to clearly separate services in the tracing UI.
Create a tracer in Go
Creating a tracer will be something we will need to do across all exercises, so rather
than repeating that code under each exercise, I placed it into a shared module under
$CH04/go/lib/tracing, in the file called init.go. Let's look at the imports:
package tracing
import (
    "io"
    "log"

    opentracing "github.com/opentracing/opentracing-go"
    jaeger "github.com/uber/jaeger-client-go"
    config "github.com/uber/jaeger-client-go/config"
)
Here we're importing the opentracing-go module that defines the official
OpenTracing API for Go. We're renaming it to opentracing, which is, strictly
speaking, not necessary, since that is the package name it has anyway, but we
just want to be more explicit, since its import path ends with a different name.
We also import two modules from the Jaeger client library that implement the
OpenTracing API. The config module is used to create the tracer parameterized
with some settings. The jaeger module is only needed because we use the jaeger.
StdLogger type to bind the Jaeger tracer to the standard library logger, which we
used in the rest of the program. The main function looks like this:
// Init returns an instance of Jaeger Tracer that samples 100%
// of traces and logs all spans to stdout.
func Init(service string) (opentracing.Tracer, io.Closer) {
    cfg := &config.Configuration{
        Sampler: &config.SamplerConfig{
            Type:  "const",
            Param: 1,
        },
        Reporter: &config.ReporterConfig{
            LogSpans: true,
        },
    }
    tracer, closer, err := cfg.New(
        service,
        config.Logger(jaeger.StdLogger),
    )
    if err != nil {
        log.Fatalf("ERROR: cannot init Jaeger: %v", err)
    }
    return tracer, closer
}
Now we can add a call to this function to our main application in the hello.go
file. First, we add the imports:
import (
    "log"
    "net/http"
    "strings"

    opentracing "github.com/opentracing/opentracing-go"

    "github.com/PacktPublishing/Mastering-Distributed-Tracing/Chapter04/go/exercise2/people"
    "github.com/PacktPublishing/Mastering-Distributed-Tracing/Chapter04/go/lib/tracing"
)
Then we declare a global variable called tracer (so that we don't have to pass
it to functions), and initialize it by calling tracing.Init() in main():
var repo *people.Repository
var tracer opentracing.Tracer

func main() {
    repo = people.NewRepository()
    defer repo.Close()
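The rest of main() is not shown above. Reusing the Init() helper, the server setup from Exercise 1, and the go-2-hello service name from the naming scheme, it would continue roughly like this (a sketch):

    t, closer := tracing.Init("go-2-hello")
    defer closer.Close()
    tracer = t

    http.HandleFunc("/sayHello/", handleSayHello)
    log.Print("Listening on http://localhost:8080/")
    log.Fatal(http.ListenAndServe(":8080", nil))
}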
Create a tracer in Python
The equivalent helper in Python is the init_tracer() function in the lib.tracing module:

def init_tracer(service):
    logging.getLogger('').handlers = []
    logging.basicConfig(format='%(message)s', level=logging.DEBUG)
    config = Config(
        config={
            'sampler': {
                'type': 'const',
                'param': 1,
            },
            'logging': True,
            'reporter_batch_size': 1,
        },
        service_name=service,
    )
    return config.initialize_tracer()
The first thing we do here is to configure Python's logging. It's probably not the best
place, but the tracer is the only component using it, so we do it here for convenience.
Then we see the already-familiar configuration overrides. The additional parameter
reporter_batch_size=1 is used to instruct the tracer to flush the spans immediately
without buffering them.
With this utility function in place, we just need to call it from the main hello.py:
from flask import Flask
from .database import Person
from lib.tracing import init_tracer
app = Flask('py-2-hello')
init_tracer('py-2-hello')
Note that we're passing the service name py-2-hello to the tracer.
When the sampling decision is "no", some of the API calls to the span may be
short-circuited, for example, trying to annotate the span with tags will be a no-
op. However, the trace ID, span ID, and other metadata will still be propagated
alongside the distributed execution, even for unsampled traces. We will go into
more detail about sampling in Chapter 8, All About Sampling, where we will see that
the so-called "upfront" or "head-based sampling" implemented by Jaeger is not the
only possible sampling technique, although it is the prevailing one in the industry.
Since we want to have a new trace for each HTTP request handled by our
application, we will add the instrumentation code to the HTTP handler functions.
Every time we start a span, we need to give it a name, called the "operation name"
in OpenTracing. The operation names help with later analysis of the traces, and can
be used for the grouping of traces, building latency histograms, tracking endpoint
service level objectives (SLOs), and so on. Now, because of this frequent use in
aggregations, operation names should never have high cardinality. For example, in
the Hello application the name of the person is encoded as the HTTP path parameter,
for example, /sayHello/Margo. Using this exact string as the operation name would
be a bad idea because the service may be queried with thousands of different names,
and each will result in a unique span name, which will make any aggregate analysis
of the spans, such as investigation of the endpoint latency profile, very difficult.
If the application uses a web framework, as the Java and Python examples do, the framework typically defines a pattern for the URLs, often referred to as a route, for example, the "/sayHello/{name}" pattern in the Java example:
@GetMapping("/sayHello/{name}")
public String hello(@PathVariable String name) { ... }
The route pattern is fixed and does not depend on the actual value of the {name}
parameter, so it would be a good option to use as the span operation name. In this
exercise, however, we will be using the string "say-hello" as the operation name,
for consistency across languages.
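To make the cardinality concern concrete, here is a small Go sketch (not taken from the exercises) that contrasts the two choices inside an HTTP handler. It assumes the package-level tracer variable used in this exercise, and recording the full URL as an http.url tag is an assumption based on the semantic conventions discussed later in this chapter:

func handleSayHelloBad(w http.ResponseWriter, r *http.Request) {
    // Bad: unbounded cardinality, one operation name per person.
    span := tracer.StartSpan(r.URL.Path) // e.g. "/sayHello/Margo"
    defer span.Finish()
    // ...
}

func handleSayHelloGood(w http.ResponseWriter, r *http.Request) {
    // Good: a fixed, low-cardinality operation name; the full URL
    // can still be recorded as a tag for later inspection.
    span := tracer.StartSpan("say-hello")
    span.SetTag("http.url", r.URL.String())
    defer span.Finish()
    // ...
}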
As we discussed earlier, a span is a unit of work with start and end timestamps.
In order to capture the end timestamp of the span, the instrumentation code must
call the Finish() method on it. If the Finish() method is not called, the span may
not be reported to the tracing backend at all because quite often, the only reference
to the span object is in the function that created it. Some tracer implementations may
provide additional tracking capabilities and still report unfinished spans, but it is not
a guaranteed behavior. Therefore, the OpenTracing specification requires an explicit
call to the Finish() method.
Start a span in Go
In Go, the HTTP handler function is handleSayHello. We can start a span right
at the beginning:
func handleSayHello(w http.ResponseWriter, r *http.Request) {
    span := tracer.StartSpan("say-hello")
    defer span.Finish()
To ensure that the span is finished when the handler returns (either successfully
or with an error), we invoke Finish() with the defer keyword right after starting it.
Start a span in Java
To ensure that the span is always finished, even in the case of an exception, we
are using a try-finally statement. In order to get access to the tracer singleton,
we need to declare it as auto-injected by the Spring framework:
@Autowired
private Tracer tracer;
@app.route("/sayHello/<name>")
def say_hello(name):
with opentracing.tracer.start_span('say-hello'):
person = get_person(name)
resp = format_greeting(
name=person.name,
title=person.title,
description=person.description,
)
return resp
The primary objective of a span is to tell a story about the operation it represents.
Sometimes a name and a couple of timestamps are enough. In other cases, we
may need more information if we want to analyze the behavior of our system.
As developers, we know best what might be useful in the context: the remote address
of the server we called; the identity of a database replica we accessed; the details and
the stack trace of an exception that was encountered; the account number we tried
to access; the number of records we retrieved from storage, and so on. The rules here
are similar to logging: record what you think may be useful and control the overhead
with sampling. If you find some crucial piece of data is missing, add it! This is what
instrumentation is all about.
The OpenTracing APIs provide two facilities for recording custom information in the
spans: "tags" and "logs". Tags are key-value pairs that apply to the span as a whole,
and are often used to query and filter trace data. For example, if a span represents
an HTTP call, then recording the HTTP method, such as GET or POST, is best done
with a tag.
Logs, on the other hand, represent point-in-time events. In that sense, they are
very closely related to traditional log statements that we often use throughout the
program, only contextualized to a single span. OpenTracing logs are also structured,
that is, represented as a timestamp and a nested collection of key-value pairs. Span
logs can be used to record additional events that occur during the span's lifetime, if we do not want to model them as their own nested spans. For example, if we have a single span for an HTTP request, we might want to log some of the lower-level events that happen while the request is being handled, rather than creating a nested span for each of them.
The fact that span logs are called "logs" is often confusing and leads to questions
like "when should I use regular logs and when should I use span logs?" There is
no one-size-fits-all answer to that. Some logs simply do not make sense as part
of the span; we have seen the examples in the HotROD demo, where the custom
logging API reserved the term "background logs" for those events that occur as
part of the application lifecycle, rather than as part of a specific request execution.
For logs that are related to a single request, there are other aspects to consider.
Most tracing systems are not designed as log aggregation services, so they may
not provide the same capabilities as, say, the Elasticsearch–Logstash–Kibana
(ELK) stack. The sampling generally works differently for logs and traces:
logs are sometimes sampled (or throttled) per-process, while traces are always
sampled per-request (distributed execution). In retrospect, it would be better if the
OpenTracing specification used the term "events" instead of logs, until the industry
decides how to deal with both. In fact, OpenTracing recommends that one of the key-
value pairs in every span log is a pair with key = "event" that describes the overall
event being logged, with other attributes of the event provided as additional fields.
What kind of information might we need to record in the span in case we are
troubleshooting some issue in our Hello application? It might be useful to know
what response string (formatted greeting) the service returns, and maybe the
details loaded from the database. There's no clear-cut answer whether these should
be recorded as tags or logs (we will see more relevant examples later). We chose
to record the response string as a tag, since it corresponds to the span overall, and
used a log for the contents of the Person object loaded from the database because
it is structured data and a span log allows multiple fields.
We also have another special case where we may want to add custom annotations
to the span: the error handling. In all three languages, the query to the database
may fail for one reason or another, and we will receive an error of some sort. When
we are troubleshooting an application in production, it is very useful if it can record
these errors in the correct span.
The tag and log APIs do not enforce any specific semantics or meaning on the
values recorded as key-value pairs. This allows the instrumentation to easily add
completely custom data. For example, in our case we can store the response string
as a tag named "response". However, there are data elements that are frequently
recorded in the spans with the same meaning, such as HTTP URLs, HTTP status
codes, database statements, errors/exceptions, and so on. To ensure that all disparate
instrumentations still record these common concepts in a consistent way, the
OpenTracing project provides a set of "standard tags" and "fields", and a guideline on
the exact way of using them, in a document called "Semantic Conventions" [4].
The tag and field names prescribed in this document are usually available as
constants exposed by the respective OpenTracing APIs, and we will be using those
in this exercise to record the error information in a standardized way. In the later
exercise, we will use those constants to record other data, such as HTTP attributes
and a SQL query.
span.SetTag("response", greeting)
w.Write([]byte(greeting))
}
Here, we made three changes. In the error handling branch, we set an error =
true tag on the span, to indicate that the operation failed, and we log the error
using the span.LogFields() method. It is a structured way to pass the log fields
to the span that minimizes memory allocations. We need to import an additional
package that we labeled otlog:
otlog "github.com/opentracing/opentracing-go/log"
In the case of a successful request, just before we write the response, we store the
greeting string in the span tag called response. We also change the signature of
the SayHello() function and pass the span to it so that it can do its own annotation,
like this:
func SayHello(name string, span opentracing.Span) (string, error) {
person, err := repo.GetPerson(name)
if err != nil {
return "", err
}
    span.LogKV(
        "name", person.Name,
        "title", person.Title,
        "description", person.Description,
    )
    ...
}
Here, we are using a different span log function, LogKV, which is conceptually
still the same, but takes the arguments as an even-length list of values, that is,
alternating (key, value, key, value, …) pairs.
return response;
} finally {
span.finish();
}
}
The Java OpenTracing API for span logs is rather inefficient at this time, requiring
us to create a new instance of a map. Perhaps in the future a new API will be added
with alternating key/value pairs.
We also want to pass that span to the get_person() function, where we can log
the person's information to that span:
def get_person(name, span):
    person = Person.get(name)
    if person is None:
        person = Person()
        person.name = name
    span.log_kv({
        'name': person.name,
        'title': person.title,
        'description': person.description,
    })
    return person
Exercise summary
If we now run the program (for example, go run ./exercise2/hello.go)
and query it with curl, we will see a span that looks slightly more interesting
(Figure 4.3). The relative timing of the log entry is 451.85 ms, which is only slightly
earlier than the end of the span at 451.87 ms, which is not surprising, since the
majority of the time is spent in the database query.
In this second exercise, we have learned that in order to use the OpenTracing API,
we need to instantiate a concrete tracer, just like you would need to use a concrete
logging implementation if you wanted to use SLF4J logging API in Java. We wrote
instrumentation that starts and finishes a span representing the work done by the
HTTP handler of the Hello application. Next, we added custom annotations to the
span in the form of tags and logs.
Exercise 3 - tracing functions and passing context
Let's change main() so that it no longer has its own global tracer variable, and instead initializes the one in the OpenTracing library:
// var tracer opentracing.Tracer – commented out
func main() {
repo = people.NewRepository()
defer repo.Close()
...
}
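The elided part of main() (the "..." above) is where the tracer is created and registered as the global tracer. With the opentracing-go API this is done via opentracing.SetGlobalTracer(); a sketch, assuming the service is named go-3-hello according to our naming scheme:

    tracer, closer := tracing.Init("go-3-hello")
    defer closer.Close()
    // Register the tracer so that opentracing.GlobalTracer()
    // returns it from anywhere in the program.
    opentracing.SetGlobalTracer(tracer)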
Next, inside the repository's GetPerson() method, we start a new span via the global tracer:

    span := opentracing.GlobalTracer().StartSpan(
        "get-person",
        opentracing.Tag{Key: "db.statement", Value: query},
    )
    defer span.Finish()
Notice that we record the SQL query we are about to execute as a span tag.
Let's make a similar change in the FormatGreeting() function:
func FormatGreeting(name, title, description string) string {
    span := opentracing.GlobalTracer().StartSpan("format-greeting")
    defer span.Finish()
Even though the get-person span represents a call to the database, we do not
have access to a lot of information about that call, since it is handled by the ORM
framework for us. Later in this chapter, we will see how we can get instrumentation
to that level as well, for example, to save the SQL statement in the span.
In the Python version, we could also add a span to the Person.get() method; however, it would be more useful if we could have access to the SQL query, which we don't have because it's hidden by the ORM framework. In Exercise 6, we will see how we can get that.
The instrumentation did create three spans, as expected, and the tracer logged each of them as it was reported. The long hexadecimal strings in those log lines are the representation of the span context for each span, in the format trace-id:span-id:parent-id:flags. The important part is the first segment, representing the trace ID: they are all different! Instead of creating a single trace with three spans,
we created three independent traces. Well, rookie mistake. What we neglected to
do was to establish causal relationships between the spans, so that the tracer would
know that they belong to the same trace. As we discussed earlier in this chapter,
these relationships are represented by span references in OpenTracing.
There are two types of span references currently defined by the OpenTracing
specification: "child-of" and "follows-from". For example, we can say that span
B is a "child-of" span A, or it "follows-from" span A. In both cases, it means that span
A is the ancestor of span B, that is, A happens-before B. The difference between the
two relationships, as defined by the OpenTracing specification, is that if B is child-of
A, then A depends on the outcome of B. For example, in the Hello application, the span
"say-hello" depends on the outcome of spans "get-person" and "format-greeting", so
the latter should be created with child-of references to the former.
There are cases, however, when the span that happens-before does not depend
on the outcome of the descendent span. One classic example is when the second
span is a fire-and-forget type of operation, such as an opportunistic write to a cache.
Another classic example is a producer-consumer pattern in the messaging systems,
where the producer generally has no idea when and how the consumer will process
the message, yet the producer's span (the act of writing to the queue) has the causal
link to the consumer span (the act of reading from the queue). These kinds of
relationships are modeled with follows-from span references.
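In the Go API, both kinds of references are expressed as options passed to StartSpan(). As a quick illustration (not part of the Hello application), a fire-and-forget operation could be linked to its originating span with a follows-from reference like this:

parent := opentracing.GlobalTracer().StartSpan("enqueue-message")
defer parent.Finish()

// The processing span is linked with follows-from rather than child-of,
// because the producer does not depend on its outcome.
processing := opentracing.GlobalTracer().StartSpan(
    "process-message",
    opentracing.FollowsFrom(parent.Context()),
)
defer processing.Finish()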
    span = opentracing.GlobalTracer().StartSpan(
        "get-person",
        opentracing.ChildOf(span.Context()),
        opentracing.Tag{Key: "db.statement", Value: query},
    )
    defer span.Finish()
    ...
}
    span.LogKV(...)
    return FormatGreeting(
        person.Name,
        person.Title,
        person.Description,
        span,
    ), nil
}
In the Java version, we use the fluent syntax of the span builder. The method asChildOf() needs a span context, but it has an overload that accepts a full span and extracts the context from it. The changes to the sayHello() function are trivial, so we will omit them here. The resulting code can be found in the exercise3a package.
In Python, let's change the calls to start_span() to pass a child-of reference:

def get_person(name, span):
    with opentracing.tracer.start_span(
        'get-person', child_of=span,
    ) as span:
        person = Person.get(name)
        ...
        return person
Note that the child_of argument generally expects a span context, but the Python
library has a convenient fallback that allows us to pass the full span, and the library
will automatically retrieve the span context from it. The resulting code can be found in the exercise3a directory.
Figure 4.4: A three-span trace that includes spans for the worker functions
Each trace contains three spans, as expected, and we can confirm our earlier
suspicion that the database lookup is indeed responsible for the majority of the
time spent by the system on the request. We can drill down into the database span
and see the SQL query as a tag (but only in the Go version where we did not use
the ORM framework).
In this step, we will amend our function instrumentation to avoid passing spans
explicitly as parameters and use the in-process propagation mechanism from
OpenTracing instead. Since Go language is a special case, we will start with Python
and Java. In both Python and Java, the OpenTracing APIs have introduced the notion
of an "active span," "a scope," and a "scope manager." The easiest way to think about
them is to imagine an application where each request executes in its own thread
(a very common model). When the server receives a new request, it starts a span
for it, and makes that span "active," meaning that any other instrumentation that,
for example, wants to add annotations to it can access the active span directly from
the tracer (just for that thread). If the server executes some long operation, or maybe
a remote call, it would create another span as a child-of the currently active span
and make the new span "active." Effectively, the old active span is pushed onto the
stack while the child span is executing. Once the child span is finished, its parent is
popped from the stack and made active again. The scope manager is the component
responsible for managing and storing these active spans.
The management of the active spans is done through scopes. When we ask the scope
manager to activate a span, we get back a scope object that contains the span. Each
scope can be closed once, which removes it from the top of the stack and makes the
previous scope (and its span) active. Scopes can be configured to auto-finish the
span when the scope is closed, or to leave the span open; which one of these modes
is used depends on the threading model of the code. For example, if we are using
futures-based asynchronous programming, and we make a remote call, then the
result handler for that call will most likely be handled in another thread. So, the
scope that was created on the thread that started the request should not finish the
span. We will discuss this in Chapter 5, Instrumentation of Asynchronous Applications.
One very convenient feature of using active spans is that we no longer need to
explicitly tell the tracer that the new span should be the child-of the currently active
span. If we do not explicitly pass another span reference, the tracer will automatically
establish that relationship, assuming there is an active span at the time. This makes
the code for starting spans simpler.
Writing instrumentation code that is fully relying on scope managers has one
danger. If the application flow is not correctly instrumented, it is possible to arrive at
a certain place in the program where the stack of scopes stored in the scope manager
is empty, and statements that unconditionally expect an active span to be present,
such as tracer.activeSpan().setTag(k, v), might raise a null pointer exception
or a similar error. Defensive coding and checking for null would address this issue,
but in a well-structured application with clear flow, it may not be a problem. The
examples I use in this chapter do not include these null checks because they are
structured so that it is not possible to find oneself in a situation where an expected
active span is null. Another recommendation is to keep code that accesses the scope
close to the code that starts it, thus also avoiding the null situation altogether.
Since the active span is directly accessible from the tracer, we no longer need to
pass it to the other two functions. However, when we want to set the response
tag, we need to retrieve the span from its scope.
The other two functions are changed in a similar way. The main difference is that
we get a scope back, and we do not have to specify the parent span anymore:
def get_person(name):
    with opentracing.tracer.start_active_span(
        'get-person',
    ) as scope:
        person = Person.get(name)
        if person is None:
            person = Person()
            person.name = name
        scope.span.log_kv({
            'name': person.name,
            'title': person.title,
            'description': person.description,
        })
        return person
This means that logging an exception or the error status to the span is not possible,
that is, the following code is invalid, even though it would be a lot simpler:
try (Scope scope = tracer.buildSpan("get-person").startActive(true)) {
    ...
} catch (Exception e) {
    // will not compile since the scope variable is not visible
    scope.span().setTag("error", true);
    scope.span().log(...);
}
The new code does not look significantly simpler than before; however, in a real
program there may be many nested levels of functions, and not having to pass
the span object through all of them ends up saving a lot of typing and API churn.
On the other hand, Go is one of the most popular languages for developing cloud-
native software. In-process context propagation is not only useful for distributed
tracing, but for other techniques, such as implementing RPC deadlines, timeouts,
and cancellations, which are important for high-performing applications. To address
this issue, the Go standard library provides a standard module, context, which
defines an interface called context.Context, used as a container to hold and pass
the in-process request context. If you are not familiar with this, you may want to read
the blog post that first introduced the Context type (https://blog.golang.org/
context). Here's an interesting quote from this blog:
"At Google, we require that Go programmers pass a Context parameter as the first argument to every function on the call path between incoming and outgoing requests."
So, Go's solution is to ensure that the applications explicitly pass the Context
object between all function calls. This is very much in the spirit of the "no magic"
principle because the program itself is fully in charge of propagating the context.
It completely solves the problem of in-process propagation for the purpose of
distributed tracing. It is certainly more invasive than in Java or Python, and might
seem like we just traded passing one thing (span) for another (context). However,
since it is the language standard, and Go's Context has much wider applications
than just distributed tracing, it is not a tough sell in an organization where many Go
applications are already written in the style recommended by the standard library.
The OpenTracing API for Go recognized this early on and provided helper functions
that simplify passing the current span as part of the Context object and starting
new spans from it. Let's use that to clean up our Hello application. The first span
is created in the HTTP handler function and we need to store that span in the
context to be passed down, by calling ContextWithSpan(). The context itself is
already provided by the http.Request object. Then we pass the new ctx object
to the SayHello() function as the first argument, instead of passing the span:
func handleSayHello(w http.ResponseWriter, r *http.Request) {
    span := opentracing.GlobalTracer().StartSpan("say-hello")
    defer span.Finish()
    ctx := opentracing.ContextWithSpan(r.Context(), span)
We change the function SayHello() similarly, by passing the context to both the
GetPerson() and the FormatGreeting() functions. By convention, we always
put ctx as the first argument. Since we no longer have a reference span to call
the LogKV() method, we call another OpenTracing helper, SpanFromContext(),
to retrieve the current span:
func SayHello(ctx context.Context, name string) (string, error) {
    person, err := repo.GetPerson(ctx, name)
    if err != nil {
        return "", err
    }
    opentracing.SpanFromContext(ctx).LogKV(
        "name", person.Name,
        "title", person.Title,
        "description", person.Description,
    )
    return FormatGreeting(
        ctx,
        person.Name,
        person.Title,
        person.Description,
    ), nil
}
Now let's change the GetPerson() function to work with the context:
func (r *Repository) GetPerson(
    ctx context.Context,
    name string,
) (model.Person, error) {
    query := "select title, description from people where name = ?"
The final version of the code is available in the exercise3b package. We can verify
that the spans are connected correctly by running the program and checking that
all trace IDs are the same:
$ go run ./exercise3b/hello.go
Initializing logging reporter
Listening on http://localhost:8080/
Reporting span 158aa3f9bfa0f1e1:1c11913f74ab9019:158aa3f9bfa0f1e1:1
Reporting span 158aa3f9bfa0f1e1:1d752c6d320b912c:158aa3f9bfa0f1e1:1
Reporting span 158aa3f9bfa0f1e1:158aa3f9bfa0f1e1:0:1
Exercise summary
In this exercise, we have increased our instrumentation coverage to three
spans representing three functions in the application. We have discussed how
OpenTracing allows us to describe the causal relationships between spans by using
span references. Also, we used the OpenTracing scope managers and Go's context.
Context mechanisms to propagate the context in-process between function calls.
In the next exercise, we will replace the two inner functions with RPC calls to other
microservices and will see what additional instrumentation that requires in order
to support inter-process context propagation.
The Big Brother service will listen on HTTP port 8081 and serve a single endpoint, getPerson. It will return the information about the person as a JSON string. We could test that service like this:
$ curl http://localhost:8081/getPerson/Gru
{"Name":"Gru","Title":"Felonius","Description":"Where are the
minions?"}
The Formatter service will listen on HTTP port 8082 and serve a single endpoint
formatGreeting. It will take three parameters: name, title, and description encoded
as URL query parameters, and respond with a plain text string. Here's an example
of calling it:
$ curl 'http://localhost:8082/formatGreeting?name=Smith&title=Agent'
Hello, Agent Smith!
Since we are now using three microservices, we will need to start each of them
in a separate Terminal window.
Microservices in Go
The refactored code for the Hello application broken down into three microservices
is available in package exercise4a. I will not reproduce all of it here, since most
of it is the same as before, just moved around. The main hello.go application
is refactored to have two local functions, getPerson() and formatGreeting(),
which make HTTP calls to the Big Brother and Formatter services respectively.
They are using a helper function, xhttp.Get(), from a shared module, lib/http/,
to execute HTTP requests and handle errors.
http.HandleFunc("/formatGreeting", handleFormatGreeting)
log.Print("Listening on http://localhost:8082/")
log.Fatal(http.ListenAndServe(":8082", nil))
}
Microservices in Java
Package exercise4a contains the refactored code from Exercise 3 - tracing functions and passing context, which reimplements the Hello application as three microservices.
The class HelloController still has the same getPerson() and formatGreeting()
functions, but they now execute HTTP requests against two new services,
implemented in packages exercise4a.bigbrother and exercise4a.formatter,
using Spring's RestTemplate that we ask to be auto-injected:
@Autowired
private RestTemplate restTemplate;
Each sub-package has an App class (BBApp and FApp) and a controller class. All App
classes instantiate their own tracer with a unique name, so that we can separate the
services in the traces. The JPA annotations are moved from HelloApp to BBApp, as
it is the only one accessing the database. Since we need the two new services to run
on different ports, they each override the server.port environment variable in the
main() function:
@EnableJpaRepositories("lib.people")
@EntityScan("lib.people")
@SpringBootApplication
public class BBApp {
@Bean
public io.opentracing.Tracer initTracer() {
...
return new Configuration("java-4-bigbrother")
.withSampler(samplerConfig)
.withReporter(reporterConfig)
.getTracer();
}
The two new controllers are similar to the HelloController and contain the code
that it previously had in the getPerson() and formatGreeting() functions. We
also added new spans at the top of the HTTP handler functions.
The new services can be run in separate Terminal windows similar to the main one:
$ ./mvnw spring-boot:run -Dmain.class=exercise4a.bigbrother.BBApp
$ ./mvnw spring-boot:run -Dmain.class=exercise4a.formatter.FApp
Microservices in Python
The refactored code for the Hello application broken down into three microservices
is available in package exercise4a. I will not reproduce all of it here, since most
of it is the same as before, just moved around. The functions get_person()
and format_greeting() in hello.py are changed to execute HTTP requests
against two new microservices, which are implemented by the bigbrother.
py and formatter.py modules. They both create their own tracers; for example,
the Big Brother code looks like this:
from flask import Flask
import json
from .database import Person
from lib.tracing import init_tracer
import opentracing
app = Flask('py-4-bigbrother')
init_tracer('py-4-bigbrother')
@app.route("/getPerson/<name>")
def get_person_http(name):
with opentracing.tracer.start_active_span('/getPerson') as scope:
person = Person.get(name)
if person is None:
person = Person()
person.name = name
scope.span.log_kv({
'name': person.name,
'title': person.title,
'description': person.description,
})
return json.dumps({
'name': person.name,
'title': person.title,
'description': person.description,
})
if __name__ == "__main__":
app.run(port=8081)
The new services can be run in separate Terminal windows similar to the main one:
$ python -m exercise4a.bigbrother
$ python -m exercise4a.formatter
For example, here is the output of the Go versions of the two new services after the application has been exercised with curl:
$ go run exercise4a/bigbrother/main.go
Initializing logging reporter
Listening on http://localhost:8081/
Reporting span e62f4ea4bfb1e34:21a7ef50eb869546:e62f4ea4bfb1e34:1
Reporting span e62f4ea4bfb1e34:e62f4ea4bfb1e34:0:1
$ go run exercise4a/formatter/main.go
Initializing logging reporter
Listening on http://localhost:8082/
Reporting span 38a840df04d76643:441462665fe089bf:38a840df04d76643:1
Reporting span 38a840df04d76643:38a840df04d76643:0:1
$ curl http://localhost:8080/sayHello/Gru
Hello, Felonius Gru! Where are the minions?
The design of the Inject()/Extract() API had an interesting history because it had to work with several assumptions:
• It could not assume any knowledge about the contents of the metadata used
by the tracing implementation, since virtually all existing tracing systems use
different representations.
• It could not assume any knowledge of the serialized format of the metadata
because again, even systems with conceptually very similar metadata
(for example, Zipkin, Jaeger, and Stackdriver) use very different formats
for transmitting it over the wire.
• It could not assume any knowledge about the transmission protocol and
where in that protocol the metadata is encoded. For example, it is common
to pass metadata as plain text headers when using HTTP, but when
transmitting it via Kafka, or via custom transport protocols used by many
storage systems (for example, Cassandra), the plain text HTTP format might
not be suitable.
• It could not require tracers to be aware of all the different transmission
protocols either, so that the teams that maintain the tracers could focus on
a narrow and well-defined scope without worrying about all possible custom
protocols that may exist.
OpenTracing was able to solve this by abstracting most of these concerns and
delegating them to the tracer implementation. First, the inject/extract methods
operate on the span context interface, which is already an abstraction. To address
the difference in the transports, a notion of a "format" was introduced. It does
not refer to the actual serialization format, but to the type of metadata support
that exists in different transports. There are three formats defined in OpenTracing:
"text-map," "binary," and "http-headers." The latter was a special nod to HTTP being
the prevailing protocol with certain not-so-friendly idiosyncrasies, such as having
case-insensitive header names, and other quirks. Conceptually, the http-headers
format is similar to text-map in that it expects the transport to support a notion of
metadata as a collection of string key-value pairs. The binary format is for use with
protocols where the metadata can only be represented as an opaque sequence of
bytes (for example, in the Cassandra wire protocol).
The second abstraction introduced in the OpenTracing APIs is the notion of a carrier.
The carrier is actually tightly coupled with the format (there are discussions in the OpenTracing project about merging them into a single type) in that it provides a physical
container that the tracer can use to store the metadata according to the chosen
format. For example, for the text-map format, the carrier can be a Map<String,
String> in Java or map[string]string in Go; while the carrier for the binary
format in Go is the standard io.Writer or io.Reader (depending on whether
we are injecting or extracting the span context). The instrumentation that invokes
these OpenTracing APIs to inject/extract metadata always knows which transport
it is dealing with, so it can construct the appropriate carrier, typically tied to the
underlying RPC request object, and let the tracer populate it (inject) or read from
it (extract) through a well-defined carrier interface. This allows for the decoupling
of tracer inject/extract implementation from the actual transport used by the
application.
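As an illustration of how formats and carriers pair up in the Go API (this helper is not part of the Hello application), the same span context can be injected using a map carrier for the text-map format and an io.Writer, such as a bytes.Buffer, for the binary format:

func injectExamples(span opentracing.Span) {
    tracer := opentracing.GlobalTracer()

    // Text-map format: the carrier is a map of string key-value pairs.
    textCarrier := opentracing.TextMapCarrier{}
    tracer.Inject(span.Context(), opentracing.TextMap, textCarrier)

    // Binary format: the carrier is an opaque byte stream.
    var buf bytes.Buffer
    tracer.Inject(span.Context(), opentracing.Binary, &buf)
}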
To use this mechanism in the Hello application, we need to add calls to Inject()
in the main Hello service, and calls to Extract() in the Big Brother and Formatter
services. We will also be starting new spans to represent each outbound HTTP call.
This is customary in distributed tracing, since we do not want to attribute the time
spent in the downstream service to the top-level say-hello span. As a general
rule, every time an application makes a remote call to another service, for example,
to a database, we want to create a new span wrapping just that call.
    opentracing.GlobalTracer().Inject(
        span.Context(),
        opentracing.HTTPHeaders,
        opentracing.HTTPHeadersCarrier(req.Header),
    )
    return xhttp.Do(req)
}
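For context, the Inject() call above is the tail of a small helper, get(), in the main Hello service. A sketch of the full helper, assuming it starts the client-side span with opentracing.StartSpanFromContext() and that xhttp.Do() returns the response body, might look like this:

func get(ctx context.Context, operationName, url string) ([]byte, error) {
    // Start a span that wraps just this outbound HTTP call, as a child
    // of whatever span is already stored in the context.
    span, ctx := opentracing.StartSpanFromContext(ctx, operationName)
    defer span.Finish()

    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        return nil, err
    }
    req = req.WithContext(ctx)

    // Encode the span context into the outgoing request headers.
    opentracing.GlobalTracer().Inject(
        span.Context(),
        opentracing.HTTPHeaders,
        opentracing.HTTPHeadersCarrier(req.Header),
    )
    return xhttp.Do(req)
}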
A couple of new things are happening here. First, we start a new span that wraps
the HTTP request, as discussed. Then, we call the Inject() function on the tracer
and pass it the span context, the desired representation format opentracing.
HTTPHeaders, and the carrier created as a wrapper (adapter) around the HTTP
request headers. What's left is to replace the call sites:
func getPerson(ctx context.Context, name string) (*model.Person,
error) {
url := "http://localhost:8081/getPerson/"+name
res, err := get(ctx, "getPerson", url)
...
func formatGreeting(
    ctx context.Context,
    person *model.Person,
) (string, error) {
    ...
    url := "http://localhost:8082/formatGreeting?" + v.Encode()
    res, err := get(ctx, "formatGreeting", url)
    ...
}
Now we need to change the other two services to extract the encoded metadata
from the request and use it when creating a new span. For example, in the Big
Brother service, it looks like this:
func handleGetPerson(w http.ResponseWriter, r *http.Request) {
    spanCtx, _ := opentracing.GlobalTracer().Extract(
        opentracing.HTTPHeaders,
        opentracing.HTTPHeadersCarrier(r.Header),
    )
    span := opentracing.GlobalTracer().StartSpan(
        "/getPerson",
        opentracing.ChildOf(spanCtx),
    )
    defer span.Finish()
Similar to the calling side, we are using tracer's Extract() method with
HTTPHeaders format and carrier to decode span context from the request headers.
We then start a new span, using the extracted span context to establish a child-of
reference to the caller span. We can apply a similar change to the handler in the
Formatter service. Also, since our top-level Hello service is also an HTTP server,
it is conceivable that it might be called by a client that has already started a trace. So,
let's replace the code that starts the say-hello span with a similar extract-and-start
span code.
We are ignoring the error returned from the Extract() function because it does not affect the rest of our code; for example, if the tracer is not able to extract the context from the request, perhaps because it is malformed or missing, it returns nil and
an error. We can pass nil into the ChildOf() function without ill effect: it would
simply be ignored and the tracer would start a new trace.
In the Java version, the method get() is used for executing outbound HTTP requests. We need to inject the trace context into the request headers, so we use Spring's HttpHeaders object and wrap it in an adapter class, HttpHeaderInjectAdapter, which makes it look like an implementation of OpenTracing's TextMap interface. TextMap is an interface with an iterator() method used by tracer.extract() and a put() method used by tracer.inject(). Since we are only doing the inject() call here, we do not implement the iterator:
private static class HttpHeaderInjectAdapter implements TextMap {
    private final HttpHeaders headers;

    HttpHeaderInjectAdapter(HttpHeaders headers) {
        this.headers = headers;
    }

    @Override
    public Iterator<Entry<String, String>> iterator() {
        throw new UnsupportedOperationException();
    }

    @Override
    public void put(String key, String value) {
        headers.set(key, value);
    }
}
Once the HttpHeaders object is populated by the tracer, via our adapter, with the
headers carrying the trace context, we create HttpEntity, which is used by Spring's
restTemplate to execute the request.
For the inbound HTTP requests implemented by HTTP handlers in the controllers,
we implement a startServerSpan() method that performs the reverse of get():
it extracts the span context from the headers and passes it as the parent when
starting a new server-side span:
protected Span startServerSpan(
    String operationName, HttpServletRequest request)
{
    HttpServletRequestExtractAdapter carrier =
        new HttpServletRequestExtractAdapter(request);
    SpanContext parent = tracer.extract(
        Format.Builtin.HTTP_HEADERS, carrier);
    Span span = tracer.buildSpan(operationName)
        .asChildOf(parent).start();
    return span;
}
This time we are dealing with HTTP headers from HttpServletRequest, which
does not expose an API to get them all at once, but only one at a time. Since we
again need a TextMap interface for the tracer, we use another adapter class:
private static class HttpServletRequestExtractAdapter
    implements TextMap
{
    private final Map<String, String> headers;

    HttpServletRequestExtractAdapter(HttpServletRequest request) {
        this.headers = new LinkedHashMap<>();
        // ... copies all HTTP headers from the request into this map
    }

    @Override
    public Iterator<Entry<String, String>> iterator() {
        return headers.entrySet().iterator();
    }

    @Override
    public void put(String key, String value) {
        throw new UnsupportedOperationException();
    }
}
This time we are only interested in the iterator() method, so the other one always
throws an exception. The constructor takes the servlet request and copies all HTTP
headers into a plain map, which is later used to get the iterator. There are more
efficient implementations, for example, HttpServletRequestExtractAdapter
from the https://github.com/opentracing-contrib/java-web-servlet-
filter library, but they are more complicated, so we provided our own version
that is simpler but less efficient, since it always copies the headers.
With the base controller class in place, we make small changes to the main
controllers, for example:
@RestController
public class HelloController extends TracedController {
    @Autowired
    private RestTemplate restTemplate;

    @GetMapping("/sayHello/{name}")
    public String sayHello(@PathVariable String name,
            HttpServletRequest request) {
        Span span = startServerSpan("/sayHello", request);
        try (Scope s = tracer.scopeManager().activate(span, false)) {
            ...
            return response;
        } finally {
            span.finish();
        }
    }
Similar changes are made to the HTTP handlers in BBController and FController.
In the Python version, in order to inject the span context, we need to get access to the current span, which we can do via the tracer.active_span property. Then we create a carrier, an empty headers dictionary, and use the tracer's inject() method to populate it, asking for the http-headers format. Then we simply pass the resulting dictionary as HTTP headers to the requests module.
On the receiving side, we need to read from those same headers to extract the
context and use it to create a child-of reference. Flask exposes an object request that
we can use to access inbound HTTP headers. Here's how we do it in bigbrother.py:
from flask import request
@app.route("/getPerson/<name>")
def get_person_http(name):
span_ctx = opentracing.tracer.extract(
opentracing.Format.HTTP_HEADERS,
request.headers,
)
with opentracing.tracer.start_active_span(
'/getPerson',
child_of=span_ctx,
) as scope:
person = Person.get(name)
...
We need to make the same changes in the HTTP handlers in hello.py and
formatter.py. After that, all spans emitted by the three microservices should
have the same trace ID.
Figure 4.5: A trace that spans three microservices. Network latency can be clearly observed
when descending from the inner spans in the go-4-hello service to the spans representing
HTTP server endpoints in the downstream microservices.
There is just one thing left to do. Earlier we talked about the OpenTracing
semantic conventions. HTTP requests are such a common pattern that OpenTracing
defines a number of recommended tags that should be used to annotate the spans
representing HTTP requests.
Having those tags allows tracing backends to perform standard aggregations and
analysis, as well as understand the semantics of the trace better. In this last step,
we want to apply the following standard tags:
• span.kind: This tag is used to identify the role of the service in an RPC
request. Most frequently used values are client and server, which we
want to use in our example. Another pair used in messaging systems is
producer and consumer.
• http.url: This tag records the URL requested by the client or served by
the server. The client-side URL is usually more interesting, since the URL
known to the server may have been rewritten by upstream proxies.
• http.method: GET or POST, and so on.
Standard tags in Go
Since we already encapsulated tracing instrumentation for outbound HTTP calls,
it's easy to add the standard tags in one place: the get() function. We just need one
extra import for the extensions package ext (which we rename to ottag) that defines
the tags as objects with strongly typed Set() methods, such as Set(Span, string),
in contrast to the more generic span.SetTag(key, value) method, where the
second parameter can be of any type:
import ottag "github.com/opentracing/opentracing-go/ext"
ottag.SpanKindRPCClient.Set(span)
ottag.HTTPUrl.Set(span, url)
ottag.HTTPMethod.Set(span, "GET")
opentracing.GlobalTracer().Inject(
span.Context(),
opentracing.HTTPHeaders,
opentracing.HTTPHeadersCarrier(req.Header),
)
return xhttp.Do(req)
}
On the server side, we will only add the span.kind tag. The OpenTracing
API provides an option RPCServerOption for the StartSpan() method that
combines setting the span.kind = server tag and adding a child-of reference:
func handleFormatGreeting(w http.ResponseWriter, r *http.Request) {
spanCtx, _ := opentracing.GlobalTracer().Extract(
opentracing.HTTPHeaders,
opentracing.HTTPHeadersCarrier(r.Header),
)
span := opentracing.GlobalTracer().StartSpan(
"/formatGreeting",
ottag.RPCServerOption(spanCtx),
)
defer span.Finish()
...
}
The complete code for the exercise can be found in the package exercise4b.
In Java, the standard tags are defined as constants in the Tags class:
import io.opentracing.tag.Tags;
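By analogy with the Go version, a sketch of how the client-side tags might be set on the span just before injecting the context; the url variable is assumed to hold the request URL as a string:
Tags.SPAN_KIND.set(span, Tags.SPAN_KIND_CLIENT);
Tags.HTTP_URL.set(span, url);
Tags.HTTP_METHOD.set(span, "GET");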
In Python, on the server side, we will only add the span.kind = server tag, right
when we create the new span (we'll only show the snippet from formatter.py):
from opentracing.ext import tags
@app.route("/formatGreeting")
def handle_format_greeting():
span_ctx = opentracing.tracer.extract(
opentracing.Format.HTTP_HEADERS,
request.headers,
)
with opentracing.tracer.start_active_span(
'/formatGreeting',
child_of=span_ctx,
tags={tags.SPAN_KIND: tags.SPAN_KIND_RPC_SERVER},
) as scope:
...
Exercise summary
In this exercise, we have divided the formerly monolithic Hello application into
three microservices. We discussed the inject/extract mechanisms provided by the
OpenTracing APIs for propagating tracing metadata between processes in a variety
of network protocols. We used OpenTracing-recommended standard tags to enhance
the client and server spans with additional semantic annotations.
Using baggage in Go
func FormatGreeting(
ctx context.Context,
name, title, description string,
) string {
span, ctx := opentracing.StartSpanFromContext(
ctx,
"format-greeting",
)
defer span.Finish()
greeting := span.BaggageItem("greeting")
if greeting == "" {
greeting = "Hello"
}
response := greeting + ", "
...
}
Exercise summary
This exercise demonstrates the use of OpenTracing baggage to transparently
pass values throughout the whole distributed call graph, without any changes
to the application APIs. It is a toy example only, and we will discuss serious usage
of baggage in Chapter 10, Distributed Context Propagation.
Exercise 6 – auto-instrumentation
In the previous exercises, we added a lot of manual instrumentation to our
application. One may get the impression that it always takes so much work to
instrument code for distributed tracing. In reality, things are a lot better thanks to
a large amount of already-created, open source, vendor-neutral instrumentation
for popular frameworks that exists in the OpenTracing project under its community
contributions organization: https://github.com/opentracing-contrib/meta.
In this exercise, we will explore some of those modules to minimize the amount
of manual, boiler-plate instrumentation in the Hello application.
package othttp
import (
"log"
"net/http"
"github.com/opentracing-contrib/go-stdlib/nethttp"
"github.com/opentracing/opentracing-go"
)
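The helper wraps the standard HTTP server with the tracing middleware from the go-stdlib contrib package. A minimal sketch of what it might look like; the operation-name logic is an assumption:
// ListenAndServe wraps the default mux with the nethttp middleware so that
// a server-side span is started automatically for every inbound request.
func ListenAndServe(hostPort string, apiName string) {
	mw := nethttp.Middleware(
		opentracing.GlobalTracer(),
		http.DefaultServeMux,
		nethttp.OperationNameFunc(func(r *http.Request) string {
			return "HTTP " + r.Method + " " + apiName
		}),
	)
	log.Fatal(http.ListenAndServe(hostPort, mw))
}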
The main function of the service then simply registers its handler and starts the
server through this helper:
http.HandleFunc("/sayHello/", handleSayHello)
othttp.ListenAndServe(":8080", "/sayHello")
}
The handler itself keeps only the custom parts of the instrumentation, for example:
span.LogFields(otlog.Error(err))
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
span.SetTag("response", greeting)
w.Write([]byte(greeting))
}
We no longer create any spans, but we still have some OpenTracing code. Some of
it is unavoidable, since it's completely custom logic, like setting the response tag on
the span. The error handling code is a bit unfortunate, as it's completely boilerplate
and we would rather avoid writing it (of course, it should be encapsulated in a
helper function in a larger code base). However, it's not completely avoidable, due
to the way Go defines the HTTP API, which does not include any error handling
facilities. If we used some more advanced RPC framework than the standard library,
that is, something that actually allows our handlers to return an error (for example,
gRPC), then this code could also be encapsulated in a standard instrumentation.
We use the nethttp module to instrument the outbound HTTP calls as well,
which also reduces some boilerplate code. You will find the complete code in
the exercise6 package.
Auto-instrumentation in Java
The auto-instrumentation exercise for Java is especially interesting because we can
remove all instrumentation, and still get a very detailed trace showing all HTTP and
database calls.
This is in part due to how the Spring framework is designed, since it allows us to
provide it with instrumentation classes simply by adding a jar file to the class path.
All we need to do is to uncomment the dependency in the pom.xml file:
<dependency>
<groupId>io.opentracing.contrib</groupId>
<artifactId>opentracing-spring-cloud-starter</artifactId>
<version>0.1.13</version>
</dependency>
After that, try running the code from package exercise6 (in separate Terminal
windows):
$ ./mvnw spring-boot:run -Dmain.class=exercise6.bigbrother.BBApp
$ ./mvnw spring-boot:run -Dmain.class=exercise6.formatter.FApp
$ ./mvnw spring-boot:run -Dmain.class=exercise6.HelloApp
$ curl http://localhost:8080/sayHello/Gru
Hello, Felonius Gru! Where are the minions?%
Figure 4.7: An example trace from Java after removing all custom instrumentation
and adding opentracing-spring-cloud-starter jar
The trace still looks very detailed, including our only remaining manual
instrumentation for the get-person span in the Big Brother service. The server-side
spans that represent the HTTP handlers get their names derived from the name of
the handler method; that is, the route /sayHello/{name} is handled by the method
sayHello() and results in the span name sayHello.
The outbound HTTP requests, on the other hand, have a very generic span name,
GET (that is, just the HTTP method), which is not surprising since the client does not
really know how the server calls its endpoints, and the automatic instrumentation
only has the target URL, which, as we discussed previously, cannot be used as an
operation name due to potentially high cardinality.
The database query is represented by the span named Query. If we expand that span,
we will find a number of useful span tags added by the automatic instrumentation,
including the database query.
Figure 4.8: A number of useful tags added to the database query span by the automatic instrumentation
Not too bad for a completely automatic, near zero-touch instrumentation! However,
you may have noticed something peculiar about that trace: the spans representing
HTTP endpoints in the downstream microservices end later than their parent spans.
I have rerun the tests several times and always got the same results, while the traces
from Exercise 5 - using baggage that used manual instrumentation always looked
"normal." One possible explanation for this is that the automatic instrumentation
may be done at a lower level in the Spring framework, where the spans are finished
after the server has already sent the response to the client, that is, during some
request pipeline clean-up performed by the framework.
This style of instrumentation is somewhat similar to the so-called "agent-based
instrumentation" often provided by the commercial APM vendors. Agent-based
instrumentation often works by augmenting the code of the application or libraries
using various techniques, like monkey patching and bytecode rewriting, thus
requiring no changes to the application.
Our version is similarly nearly zero-touch, but we still had code in the App classes
to instantiate the Jaeger tracer. It could be made completely zero-touch (not counting
adding the jar to the class path) by using a traceresolver module (https://
github.com/opentracing-contrib/java-tracerresolver) that allows tracers
to implement an API that automatically registers the tracer as a global tracer. Then,
the global tracer can be automatically registered as a Spring bean by using another
module from OpenTracing: https://github.com/opentracing-contrib/java-spring-tracer-configuration.
Auto-instrumentation in Python
In this exercise, we will be using two open source libraries that provide almost
automatic instrumentation of many parts of our application. The first one is the
instrumentation for Flask from https://github.com/opentracing-contrib/python-flask.
While it is not fully automatic, it allows us to install a middleware
for the Flask application with one line (two if you count the import statement):
from flask_opentracing import FlaskTracer
app = Flask('py-6-hello')
init_tracer('py-6-hello')
flask_tracer = FlaskTracer(opentracing.tracer, True, app)
Unfortunately, at the time of writing the library has not been upgraded to support
the scope manager API that was only recently released for Python. As a result, even
though the framework does create a span for every inbound request, it does not set it
as the active span. To address that, I have included a simple adapter function,
flask_to_scope(), in lib/tracing.py.
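A sketch of what that adapter might look like, assuming the get_span() accessor that flask_opentracing exposed at the time:
import opentracing


def flask_to_scope(flask_tracer, request):
    # Activate the span that FlaskTracer created for this request, but do
    # not finish it when the scope is closed: FlaskTracer finishes it itself.
    return opentracing.tracer.scope_manager.activate(
        flask_tracer.get_span(request),
        False,
    )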
Using the Flask request object, we ask FlaskTracer to give us the span it created
for the current request. We then activate that span through the OpenTracing scope
manager but pass False as the last argument, indicating that we do not want the
span to be finished once the scope is closed because FlaskTracer will take care
of finishing the span it started. With this helper function in place, we can replace
rather verbose instrumentation in our HTTP handlers with a single line:
@app.route("/getPerson/<name>")
def get_person_http(name):
with flask_to_scope(flask_tracer, request) as scope:
person = Person.get(name)
...
Let's go ahead and make the same change in the other two microservices. Once the
flask_opentracing library is upgraded to work with scope manager and active
span, even this call to flask_to_scope() will not be necessary: the span will be
available automatically. If we want to get access to it, without having the scope
variable, we can always get it from the tracer via the active_span property:
opentracing.tracer.active_span.set_tag('response', resp)
The second library, Uber's opentracing-python-instrumentation, has a submodule
called client_hooks that uses the monkey-patching
technique (that is, dynamically rewriting well-known library functions) to add
tracing to a number of modules, such as urllib2, requests, SQLAlchemy, redis
(client), and so on. Other than activating this library in the main module, it requires
no changes to the source code of our application:
from opentracing_instrumentation.client_hooks import install_all_patches
app = Flask('py-6-hello')
init_tracer('py-6-hello')
install_all_patches()
That means we can strip the manual tracing instrumentation of HTTP requests out
of the _get() function, leaving it as a plain call to the requests library.
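A minimal sketch of that simplified helper; its name, signature, and the status-code assertion are assumptions:
import requests


def _get(url, params=None):
    # No tracing code is needed here: install_all_patches() has already
    # monkey-patched the requests library to create and propagate client spans.
    r = requests.get(url, params=params)
    assert r.status_code == 200
    return r.text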
It also means we can get a deeper instrumentation of SQL queries that were hidden
from us by the ORM in SQL Alchemy. At the same time, if we still want to keep
some of our custom instrumentation, like the extra span in the get_person()
function, we can easily do it and both included libraries are compatible with that.
Figure 4.9: An example trace from Python after replacing custom instrumentation
with Flask-OpenTracing and Uber's opentracing-python-instrumentation
After applying these changes (which can be found in module exercise6), we can
still get detailed traces similar to the one in the preceding screenshot. We can see that
the auto-instrumentation added a span called "SQL SELECT". If we inspect that span,
we will see that one of its tags includes the complete SQL query executed by the
ORM framework:
SELECT people.name AS people_name, people.title AS people_title,
people.description AS people_description
FROM people
WHERE people.name = %(param_1)s
Another option to explore for "extra credit" is trying to use the same Hello
application with a different implementation of OpenTracing. You can find
OpenTracing-compatible libraries for Zipkin, and for other tracing systems [5],
both free and commercial. The only places that need changing in the exercises
are the InitTracer functions.
Summary
In this chapter, we talked about tracing instrumentation. We discussed common
tasks that need to be addressed by the instrumentation, from simple annotations
to in-process context propagation, to using inject/extract trace points to transfer
distributed context across processes.
We took a brief detour and looked at using the baggage API for passing additional
metadata alongside the distributed call graph.
References
1. The OpenTracing Project: http://opentracing.io/.
2. Code examples: https://github.com/PacktPublishing/Mastering-Distributed-Tracing/.
3. The Jaeger Project: https://jaegertracing.io/.
4. The OpenTracing Semantic Conventions: https://github.com/opentracing/specification#semantic-conventions.
5. Inventory of OpenTracing-compatible tracers and instrumentation libraries: https://opentracing.io/registry/.
Instrumentation of
Asynchronous Applications
We will continue using the OpenTracing API, even though the same
instrumentation principles would apply to other tracing APIs, such as Zipkin's
Brave and OpenCensus. Since the chat application is slightly more complex,
we will only look at implementing it in Java, with the Spring framework.
After completing this chapter, you will have the knowledge and understanding
to apply instrumentation to your own asynchronous applications.
Prerequisites
In order to run the Tracing Talk chat application, we need to deploy a number
of infrastructure dependencies: Apache Kafka (with its ZooKeeper dependency),
Redis, and the Jaeger backend. This section provides instructions on setting up
the environment to run the chat application.
$ docker-compose up
Starting chapter-06_kafka_1 ... done
Starting chapter-06_jaeger-all-in-one_1 ... done
Starting chapter-06_redis_1 ... done
Starting chapter-06_zookeeper_1 ... done
[... lots and lots of logs ...]
You should leave this running in a separate Terminal, although, if you want to run
everything in the background, you can pass the --detach flag: docker-compose up
--detach. To check that all dependencies have started successfully, run the docker
ps or docker-compose ps command. You should see four processes in the Up state,
for example:
$ docker ps | cut -c1-55,100-120
CONTAINER ID IMAGE STATUS
b6723ee0b9e7 jaegertracing/all-in-one:1.6 Up 6 minutes
278eee5c1e13 confluentinc/cp-zookeeper:5.0.0-2 Up 6 minutes
84bd8d0e1456 confluentinc/cp-kafka:5.0.0-2 Up 6 minutes
60d721a94418 redis:alpine Up 6 minutes
If you prefer to install and run each dependency without Docker, the Tracing Talk
application should still work without any additional changes, but the installation
instructions are outside of the scope of this book.
Once you are finished with this chapter, you may want to stop all the dependencies
by killing the docker-compose command, or, if you detached it from the Terminal,
by executing:
$ docker-compose down
Once the application is running, we will be able to access its web frontend at
http://localhost:8080/. Each visitor is given a random screen name, such as
Guest-1324, which can be changed using the Edit button. The chat functionality
is exceedingly basic: the new messages appear at the bottom, with the name of the
sender and a relative timestamp. One can enter a message in the form of /giphy
<topic>, which causes the application to make a call to the giphy.com REST API
and display a random image on the specified topic.
Figure 5.1: The frontend view of the Tracing Talk chat application
Implementation
We will review the implementation and source code of the main components of the
application now.
AppId
The AppId class defines a bean that exposes the name of the current service.
It is created in the main App classes in all three services, for example:
@Bean
public AppId appId() {
return new AppId("chat-api");
}
The service name is used by KafkaConfig to compose a client ID string used by the
Kafka driver.
Message
The Message class is a value object that defines the structure of the chat messages
used throughout the application:
public class Message {
public String event;
public String id;
public String author;
public String message;
public String room;
public String date;
public String image;
}
It has an init() method that is called by the chat-api service when receiving
a new message from the frontend. It is used to populate some of the metadata in
the message, such as the unique ID and the timestamp.
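A sketch of what init() might do; the exact fields it populates are assumptions based on the description above:
public void init() {
    // Assign a unique message ID and capture the creation timestamp.
    this.id = java.util.UUID.randomUUID().toString();
    this.date = java.time.Instant.now().toString();
}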
GiphyService
The GiphyService class is a helper that encapsulates the logic of calling the giphy.com
REST API and retrieving, at random, the URL of one of the 10 returned images. It
uses Spring's RestTemplate, which is auto-instrumented with OpenTracing, to make
HTTP calls.
The second method, postMessage(), handles HTTP POST requests to the same
endpoint. It reads the message JSON from the request body, calls msg.init() to
initialize some metadata, and sends the message to Kafka using the KafkaService
helper.
@RequestMapping(value = "/message",
consumes = { "application/json" },
produces = { MediaType.APPLICATION_JSON_VALUE },
method = RequestMethod.POST)
public ResponseEntity<Message> postMessage(
@RequestBody Message msg
) throws Exception {
msg.init();
System.out.println("Received message: " + msg);
kafka.sendMessage(msg);
System.out.println("Message sent sync to Kafka");
return new ResponseEntity<Message>(msg, HttpStatus.OK);
}
The actual code of this method, in the source code repository, looks slightly more
complex because it allows choosing between the kafka.sendMessage() and kafka.
sendMessageAsync() methods of the helper, based on the environment variable.
We will come back to that later in this chapter. If we run the service with default
parameters, it will use the synchronous method that we showed in the preceding
listing.
Once again, the code in the repository looks slightly different because it contains
extra statements to create OpenTracing span and scope, which we will discuss later.
Once again, we omitted the tracing code from the preceding listing.
They will all produce a lot of logs. The chat-api service will log a message
Started App in x.xx seconds once it is ready. The storage-service and giphy-
service microservices also log this message, but you may not notice it because
it is followed by the logs from the Kafka consumer. We have observed that those
services are ready when they log a line like this:
2018-xx-xx 16:43:53.023 INFO 6144 --- [ntainer#0-0-C-1]
o.s.k.l.KafkaMessageListenerContainer : partitions assigned:
[message-0]
Here [message-0] refers to partition 0 of the message topic used by the application.
When all three microservices are ready, access the frontend at http://
localhost:8080/. Try sending a message with a /giphy command, for example,
/giphy hello. You may observe peculiar behavior, where the message will appear
in the chat, then disappear, then appear again, and then the image will show up,
all with delays of about one second. This is due to the polling nature of the
frontend; what is likely happening is the following:
• The frontend sends a message to the chat-api service. The service writes it to
Kafka and returns it back to the frontend, which displays it in the chat panel.
• In a second or so, the frontend polls the chat-api service for all current
messages and may not get the latest message back, since it hasn't been
processed yet by the storage-service microservice. The frontend
removes it from the screen.
• The storage-service microservice receives the message from Kafka and
stores it in Redis. The next poll from the frontend picks it up and displays
it again.
• Finally, the giphy-service microservice updates the image in the message,
and on the next poll the frontend displays the message with the image.
Observing traces
Once the application is running and you have sent at least one message with
the /giphy command, we can check how it looks in the Jaeger UI. The Services
dropdown in Jaeger should be showing the names of our three microservices:
Figure 5.3: Searching for a trace for the postMessage endpoint in the chat-api service
If the trace was for a message that contained a /giphy command, we should see all
three services in it. We do not see, however, any spans from Kafka or Redis, which
is expected. We have encountered this behavior already in Chapter 4, Instrumentation
Basics with OpenTracing, where we also saw no spans from the MySQL database.
The reason is that all these third-party technologies are not yet instrumented
with OpenTracing, and the only spans we are able to observe are the client-side
spans from the application when it communicates to these backends. We hope
the situation will change in the future.
Figure 5.4: Gantt chart of the trace for the request to the postMessage endpoint in the chat-api service
If we go to the Gantt chart trace view for our request, we will see more details about
all the interactions that occurred in the application, and how much time they took.
The chat-api service handled the POST request by sending the message to Kafka
(the send span).
Once the message was published, both storage-service and giphy-service
received it almost simultaneously, as indicated by the receive spans. The
storage-service microservice stored it in Redis relatively quickly, while the
giphy-service microservice took a while to query the giphy.com REST API and
then resent the updated message to Kafka.
Figure 5.5: The Gantt chart of the same trace as in Figure 5.4 but zoomed to the end of the first Giphy interaction.
Spring instrumentation
Similar to Chapter 4, Instrumentation Basics with OpenTracing, we can use the
opentracing-spring-cloud-starter library to automatically enable tracing
instrumentation in many Spring components, including the RestTemplate class,
as long as the Spring container has a bean of type io.opentracing.Tracer:
<dependency>
<groupId>io.opentracing.contrib</groupId>
<artifactId>opentracing-spring-cloud-starter</artifactId>
<version>0.1.13</version>
</dependency>
Tracer resolver
Tracer resolver (https://github.com/opentracing-contrib/java-tracerresolver)
is a library that supports instantiating the tracer from the
class path using Java's Service Loader mechanism. In order to make such a tracer
available in the Spring container, we used another artefact, opentracing-spring-
tracer-configuration-starter (it transitively pulls the tracerresolver
dependency):
<dependency>
<groupId>io.opentracing.contrib</groupId>
<artifactId>
opentracing-spring-tracer-configuration-starter
</artifactId>
<version>0.1.0</version>
</dependency>
We also need the Jaeger tracer implementation itself on the class path:
<dependency>
<groupId>io.jaegertracing</groupId>
<artifactId>jaeger-client</artifactId>
<version>0.31.0</version>
</dependency>
Since we no longer write any code to configure the Jaeger tracer, we are using
environment variables to pass some parameters, which is done in the Makefile.
For example, we start the storage-service microservice with the following
parameters passed to Maven via the -D switch:
JAEGER_SAMPLER_TYPE=const
JAEGER_SAMPLER_PARAM=1
JAEGER_SERVICE_NAME=storage-service-1
Once this is all done, we can get access to the tracer, if we need to, by declaring
an auto-wired dependency, for example, in the KafkaConfig class:
import io.opentracing.Tracer;
@Configuration
public class KafkaConfig {
@Autowired
Tracer tracer;
...
}
Redis instrumentation
For the purposes of the Tracing Talk application, Redis is just another service
we call, similar to calling the giphy.com API with Spring's RestTemplate.
Unfortunately, the OpenTracing instrumentation for Spring does not yet support
RedisTemplate instrumentation.
Instead, we configure the Lettuce Redis client ourselves and wrap its connection
with the TracingStatefulRedisConnection decorator provided by the OpenTracing
community instrumentation for Redis:
@Configuration
public class RedisConfig {
@Autowired Tracer tracer;
@Bean
public StatefulRedisConnection<String, String> redisConn() {
RedisClient client = RedisClient.create("redis://localhost");
return new TracingStatefulRedisConnection<>(
client.connect(), tracer, false);
}
@Bean
public RedisCommands<String, String> redisClientSync() {
return redisConn().sync();
}
}
Kafka instrumentation
The situation is slightly better for Kafka, which already has OpenTracing support
for Spring. We are using the following dependencies:
<dependency>
<groupId>org.springframework.kafka</groupId>
<artifactId>spring-kafka</artifactId>
<version>2.1.8.RELEASE</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>io.opentracing.contrib</groupId>
<artifactId>opentracing-kafka-spring</artifactId>
<version>0.0.14</version>
</dependency>
The low 0.0.14 version of the last module hints at it being an early experimental
version. It provides the decorators for producers and consumers, but does not
have the ability to auto-enable them as part of Spring initialization.
Producing messages
Our KafkaConfig class registers the KafkaTemplate bean used to send messages
to Kafka:
@Bean
public KafkaTemplate<String, Message> kafkaTemplate() throws Exception
{
return new KafkaTemplate<>(producerFactory());
}
This is simple enough and because we are using the Spring template, we can take
advantage of Spring's serialization mechanisms by registering a JSON serializer for
the Message class. Given this template, we can send a message to Kafka like this:
@Service
public class KafkaService {
private static final String TOPIC = "message";
@Autowired
KafkaTemplate<String, Message> kafkaTemplate;
    public void sendMessage(Message message) throws Exception {
        ProducerRecord<String, Message> record =
            new ProducerRecord<>(TOPIC, message);
        kafkaTemplate.send(record).get();
    }
}
If we go back to the trace we collected in the Jaeger UI and expand one of the
send spans, we will see that it is indeed created by the java-kafka component,
with a span.kind=producer tag, and the topic name message is captured in the
message_bus.destination tag.
This should look familiar. The span context is serialized and stored in the
Kafka message headers using the tracer.inject() call. The only new part is
the HeadersMapInjectAdapter, whose job it is to adapt the Kafka record headers
to OpenTracing's TextMap carrier API.
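A sketch of such an adapter; the real class lives inside the Kafka instrumentation library and may differ in details:
private static class HeadersMapInjectAdapter implements TextMap {
    private final Headers headers;

    HeadersMapInjectAdapter(Headers headers) {
        this.headers = headers;
    }

    @Override
    public Iterator<Entry<String, String>> iterator() {
        // This carrier is only used for injection.
        throw new UnsupportedOperationException();
    }

    @Override
    public void put(String key, String value) {
        // Kafka record headers store their values as byte arrays.
        headers.add(key, value.getBytes(StandardCharsets.UTF_8));
    }
}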
Elsewhere, this method is called by passing it the span context from the currently
active span, accessed via scope.span(), and the record headers:
try {
TracingKafkaUtils.inject(scope.span().context(),
record.headers(), tracer);
} catch (Exception e) {
logger.error("failed to inject span context", e);
}
Consuming messages
Now let's review how the tracing instrumentation works on the consumer side.
We have already seen from the trace that our current instrumentation results
in two spans: one called receive, which is typically very short, and another
called process, which wraps the actual execution of the message handler.
Unfortunately, the Jaeger UI currently does not display what span reference type
is used by these spans (tracing through messaging is a relatively new field). We
can still find that out by using the View Options dropdown menu in the top-right
corner and selecting Trace JSON. This will open a new browser window with
JSON representation of the trace.
If we search for span names receive and process, we will see that both of these
types of spans are defined with a follows-from span reference type. As we discussed
briefly in Chapter 4, Instrumentation Basics with OpenTracing, the follows-from
reference is used to link the current span to its predecessor, and indicate that the
predecessor did not have a dependency on the outcome of the current span. This
makes sense in the case of asynchronous messaging applications: a producer writes
a message to the queue and does not wait for a response. Most often, it does not even
know who the consumer is, or if there are many of them, or if they are all down at
the moment. According to the semantic definition in OpenTracing, the producer does
not depend on the outcome of some process consuming the message later. Therefore,
the receive span using a follows-from reference makes perfect sense.
The relationship between the receive and process spans is actually similar to that
of a producer/consumer relationship. The receive span is executed somewhere
deep inside the Kafka driver, when a whole batch of messages is read from the
queue. The process span is started in the application code, asynchronously. The
receive span does not directly depend on the outcome of the process span, so
they are linked via a follows-from reference.
People who are new to tracing sometimes ask why we cannot have a span that starts
from the moment of the producer writing the message and ends when the consumer
receives it.
It would certainly make the Gantt chart of the trace look better, because such
charts can get really sparse for messaging workflows. There are a few reasons why
this is not the best approach; in short, the problem of the sparse Gantt charts is
better solved with improvements to the UIs.
Let's look at how all span creation is achieved in the code on the consumer side. The
receive span is created by TracingConsumerFactory, which decorates the default
consumer factory:
private ConsumerFactory<String, Message> consumerFactory()
throws Exception
{
. . .
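A sketch of how that decoration typically looks; the consumer properties are elided and the wiring details are assumptions:
private ConsumerFactory<String, Message> consumerFactory()
    throws Exception
{
    Map<String, Object> props = new HashMap<>();
    // ... bootstrap servers, group id, key/value deserializers, and so on ...
    return new TracingConsumerFactory<>(
        new DefaultKafkaConsumerFactory<>(props), tracer);
}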
Unfortunately, this is where the immature status of the library comes through: with
this code, we are only getting the receive span, but not the process span. We can
only speculate as to why; maybe the maintainers did not have time to write proper
Spring wiring for the message handlers. Another reason could be that the Kafka
driver does not provide a place to store the span context of the receive span so
that the process span can refer to it later. Sometimes, we have to work with the
limitations of frameworks that do not expose enough hooks for middleware. We
can look into the io.opentracing.contrib.kafka.TracingKafkaConsumer code,
where we find the following method:
@Override
public ConsumerRecords<K, V> poll(long timeout) {
ConsumerRecords<K, V> records = consumer.poll(timeout);
for (ConsumerRecord<K, V> record : records) {
TracingKafkaUtils.buildAndFinishChildSpan(
record, tracer, consumerSpanNameProvider);
}
return records;
}
As we can see, the consumer reads multiple Kafka records at once and then creates
a span for each of them, finishing them immediately. There is no other state,
so the instrumentation uses a trick to preserve the span context of the receive
span: it serializes it back into the message headers under a different set of keys,
by prefixing them with the string second_span_. The code can be found in the
buildAndFinishChildSpan() method in the class TracingKafkaUtils:
spanBuilder.addReference(References.FOLLOWS_FROM,
parentContext);
The same utility serializes the context of the receive span back into the record
headers through an inject adapter whose put() method applies the second_span_
prefix:
@Override
public void put(String key, String value) {
if (second) {
headers.add("second_span_" + key,
value.getBytes(StandardCharsets.UTF_8));
} else {
headers.add(key, value.getBytes(StandardCharsets.UTF_8));
}
}
}
If the instrumentation does not create the process span automatically, how did we
get it into the trace? When we showed earlier the examples of the message handlers
in the Tracing Talk application, we mentioned that they were slightly simplified.
Now let's look at their full form, for example, in the storage-service microservice:
@KafkaListener(topics = "message")
public void process(@Payload Message message,
@Headers MessageHeaders headers) throws Exception
{
Span span = kafka.startConsumerSpan("process", headers);
try (Scope scope = tracer.scopeManager().activate(span, true)) {
System.out.println("Received message: " + message.message);
redis.addMessage(message);
System.out.println("Added message to room.");
}
}
The helper method uses another adapter for the headers to extract the keys with the
second_span_ prefix (unfortunately, the identical class in the instrumentation library
is private):
public Span startConsumerSpan(String name, MessageHeaders headers) {
TextMap carrier = new MessageHeadersExtractAdapter(headers);
SpanContext parent = tracer.extract(
Format.Builtin.TEXT_MAP, carrier);
return tracer.buildSpan(name)
.addReference(References.FOLLOWS_FROM, parent)
.start();
}
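A sketch of what that extract adapter might look like, assuming the Kafka record headers are surfaced by Spring as raw byte arrays:
private static class MessageHeadersExtractAdapter implements TextMap {
    private final Map<String, String> map = new LinkedHashMap<>();

    MessageHeadersExtractAdapter(MessageHeaders headers) {
        // Keep only the entries written under the second_span_ prefix,
        // stripping the prefix so the tracer sees its usual header keys.
        for (Map.Entry<String, Object> entry : headers.entrySet()) {
            if (entry.getKey().startsWith("second_span_")
                    && entry.getValue() instanceof byte[]) {
                map.put(entry.getKey().substring("second_span_".length()),
                    new String((byte[]) entry.getValue(),
                        StandardCharsets.UTF_8));
            }
        }
    }

    @Override
    public Iterator<Entry<String, String>> iterator() {
        return map.entrySet().iterator();
    }

    @Override
    public void put(String key, String value) {
        // This carrier is only used for extraction.
        throw new UnsupportedOperationException();
    }
}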
If we did not start this span manually, as shown, our top-level trace would have
stopped with the receive span as a leaf, and calls to Redis or to the giphy.com
API would have started new traces. We leave it as an exercise to you to verify
this by commenting out the span creation statements in the message handlers.
In Java, the asynchronous code is often written with the use of futures and
executors. In our Tracing Talk application, we have already used those APIs,
only in the synchronous manner. For example, the KafkaService.sendMessage()
method calls the send() function, which returns ListenableFuture, a sign of the
asynchronous API. We turn it back to synchronous by calling get() and blocking
until it is completed:
public void sendMessage(Message message) throws Exception {
ProducerRecord<String, Message> record =
new ProducerRecord<>(TOPIC, message);
kafkaTemplate.send(record).get();
}
The tracing implementation we used takes care of transferring the tracing context
correctly through the asynchronous API boundaries. However, what if we wanted
to try it ourselves? The KafkaService class contains another method for sending
messages: sendMessageAsync(). It pretends that kafkaTemplate.send() is
a blocking call and uses a CompletableFuture and an executor to execute that
call on a thread that is different from the caller thread:
public void sendMessageAsync(Message message, Executor executor)
throws Exception
{
CompletableFuture.supplyAsync(() -> {
ProducerRecord<String, Message> record =
new ProducerRecord<>(TOPIC, message);
kafkaTemplate.send(record);
return message.id;
}, executor).get();
}
The chat-api service uses the environment variable KSEND to toggle between
using the sendMessage() and sendMessageAsync() methods. Since the second
function requires an executor, the chat-api service constructs one as follows:
@Bean
public Executor asyncExecutor() {
ThreadPoolTaskExecutor executor = new
ThreadPoolTaskExecutor();
executor.setCorePoolSize(2);
executor.setMaxPoolSize(2);
executor.setQueueCapacity(10);
executor.setThreadNamePrefix("send-to-kafka-");
executor.initialize();
return executor;
}
To use this executor, we can start the chat-api with the KSEND=async1 parameter:
$ make chatapi KSEND=async1
If we post a message to the chat and look for a trace, we will find that instead of
one trace, as we would expect, we created two traces: one for the top-level endpoint
that contains a single span postMessage, and one that starts with the send span as
the root, and contains the rest of the usual trace. Remember that in order to find
the second trace, you may need to select send in the Operation dropdown box.
Figure 5.7: Two traces instead of one when running the chat-api service with KSEND=async1 parameter
Clearly, breaking a trace like that is not what we intended. We used an asynchronous
CompletableFuture to execute the request, which is not that much different from
the future used by the Kafka template. The problem is that we did not take care of
passing the in-process context between the threads correctly. The lambda function
passed to the future executes on a different thread, which has no access to the active
span from the caller thread. In order to bridge the gap, we can get the current active
span and re-activate it inside the lambda code that runs in the other thread:
public void sendMessageAsync(Message message, Executor executor)
throws Exception
{
final Span span = tracer.activeSpan();
CompletableFuture.supplyAsync(() -> {
try (Scope scope = tracer.scopeManager().activate(span, false))
{
ProducerRecord<String, Message> record =
new ProducerRecord<>(TOPIC, message);
kafkaTemplate.send(record);
return message.id;
}
}, executor).get();
}
If we apply this change and rerun the test with a chat message (don't forget to run
mvn install first), we will see that the trace is back to normal, with postMessage
and send spans properly connected. Unfortunately, it requires adding tracing
code directly into the application code, which we were trying to avoid as much
as possible. As you may have noticed, the code we added is not in any way specific
to the lambda function that we pass to the future. If we step through the call to the
supplyAsync() code, we will eventually reach a point where a runnable is passed to
an executor. Until that point, all execution happens on the same caller thread that has
access to the current active span. Thus, the generic solution would be to instrument
the executor (by decorating it) to perform the active span transfer between threads.
This is exactly what the TracedRunnable class in the opentracing-contrib/java-
concurrent library does:
@Override
public void run() {
Scope scope = span == null ? null :
    tracer.scopeManager().activate(span, false);
try {
delegate.run();
} finally {
if (scope != null) {
scope.close();
}
}
}
}
The TracedRunnable class captures the current span when the decorator is created
and activates it in the new thread, before calling run() on the delegate. We could
wrap our executor into TracedExecutor from this library, which internally uses
TracedRunnable. However, that would still require changes to our code to apply
the decorator. Instead, we can let the Spring instrumentation do it automatically!
Since the opentracing-spring-cloud instrumentation automatically decorates
Executor beans registered in the Spring context, all we need to do is declare
a second, plain executor bean and auto-wire it:
@Autowired
Executor executor2;
As you probably have already guessed (or looked into the source code), we can tell
the chat-api service to use the second executor with the KSEND=async2 parameter:
$ make chatapi KSEND=async2
If we run the chat message test again, we will see that the trace is back to normal,
with up to 16 spans if the message contains the /giphy command.
To close this section, here is one final comment about using follows-from versus
child-of span references with asynchronous programming. Sometimes, the code
at the lower levels of the stack does not know its true relationship with the current
active span in the thread. For example, consider sending a message to Kafka,
which is done using an asynchronous send() method on the Kafka template.
We have seen that the producer decorator always creates the send span using a
child-of reference to the span active in the caller thread. In the case of the chat-api
service, it turns out to be the correct choice because we always call get() on the future
returned from send(), that is, our HTTP response is not produced until the send
span is finished. However, that is not a strict dependency and it is easy to imagine
a situation where an asynchronous API is called without expecting its completion
to affect the higher-level span. A follows-from reference is more appropriate in that
scenario, yet the lower-level instrumentation cannot infer the intent.
The best advice here is to make the caller responsible for attributing correct causality
to spans. If the caller knows that it is making a fire-and-forget type of asynchronous
call, it can pre-emptively start a follows-from span, so that the child-of span created
at the lower level has the correct causality reference. Fortunately, it is quite rare
that this needs to be done explicitly because even when we use asynchronous
programming, there is often the expectation that the parent span depends on the
outcome of the child.
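A sketch of what that might look like in the chat-api service; the span name is arbitrary and the surrounding variables (record, kafkaTemplate) are the ones used earlier in this chapter:
Span fireAndForget = tracer.buildSpan("send-and-forget")
    .addReference(References.FOLLOWS_FROM, tracer.activeSpan().context())
    .start();
try (Scope scope = tracer.scopeManager().activate(fireAndForget, true)) {
    // The producer decorator creates its send span as a child-of the span
    // active here, which itself carries the follows-from causality.
    kafkaTemplate.send(record);
}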
Summary
In this chapter, we covered the more advanced topic of tracing instrumentation
for asynchronous applications. Using Apache Kafka as an example of asynchronous
communication between services, we discussed how tracing context can be
propagated through messaging infrastructure, and how producer and consumer
spans are created at both ends of the message bus. We finished off with a discussion
about instrumenting applications that use asynchronous programming in-process,
such as futures and executors in Java. While the instrumentation itself was
OpenTracing-specific, the principles of instrumenting asynchronous applications
were general and applicable to any tracing API built around the span model.
Most of the instrumentation we used in this chapter came from various off-the-shelf
open source modules from the opentracing-contrib organization on GitHub. We
even avoided the code to instantiate the Jaeger tracer and used runtime configuration
options instead. Since all the instrumentation was vendor-agnostic, any other
OpenTracing-compatible tracer could be used.
In the next chapter, we will take a look at a larger ecosystem of tracing beyond
OpenTracing, to see what other emerging standards are being developed. Some
projects are competing and some overlap in non-trivial ways, so we will introduce
a classification by the types of problems the projects are solving, and by their
primary audience.
Tracing Standards
and Ecosystem
Styles of instrumentation
The completely manual instrumentation we did in Chapter 4, Instrumentation
Basics with OpenTracing, was useful to demonstrate the core principles, but in
practice, that instrumentation style is very rare, as it is very expensive and simply
not scalable for large, cloud-native applications. It is also quite unnecessary because
in a microservices-based application, most instrumentation trace points occur next
to process boundaries, where the communications are performed by the means
of a small number of frameworks, such as RPC libraries.
In Java applications, where dynamic code modifications like monkey patching are
not allowed, a similar effect is achieved through bytecode manipulation. For
example, the Java runtime executable has a command-line switch, -javaagent, that
loads a library that interacts with the java.lang.instrument API to automatically
apply instrumentation as classes are being loaded.
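A sketch of the agent mechanics; the transformer below is a no-op placeholder, whereas a real tracing agent would rewrite the bytecode of well-known frameworks as they are loaded:
import java.lang.instrument.Instrumentation;

// Attached with: java -javaagent:/path/to/tracing-agent.jar -jar app.jar
public final class TracingAgent {
    public static void premain(String args, Instrumentation inst) {
        // Register a ClassFileTransformer; returning null from transform()
        // leaves the class bytes unchanged.
        inst.addTransformer(
            (loader, className, cls, domain, bytes) -> null);
    }
}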
Figure 6.1: Three types of tracing instrumentation and their interactions with the Instrumentation
API. The API abstracts the exact metadata propagation and trace reporting formats from the
application and framework developers, delegating it to the tracing library implementation, which
communicates with the tracing backend. Agent-based instrumentations often bypass the API
and go to the implementation directly, making them non-portable across tracing vendors.
A typical microservice (Figure 6.2) contains some proprietary business logic, while
using a number of standard frameworks, usually open source, for its infrastructure
needs, such as a server framework for serving the inbound requests; potentially
a separate RPC framework for making calls to other microservices; database and
queue drivers for talking to infrastructure components, and so on. All of these
frameworks should be instrumented for tracing, but without a common API,
they won't even know how to pass the context metadata to each other.
Figure 6.2: A typical microservice composed of proprietary business logic and open source frameworks
• At the same time, the API should feel natural to the specific programming
language, using established idioms and naming conventions.
• The primitives must abstract away how the context metadata is formatted
on the wire, and how collected traces are reported to the tracing backend.
Those considerations are important for the operators of a tracing system
and for successful deployment, but they are irrelevant at the time of the
instrumentation and must be decoupled.
• It must be vendor-neutral. Writing instrumentation for a framework or
an application is an expensive proposition; it does not need to be coupled
with the simultaneous decision of which tracing system will be used with
that instrumentation. I will use the term "vendor" in this chapter to refer
to any of the commercial APM vendors, managed tracing systems from
cloud providers, and open source tracing projects.
• It must be lightweight, ideally packaged as a standalone library without extra
dependencies.
The additional soft requirement for the instrumentation API is that it needs to be
widely adopted. As with any standardization efforts, competing standards fracture
the community. Competition is good for implementations, but not for standards
because it saddles the other open source developers with a difficult choice: which
competing API should they use to instrument their framework? If they choose
wrongly, their tracing code becomes incompatible with the other frameworks
that someone else might use to compose their application.
Assuming that this is the complete architecture and the microservices do not
communicate with any external systems (which is actually not realistic, at least
for the Billing component, which usually needs to make calls to the external
payment processor), there are no interoperability issues in this deployment that
would prevent the collection of complete traces covering all the components
of the application.
In reality, things are usually more complicated. Imagine that our company needs
to meet data protection and privacy regulations, and after some analysis, we realize
that the fastest way to get there is to move the Billing component, which handles
credit cards, to Amazon's cloud service, AWS, because it has sufficient controls in
place. AWS provides its own tracing solution, X-Ray, which is also integrated with
other services it provides, so we may decide to use X-Ray instead of Jaeger to trace
the Billing service.
Also imagine that the database used by the Inventory module is no longer scaling
and we decide to replace it with a managed database service, for example, Google's
Cloud Spanner. Spanner, as with most managed Google services, is internally traced
with Google's StackDriver.
Figure 6.4: More complicated deployment of the imaginary application introduces integration
points between different tracing backends in order to be able to observe a full view of the system in
the traces. All three tracing systems, Jaeger, Stackdriver, and X-Ray, use different data formats for
encoding the metadata in the requests, and for storing and exposing collected trace data.
Figure 6.4 shows the extended architecture, which includes the components running
both on premises and in the cloud. The complexity of the system has only increased,
so even more than before, we want to see the complete, end-to-end traces of the
transactions. Unfortunately, we have two new problems to solve before we can
do that. The first problem is that we won't even get a single complete trace for
a given transaction.
You may remember from Chapter 3, Distributed Tracing Fundamentals, that tracing
systems propagate context metadata when services make network calls, usually by
encoding it in a certain way as request headers. In our case, the metadata is being
exchanged between components instrumented with different tracing systems: Jaeger,
Stackdriver, and X-Ray. They need to agree on the wire format for metadata in order
to preserve causality relationships between tracing data. We will discuss one such
proposed format later in this chapter.
The second problem is that different parts of the trace are now collected by different
tracing backends. In order to assemble them into a single view, we need to export
them into a single system, and in a single data format. At the time of writing, no such
standard format for trace data exists, although some attempts have been made in the
industry to create one.
In 2016, both Google and Amazon announced the availability of their own managed
tracing systems, Stackdriver and X-Ray, both with their own metadata and trace
data formats. These systems can be used to trace the applications running in the
respective clouds, as well as to receive tracing data from internal applications
run by their customers on premises. A number of other tracing systems were
created and open sourced in 2016-2018.
All this activity increased the interest in distributed tracing in the industry but
did not help with interoperability or moving tracing instrumentation into the
mainstream as a required component of application frameworks, the way logging
and metrics instrumentations have been for some time now. This led to the creation
of several open source projects aiming for the standardization of various aspects
of tracing solutions. However, some of the projects have overlaps, confusing
newcomers even further.
Figure 6.5: Different meanings of tracing: analyzing transactions (1), recording transactions (2),
federating transactions (3), describing transactions (4), and correlating transactions (5)
The very first notion that comes to mind when you hear tracing is the actual
tracing system that collects the data and provides users with the means to analyze
transactions using the tracing tool. When we took the HotROD for a test drive in
Chapter 2, Take Tracing for a HotROD Ride, we said we traced it, even though all
we did was use the Jaeger UI to look at the traces.
There will be people (and projects) to whom tracing means something else. They
will say tracing is about recording transactions in a process and sending them
to the tracing backend. For example, a service mesh sidecar, or an APM agent,
might be in the position to both capture the traces for a service and record them.
Understanding the behavior of such a system is much easier if we can trace the
transactions end to end through the message bus, but because it is a managed
service, it is much more likely to send its part of the traces to the tracing system
run by the cloud provider. By uniting or federating transactions, we can get
them to a single place where they can be analyzed end to end.
If we are developing a business service, we are usually running and deploying a lot
of code, some of which we wrote, and some of which we brought as dependencies
from shared repositories like Maven Central, GitHub, NPM, and so on, either
as infrastructure frameworks or some extensions of business functionality. To
understand all that code, we need to trace it by instrumenting it, or in other words,
by describing transactions.
Finally, coming back to the managed cloud service, we need to ensure that the traces
are not interrupted when transactions cross the domains of different tracing vendors.
By agreeing on the metadata encoding format, we are able to correlate transactions,
even when they are recorded into different tracing backends.
All these meanings of the word tracing are valid and have people who care about
them, but they all refer to different things. It is important to make sure that these
different aspects are decoupled from each other, and the projects working on them
are clear about their scope.
• Tool users, for example, DevOps, SREs, and application developers, generally
only care that the tool works and helps them to analyze transactions. They
are not involved in most projects around the standardization efforts.
• Data recorders, which often, but not always, include tracing system authors,
care about recording transactions and work on tools that may include tracing
libraries. For example, the OpenCensus project is focused on data recording,
but it is explicitly agnostic to the actual tracing backend that receives the
data. It is not maintained by the same people who maintain the tracing
backends.
• Application developers care about instrumentation APIs that help
them to describe transactions in their systems and gain visibility.
• OSS (open source software) framework developers, similar to
application developers, care about describing transactions by means
of the instrumentation API, which allows them to provide their users with
best-in-class visibility into the operation of their libraries and frameworks.
The ecosystem
In this section, we will attempt to classify some of the projects in the distributed
tracing space by the dimensions described earlier. The following table provides
a quick summary of the areas where each project has some exposure, such as
a dependency or an influence on a specific data format.
Project                        | Tracing tool | Tracer/agent | Trace data | Metadata | App/OSS instrumentation
Zipkin                         |      ✔       |      ✔       |     ✔      |    ✔     |           ✔
Jaeger                         |      ✔       |      ✔       |     ✔      |    ✔     |
SkyWalking                     |      ✔       |      ✔       |     ✔      |    ✔     |           ✔
Stackdriver, X-Ray, and so on  |      ✔       |      ✔       |     ✔      |    ✔     |           ✔
W3C Trace Context              |              |              |            |    ✔     |
W3C "Data Interchange Format"  |              |              |     ✔      |          |
OpenCensus                     |              |      ✔       |     ✔      |    ✔     |           ✔
OpenTracing                    |              |              |            |          |           ✔
Tracing systems
First, let us consider a few complete tracing systems and see which areas of the
problem space they occupy.
Jaeger
Jaeger was created at Uber in 2015 and released as an open source project in
2017. It is a relatively new project but gaining in popularity due to its focus on
OpenTracing compatibility and a neutral home at the Cloud Native Computing
Foundation. Jaeger provides a feature set similar to that of Zipkin. The Jaeger
project does not provide any instrumentation by itself. Instead, it maintains a set
of OpenTracing-compatible tracers in multiple languages that can be used with
the wealth of existing OpenTracing-based instrumentation (discussed later in this
chapter). Therefore, Jaeger has no exposure in the instrumentation column in the
table. Some of Jaeger's tracing libraries are able to send data in Zipkin format,
as well as use Zipkin's B3 metadata format.
SkyWalking
SkyWalking is another relatively new project, which originated in China and was
accepted into the Apache Software Foundation's incubator in 2017. Having started as a tracing
system, it has been gradually transforming into a full-blown APM solution, providing
metrics and alerting functionality, and even logs. As far as its tracing capability is
concerned, it is partially OpenTracing-compatible (only in Java), but the authors
invested heavily into agent-based instrumentation for a number of frameworks
popular in China, thus earning the checkmarks in the instrumentation columns.
Standards projects
As we can see, all of the tracing systems we discussed earlier have unique
solutions for each area of the problem space, including data formats for traces and
metadata, trace recording libraries, and, with the exception of Jaeger, their individual
instrumentation APIs. This makes them mostly incompatible with each other without
some special adapters. A number of standardization efforts have appeared in
the past two years, trying to bridge the gaps and bring vendor neutrality to areas
that do not require strong differentiation, such as "how do we instrument an
HTTP endpoint?"
W3C Trace Context
The Trace Context project reached one major milestone in early 2018 when
all participating vendors agreed on the conceptual model for trace identifiers.
Previously, there was a lot of disagreement on whether such a standard would
be workable for all participating tracing systems. As an example, the OpenTracing
API had for two years refrained from requiring the span context API to expose
anything resembling trace and span ID, in fear that some tracing systems would
not be able to support that. Since the decision in the W3C Trace Context, that hurdle
has been removed.
The Trace Context working draft suggests two protocol headers to propagate the
tracing metadata. The first header, traceparent, holds a canonical representation
of the trace and span IDs as 16-byte and 8-byte arrays, respectively, encoded as
hexadecimal strings, and a flags field, which carries the indicator of a sampling
decision made upstream from the current service. As an example, it might look
like this:
Traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
The leading 00 is the version of the specification, and the 01 at the end is an 8-bit
mask with the least significant bit indicating that the trace was sampled upstream.
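As a rough illustration (a simplified sketch, not the reference code from the specification), a tracer could split the header value into its four parts and inspect the sampled flag like this:

public class TraceparentExample {
    public static void main(String[] args) {
        parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
    }

    static void parse(String headerValue) {
        String[] parts = headerValue.split("-");
        String version = parts[0];           // spec version, "00"
        String traceId = parts[1];           // 16-byte trace ID, hex-encoded (32 chars)
        String parentSpanId = parts[2];      // 8-byte span ID, hex-encoded (16 chars)
        int flags = Integer.parseInt(parts[3], 16);
        boolean sampled = (flags & 0x01) != 0; // least significant bit = sampled upstream
        System.out.println("version=" + version + " " + traceId + "/"
                + parentSpanId + " sampled=" + sampled);
    }
}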
The second header, tracestate, is designed to provide tracing vendors with
a place to store and pass additional metadata that may be specific to the vendor
and not supported in the canonical format of the traceparent header. As an
example, SkyWalking propagates additional fields that cannot be represented
in the traceparent header, such as parent span id or parent service instance id [6].
Compliant tracers are required to pass tracestate values to the next node in the
call graph, which is useful when the transaction execution leaves the domain of one
tracing vendor and then comes back to it. Each vendor uses a unique key to mark its
own state, for example:
Tracestate: vendorname1=opaqueValue1,vendorname2=opaqueValue2
Despite the narrow focus and relative simplicity of the format proposal, the project
has been running for over a year. This shows why a narrow focus is so important
for such open source projects: even a simple format takes many months to get
agreement from many stakeholders and to work out all the edge cases. As of the time
of writing, the following questions remain unresolved:
• Should a managed cloud service respect the trace ID it receives from a
customer, or start a new trace of its own? In the case of a leaf service like
Spanner (leaf from the customer's point of view), there is a simple solution:
the incoming trace ID is not respected, but recorded in the trace as a tag,
for example "correlation ID", which still allows the traces from the customer's
tracing system to be correlated with traces in the cloud provider's tracing
systems. But what if the managed service is not a leaf, for example,
a messaging system like AWS Kinesis?
• If a commercial APM vendor encodes the name or ID of the customer in
the custom vendor state, and two of its customers, A and B, are exchanging
requests, how should it represent its vendor state in the tracestate header
once the request with a state for A arrives to customer B? Since there is
a chance that later requests may come back to the first customer, A, the
vendor might want to preserve both, but how should that be encoded
in a single field in the header?
I hope these edge cases will be resolved soon and the specification ratified with an
official recommendation version. The header standard will open up an opportunity
for distributed transactions crossing systems with different tracing libraries to
maintain a single view of distributed traces.
OpenCensus
The OpenCensus project [7] originates from an internal set of libraries at Google that
were called Census. It defines its mission as a "vendor-agnostic single distribution of
libraries to provide metrics collection and tracing for your services." Its roadmap also
mentions logs collection, so in the future, it may increase in scope even more.
The OpenCensus project occupies a wider problem area space than the other
standardization efforts, as it combines the instrumentation API with the underlying
implementation of trace recording and the logical data model, especially of the
metadata.
OpenCensus is very opinionated about the metadata and follows the original
Google Dapper format of trace ID, span ID, and a bitmask for flags. For example,
its Go library defines the SpanContext as follows:
type TraceID [16]byte
type SpanID [8]byte
type TraceOptions uint32
This representation is the exact match for the metadata used by Google Stackdriver,
as well as the Zipkin and Jaeger projects, all of which can trace their roots to Dapper.
It is fully compatible with the W3C Trace Context format, without the need to
pass any additional data in the tracestate header. However, many other existing
tracing systems define their metadata quite differently. For example, an APM vendor
Dynatrace [8] has a much more complex metadata propagation format that contains
seven fields:
<clusterId>;<serverId>;<agentId>;<tagId>;<linkId>;<tenantId>;<pathInfo>
For the purpose of this discussion, it is not important what those fields mean, but
it is clear that they do not fit cleanly into the OpenCensus view of the metadata. In
contrast, the OpenTracing API that we will discuss next imposes no restrictions on
what data goes into the span context.
One obvious strength of the OpenCensus project is combining tracing and metrics
functionality. On the surface, it sounds like those two are completely different
problem domains and should be kept separately. However, as we have seen in the
HotROD example in Chapter 2, Take Tracing for a HotROD Ride, some metrics might
benefit from the additional labels, such as partitioning a single metric into many
time series depending on the top-level customer or the product line.
This is only possible through the use of distributed context propagation, which is
usually available in the tracing libraries. Thus, unless the access to the propagated
context is completely de-coupled from the tracing functionality (more on that in the
Tracing Plane discussion in Chapter 10, Distributed Context Propagation), the metrics
API inevitably becomes coupled with tracing. This coupling, however, is one way:
the metrics functionality depends on the context propagation feature of tracing
functionality, but not vice versa. The same considerations will apply to the logging
APIs once OpenCensus implements them.
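As a hedged illustration of this one-way coupling, the snippet below reads a value propagated via OpenTracing baggage and uses it as a label on a metric; the Metrics and Counter interfaces here are hypothetical stand-ins for whatever metrics API is actually in use.

import io.opentracing.Span;
import io.opentracing.Tracer;

public class RequestCounter {
    private final Tracer tracer;
    private final Metrics metrics; // hypothetical metrics facade

    public RequestCounter(Tracer tracer, Metrics metrics) {
        this.tracer = tracer;
        this.metrics = metrics;
    }

    public void countRequest() {
        // the "customer" value was set as baggage somewhere upstream
        Span span = tracer.activeSpan();
        String customer = span == null ? null : span.getBaggageItem("customer");
        // partition a single counter into per-customer time series
        metrics.counter("requests", "customer",
                customer == null ? "unknown" : customer).inc();
    }

    // hypothetical interfaces, included only to keep the example self-contained
    interface Metrics {
        Counter counter(String name, String labelKey, String labelValue);
    }

    interface Counter {
        void inc();
    }
}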
The OpenCensus project does not get a checkmark in the trace data column in our
table because, fortunately, it is not opinionated about that external format. Instead, it
uses special modules called exporters whose job it is to convert the internal span data
model into some external representation and send it to a particular tracing backend.
As of the time of writing, OpenCensus by default ships with exporters for Zipkin,
Jaeger, and Stackdriver tracing systems.
OpenTracing
Finally, we come back to the OpenTracing project [2]. We already know from
the previous chapters that it is a collection of instrumentation APIs in different
programming languages. The goals of the OpenTracing project are limited to providing
consistent, expressive, vendor-neutral APIs for describing transactions.
We can see from our comparison table that there are no other problems that
OpenTracing is trying to solve. OpenTracing has no opinion on the format of
metadata encoding, or the format of trace data, or the method of recording and
collecting the traces, because those are not the concerns of the instrumentation
domain.
This was a deliberate choice on the part of the project founders, who believe that
decoupling unrelated problems is a good engineering practice, and that a pure API
leaves a lot more freedom for innovation to the implementers than devising a single
implementation.
Another group, consisting mostly of end users of tracing and called the OpenTracing
Industrial Advisory Board (OTIAB), is tasked with advising the OpenTracing
Specification Council (OTSC) based on their experiences, successes, and challenges.
Proposals to make changes to the
specification and language APIs go through a formal and rigorous request for
comment (RFC) process. Sometimes, breaking changes have to be made, a decision
that is not taken lightly by the project members. As an example, the change of
the Java API from version 0.30 to 0.31 was accompanied by the development of
a large test suite illustrating how the new features were to be used in a variety
of instrumentation scenarios. It even included a separate adapter module to ease
the transition from version 0.30.
Summary
Deploying a tracing system is not just a matter of instrumenting your code or
running a tracing backend. In this chapter, we discussed five problem areas that
must be addressed in order to "have a good time" when collecting traces and
analyzing your system's behavior and performance. The areas are analyzing,
recording, federating, describing, and correlating transactions.
Most existing tracing systems cover all those areas, but in their own ways, which
are incompatible with other tracing systems. This limits interoperability, which
is especially problematic when using managed cloud services. There are four
standardization projects in the industry, each trying to address a different part
of the problem space. We reviewed the scope of each project and discussed
why having a narrow scope is important for them to be successful.
In the next chapter, we will discuss how other technologies, specifically service
mesh proxies, can be used to standardize methods of extracting tracing data from
applications.
References
1. A collection of Python instrumentation tools for the OpenTracing API: https://github.com/uber-common/opentracing-python-instrumentation/.
2. The OpenTracing Project: http://opentracing.io/.
3. Automatic OpenTracing instrumentation for 3rd-party libraries in Java applications: https://github.com/opentracing-contrib/java-specialagent/.
4. Distributed Tracing Working Group, World Wide Web Consortium (W3C): https://www.w3.org/2018/distributed-tracing/.
5. Trace Context: Specification for distributed tracing context propagation format: https://www.w3.org/TR/trace-context/.
6. Apache SkyWalking Cross-Process Propagation Headers Protocol: https://github.com/apache/incubator-skywalking/blob/master/docs/en/protocols/Skywalking-Cross-Process-Propagation-Headers-Protocol-v2.md.
7. The OpenCensus Project: https://opencensus.io/.
8. Dynatrace: Software intelligence for the enterprise cloud: https://www.dynatrace.com/.
9. The OpenTracing Specification and Project Governance: https://github.com/opentracing/specification/.
10. The OpenTracing Contributions: https://github.com/opentracing-contrib/.
11. The OpenTracing Registry: https://opentracing.io/registry/.
Tracing with Service Meshes
In this chapter, we will discuss and try in practice how service meshes, a relatively
new phenomenon in the cloud-native landscape, can be used to deploy distributed
tracing using an approach that is somewhat in between the white-box and black-box
techniques. We will use Istio [2], a service mesh platform developed by Google, IBM,
and Lyft, which comes with built-in Jaeger integration. We will use Kubernetes [3]
to deploy both Istio and our sample application.
Service meshes
Service meshes have become increasingly popular in the past two-to-three years,
as more and more organizations have embarked on the path of replacing the old
monolithic applications with distributed architectures based on microservices. In
Chapter 1, Why Distributed Tracing, we discussed the benefits and challenges of these
transitions. As microservices-based applications grow in size and complexity, the
communications between the services require more and more support from the
infrastructure, to address problems like discovery, load balancing, rate limiting,
failure recovery and retries, end-to-end authentication, access control, A/B testing,
canary releases, and so on. Given a common trend in the industry, where different
parts of distributed applications are often written in different programming
languages, implementing all this infrastructure functionality as libraries, included
in the individual services in every language, becomes intractable. We would much
rather implement it once and find a way to reuse the functionality across different
services.
One traditional answer to this need was a centralized middleware layer, such as an
enterprise service bus (ESB) [4], through which all inter-service communications are
routed. From a performance point of view, that means every time two services, A and B,
need to talk to each other, there must be two network hops: A → ESB and ESB →
B. For modern cloud-native applications, where a single user request may involve
dozens, or even hundreds of microservices, these extra network hops quickly add
up and contribute to the application latency. Also, a problem or a bug in the single
central layer could bring the whole application down.
The term service mesh often refers to the use of the sidecar pattern to provide the
infrastructure for microservices communications. The term itself is a bit misleading,
since the dictionary definition of a "mesh" would more accurately apply to the
network of microservices that make up a distributed application and the interactions
between them. However, the industry seems to be converging on using service mesh
to refer to the actual communications routing and management infrastructure, and
so will we in this book.
Service mesh platforms typically provide two separate components, the data
plane and the control plane. The data plane is a set of network proxies deployed
as sidecars that are responsible for the runtime tasks, such as routing, load balancing,
rate limiting, circuit breaking, authentication and security, and monitoring. In other
words, the data plane's job is to translate, forward, and observe every network
packet that goes in and out of the service instance. There is a separate design pattern
used to describe this category of sidecars, known as the Ambassador pattern and
named this way because the sidecar handles all the external communications on
behalf of the application. Examples of network proxies that can be used as a service
mesh data plane include Envoy, Linkerd, NGINX, HAProxy, and Traefik.
The control plane decides how the data plane should perform its tasks. For example,
how does the proxy know where to find service X on the network? Where does it
get the configuration parameters for load balancing, timeouts, and circuit breaking?
Who configures the authentication and authorization settings? The control plane is
in charge of these decisions. It provides policy and configuration for all the network
proxies (data plane) that run in a service mesh. The control plane does not touch any
packets/requests in the system, as it is never on the critical path. Istio, which we will
use in this chapter, is an example of a service mesh control plane. It uses Envoy as
the default data plane.
The diagram in Figure 7.2 shows an architecture of the Istio service mesh platform.
It is not in the scope of this book to complete a thorough review of Istio, so I will
only provide a brief overview. The data plane is represented by the network proxies
collocated with the service instances. The control plane at the bottom consists of an
API layer and three components:
• Pilot provides service discovery for the proxy sidecars, traffic management
capabilities for intelligent routing (for example, A/B tests, canary
deployments, and so on), and resiliency (timeouts, retries, circuit breakers,
and so on).
• Mixer enforces access control and usage policies across the service mesh,
and collects telemetry data from the proxy sidecars and other services.
• Citadel provides strong service-to-service and end-user authentication,
with built-in identity and credential management.
Istio supports distributed tracing with Jaeger, Zipkin, and other tracing systems.
With the emergence of the standards for tracing data formats that we discussed
in Chapter 6, Tracing Standards and Ecosystem, the tracing capabilities provided by
the service meshes should become completely portable across different tracing
backends.
Routing all traffic through the sidecar proxies enables several observability features:
• The sidecar can emit uniformly named metrics about the traffic going in
and out of a service instance, such as throughput, latency, and error rates
(also known as the RED method: Rate, Errors, Duration). These metrics can
be used to monitor the health of the services and for creating standardized
dashboards.
• The sidecar can produce rich access logs. Sometimes these logs need
to be produced and stored securely for compliance reasons. Having a
single logical component responsible for all those logs is very convenient,
especially for applications that are implemented with multiple languages
and frameworks, which makes standardizing log format and processing
rather difficult.
• The sidecar can generate traces as well. Network proxies like Envoy can
handle not only the standard RPC traffic through common protocols like
HTTP or gRPC, but other types of network calls as well, such as calls to
MySQL databases or Redis cache. By understanding the call protocols,
the sidecars can generate rich tracing spans with many useful attributes
and annotations.
The ability of the service mesh to generate traces is obviously the most interesting
in the context of this book. By generating traces outside of the application, in
the routing sidecar, the method looks almost like the holy grail of the black-box
approach, which works without any changes to the application. Unfortunately,
if something seems too good to be true, it probably is. As we will see, the service
mesh tracing does not quite rise up to the level of treating the application as
a complete black box. Without the application doing some minimal work to
propagate the context, even a service mesh cannot guarantee complete traces.
Prerequisites
In the rest of this chapter, we will focus on collecting tracing data via Istio.
We will use a modified version of the Hello application we developed in
Chapter 4, Instrumentation Basics with OpenTracing, and deploy it along with
Istio onto a Kubernetes cluster. First, we will describe some dependencies
and installations you will need to do in order to run the examples.
I should note that setting up Kubernetes and Istio is not a particularly easy
endeavor. If you are able to follow the directions below and get everything
running, then great: you will have a platform where you can experiment further
with the ideas introduced in this chapter. However, if you run into a wall and
cannot get everything running, do not despair; just read on to understand
the concepts and the integration between tracing and a service mesh.
The source code of the sample application for this chapter is organized as follows:
formatter/
hello/
Dockerfile
Makefile
app.yml
gateway.yml
routing.yml
pom.xml
This version of the application contains only two microservices defined in the
submodule exercise1. Included is a Dockerfile used to build container images
that we deploy to Kubernetes, and a Makefile with some convenience targets for
building and deploying the application. The three YAML (*.yml) files contain
Kubernetes and Istio configurations.
Kubernetes
In order to run Istio, we need a working Kubernetes installation. The examples in
this chapter were tested with minikube (version 0.28.2), which runs a single-node
Kubernetes cluster inside a virtual machine. Instructions on how to install it are
beyond the scope of this book, so please refer to the documentation at
https://kubernetes.io/docs/setup/minikube/.
Istio
I used Istio version 1.0.2 to run the examples in the chapter. Full installation
instructions can be found at https://istio.io/docs/setup/kubernetes/quick-start/.
Here, I will summarize the steps I took to get it working on minikube.
public FController() {
    if (Boolean.getBoolean("professor")) {
        template = "Good news, %s! If anyone needs me " +
            "I'll be in the Angry Dome!";
    } else {
        template = "Hello, puny human %s! Morbo asks: " +
            "how do you like running on Kubernetes?";
    }
    System.out.println("Using template: " + template);
}

@GetMapping("/formatGreeting")
public String formatGreeting(@RequestParam String name,
                             @RequestHeader HttpHeaders headers) {
    System.out.println("Headers: " + headers);
    return String.format(template, name);
}
The entry point to the application is the hello service. Since we don't have the
bigbrother service to call anymore, the hello service acts as a simple proxy,
by calling the endpoint /formatGreeting of the formatter service.
We added a few help messages at the end to remind you to build against the
right Docker registry. After the build is done, we can deploy the application:
$ make deploy-app
The first one instructs Istio to decorate our deployment instructions in app.yml with
the sidecar integration, and applies the result. The second command configures the
ingress path, so that we can access the hello service from outside of the networking
namespace created for the application. The last command adds some extra routing
based on the request headers, which we will discuss later in this chapter.
To verify that the services have been deployed successfully, we can list the running
pods:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
formatter-svc-v1-59bcd59547-8lbr5 2/2 Running 0 1m
formatter-svc-v2-7f5c6dfbb6-dx79b 2/2 Running 0 1m
hello-svc-6d789bd689-624jh 2/2 Running 0 1m
As expected, we see the hello service and two versions of the formatter service.
In case you run into issues deploying the application, the Makefile includes useful
targets to get the logs from the pods:
$ make logs-hello
$ make logs-formatter-v1
$ make logs-formatter-v2
We are almost ready to access the application via curl, but first we need to get the
address of the Istio ingress endpoint. I have defined a helper target in the Makefile
for that:
$ make hostport
export GATEWAY_URL=192.168.99.103:31380
Either execute the export command manually or run eval $(make hostport).
Then use the GATEWAY_URL variable to send a request to the application using curl:
$ curl http://$GATEWAY_URL/sayHello/Brian
Hello, puny human Brian! Morbo asks: how do you like running on
Kubernetes?
As you can see, the application is working. Now it's time to look at the trace collected
from this request. The Istio demo we installed includes Jaeger installation, but it is
running in the virtual machine and we need to set up port forwarding to access it
from the local host. Fortunately, I have included another Makefile target for that:
$ make jaeger
kubectl port-forward -n istio-system $(kubectl get pod -n istio-system -l app=jaeger -o jsonpath='{.items[0].metadata.name}') 16686:16686
Forwarding from 127.0.0.1:16686 -> 16686
Forwarding from [::1]:16686 -> 16686
This allows us to access the Jaeger UI via the usual address:
http://localhost:16686/. Let's go to the Dependencies | DAG page first,
to see what services were registered by Jaeger during tracing (Figure 7.3).
Figure 7.3: Services captured in Jaeger traces for the Hello application
This graph looks a bit strange because our hello and formatter services appear
to be duplicated. If we go back to the app.yml deployment file, we will see that we
are passing the app_name environment variable to our services with values hello
and formatter. Those names are passed to Jaeger tracers as the service names via
the JAEGER_SERVICE_NAME=${app_name} Java system property. So, we would
expect to see a link between the hello and formatter services on the service graph.
Instead, we see those two services hanging off two other nodes, hello-svc and
formatter-svc. These two extra names are the names we gave to the services in
the Kubernetes configuration:
apiVersion: v1
kind: Service
metadata:
  name: hello-svc
---
apiVersion: v1
kind: Service
metadata:
  name: formatter-svc
The spans for them are created automatically by Envoy. The Envoy proxy intercepts
both inbound and outbound network calls for each service, which explains why the
hello service has links in both directions to and from hello-svc service. Effectively,
the hello service is wrapped by the hello-svc service. The formatter service
does not make any outbound calls, so the sidecar only intercepts the inbound call
to it, which is indicated by the incoming arrow. Using different names for the actual
services and for their Kubernetes counterparts allows us to observe these finer details
in the service graph.
What are the other services we see in the graph? They are part of the Istio system.
The node istio-ingressgateway represents the public API endpoint we are
accessing via curl. The three nodes on the right, policy, mixer, and telemetry,
are the other services of Istio. They are not on the critical path of the main application
requests, but are still captured in the same trace. Let's look at that trace in a Gantt
chart view. I recommend selecting hello-svc in the Services dropdown menu to
ensure you find the right trace.
In the screenshot (Figure 7.4), I collapsed the istio-policy spans to make room
for the more interesting parts of the trace. We see three spans highlighted with ovals
on the left that are emitted by the white-box instrumentation in Spring Boot. The rest
of the spans are produced by Istio. If we expand one of those, we will see that there
are a lot of additional details captured in the span tags (Figure 7.5).
As we can see from the trace, using a service mesh adds quite a bit of complexity
to the request-processing pipeline. If we were to run this Hello application without
a service mesh and capture a trace, it would contain just three spans: two from the
hello service and one from the formatter. The same trace in the service mesh
contains 19 spans. However, if we were to run our simple application in the actual
production environment, we would have to deal with many of the same concerns
that the service mesh solves for us, so the complexity of the trace is just a reflection
of that reality. When tracing via the service mesh, we at least have the visibility into
all those additional interactions happening in our architecture.
An astute reader may have noticed that even though I started the chapter with the
premise that the service mesh may provide black-box style tracing of our application,
in practice we used an application that is internally instrumented for tracing via
the Spring Boot–OpenTracing integration. What happens if we remove the white-box
instrumentation? Fortunately, in our application it is easy to try. All we need to do
is remove (or comment out) these Jaeger dependencies from the files exercise1/
hello/pom.xml and exercise1/formatter/pom.xml:
<!--
<dependency>
    <groupId>io.jaegertracing</groupId>
    <artifactId>jaeger-client</artifactId>
</dependency>
<dependency>
    <groupId>io.jaegertracing</groupId>
    <artifactId>jaeger-zipkin</artifactId>
</dependency>
-->
$ make build-app
mvn install
[... skip many logs ...]
docker build -t hello-app:latest .
[... skip many logs ...]
Successfully built 58854ed04def
Successfully tagged hello-app:latest
*** make sure the right docker repository is used
*** on minikube run this first: eval $(minikube docker-env)
$ make deploy-app
istioctl kube-inject -f app.yml | kubectl apply -f -
service/hello-svc created
deployment.extensions/hello-svc created
service/formatter-svc created
deployment.extensions/formatter-svc-v1 created
deployment.extensions/formatter-svc-v2 created
kubectl apply -f gateway.yml
gateway.networking.istio.io/hello-app-gateway unchanged
virtualservice.networking.istio.io/hello-app unchanged
istioctl create -f routing.yml
Created config virtual-service/default/formatter-virtual-svc at
revision 191779
Created config destination-rule/default/formatter-svc-destination at
revision 191781
Wait until the pods are running (use the kubectl get pods command to check),
and then send a request:
$ curl http://$GATEWAY_URL/sayHello/Brian
Hello, puny human Brian! Morbo asks: how do you like running on
Kubernetes?
If we search for traces involving hello-svc, we will see that instead of one trace,
we have two traces (look for the timestamp on the right, Figure 7.6).
Figure 7.6: Two traces instead of one after we removed white-box instrumentation
If we open the shorter trace (first in the screenshot), we will see that the top
two spans represent the egress from the hello-svc, and then the ingress to
the formatter-svc, both captured by the sidecar. The other spans are the
administrative activity (calls to the mixer, and so on):
Figure 7.7: One of the two traces after we removed white-box instrumentation
So, how is this supposed to work then? If we read the documentation for systems
like Linkerd or Envoy, we will find a fine print saying that in order for tracing to
work, the application must propagate a set of known headers from every inbound
call to all outbound calls. To try this out, I have added a second controller to the
hello service, called HelloController2, which has the additional logic of copying
a set of headers required by Istio from the inbound request to the outbound request:
private final static String[] tracingHeaderKeys = {
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "x-ot-span-context"
};
The main handler method invokes this copyHeaders() method and passes the
result to the formatGreeting() method to include in the outbound request:
@GetMapping("/sayHello2/{name}")
public String sayHello(@PathVariable String name,
@RequestHeader HttpHeaders headers) {
System.out.println("Headers: " + headers);
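The copyHeaders() helper itself can be as simple as a loop over the tracingHeaderKeys array; the exact code in the sample application may differ from this sketch:

private HttpHeaders copyHeaders(HttpHeaders incoming) {
    HttpHeaders outgoing = new HttpHeaders();
    for (String key : tracingHeaderKeys) {
        String value = incoming.getFirst(key);
        if (value != null) {
            outgoing.set(key, value); // forward the tracing header unchanged
        }
    }
    return outgoing;
}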
To execute this code path, we only need to send a request to a different URL,
/sayHello2:
$ curl http://$GATEWAY_URL/sayHello2/Brian
Hello, puny human Brian! Morbo asks: how do you like running on
Kubernetes?
All looks good; let's try to find the trace in Jaeger, which looks like the screenshot
in Figure 7.8. Once again, I collapsed the calls to the mixer to remove the distraction.
We can clearly see that the proper context transfer is happening and a single trace is
starting from the entry into the Istio gateway, and all the way down to the call to the
formatter service.
Figure 7.8: A trace without white-box instrumentation but with manual copying of headers
We have come to the most important point of this chapter. A service mesh provides
great tracing capabilities outside of the application, but it does not work without some
form of instrumentation. Should one use full white-box tracing instrumentation or
just pass the headers? I recommend using regular tracing instrumentation.
Despite the benefits of full tracing instrumentation, we have to admit that it takes
more code to implement it, especially when we do not have nice integration, as
with Spring Boot. Developers need to learn not only how to pass the context around
the application, but also how to start and finish spans, how to annotate them with
useful attributes, and how to inject and extract the context when crossing the process
boundaries. Learning how to do that adds more cognitive load for the developers.
The graphs produced by Istio resemble the graph generated by Jaeger; however,
they do not include the hello and formatter nodes that were present in the Jaeger
graph. This is because the graphs are not generated from the tracing data but from
the telemetry collected by the proxy sidecars. Without the tracing information, the
sidecars only know about pairwise communications between nodes, but that is
enough to construct the graphs we see in the screenshots, and it does not depend
on any white-box instrumentation in the services. Another nice thing about Istio
service graphs is that they use the data collected in real time, and accept parameters
like filter_empty and time_horizon, which allow you to control whether to show
all services or only those actively receiving traffic, and to control the time window.
$ make deploy-app
$ kubectl get pods
If we repeat the request with curl's -A switch, which sets the User-Agent header
on the request (for example, -A Netscape), we get a new response, this time from
v2 of the formatter. Why is this request routed
to v2 of the service when others are not? The answer is that we defined a
routing rule based on the OpenTracing baggage. First, let's look at the code of the
HelloController:
@GetMapping("/sayHello/{name}")
public String sayHello(@PathVariable String name,
@RequestHeader HttpHeaders headers) {
Span span = tracer.activeSpan();
if (span != null) {
span.setBaggageItem("user-agent",
headers.getFirst(HttpHeaders.USER_AGENT));
}
We can see here that we are getting the User-Agent header from the HTTP request
headers and setting it as baggage on the current span with the key user-agent.
This is the only change in the code. From previous chapters, we already know that
the baggage will be automatically propagated to all downstream calls, namely to
the formatter service. However, the formatter service itself does not depend on
this baggage. Instead, we defined an Istio routing rule (in the file routing.yml) that
checks for the string *Netscape* in the header baggage-user-agent and forwards
the request to v2 of the formatter-svc:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: formatter-virtual-svc
spec:
  hosts:
    - formatter-svc
  http:
    - match:
        - headers:
            baggage-user-agent:
              regex: .*Netscape.*
      route:
        - destination:
            host: formatter-svc
            subset: v2
    - route:
        - destination:
            host: formatter-svc
            subset: v1
This is obviously a toy example. Where would one use this functionality in real
systems? There may be many applications of this approach, including A/B testing,
canary deployments, routing traffic from test accounts differently, and so on. For
example, assume we are about to release a new version of some downstream service,
and we want to do a canary deployment where only traffic from a specific group of
users is routed to that version (for example, only the requests from the company's
own employees who are dogfooding the application). The problem is, first we need
to run some code in the upper layer to determine whether the user executing the
current request is eligible to be routed to the new version, and second, the routing
needs to happen somewhere downstream on a per-request basis. The combination
of distributed context propagation and service mesh routing works exceptionally
well in this case.
Summary
Service meshes are a powerful platform for adding observability features
to distributed, microservices-based applications. Without any changes to the
application, they produce a rich set of metrics and logs that can be used to monitor
and troubleshoot the application. Service meshes can also generate distributed traces,
provided that the application is white-box instrumented to propagate the context,
either by passing the headers only, or through normal tracing instrumentation.
In this chapter, we have discussed the pros and cons of both approaches and
showed examples of the traces that can be obtained with each approach. The
sidecar proxies comprise the data plane of the service mesh, and their intimate
knowledge of inter-service communications allows for the generation of detailed
and up-to-date service graphs. By combining the OpenTracing baggage (distributed
context propagation facility) with routing rules in the service mesh, we can perform
targeted, request-scoped routing decisions, useful for A/B testing and canary
deployments.
So far, we have discussed various ways of extracting tracing data from applications.
In the following chapter, the last one of Part II, we will review different sampling
strategies that affect which data, and how much of it, is captured by the tracing
infrastructure.
References
1. Michael Chow, David Meisner, Jason Flinn, Daniel Peek, Thomas F. Wenisch. The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services. Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation, October 6-8, 2014.
2. Istio: Connect, secure, control, and observe services: https://istio.io/.
3. Kubernetes: Production-Grade Container Orchestration: https://kubernetes.io/.
4. Wikipedia: Enterprise Service Bus: https://en.wikipedia.org/wiki/Enterprise_service_bus.
All About Sampling
In head-based sampling, the decision whether to sample is made at the very beginning
of a trace, usually at its root. The decision is recorded as part of the trace metadata and propagated throughout
the call graph as part of the context. This sampling scheme is consistent because it
ensures that either all spans of a given trace are captured by the tracing system or
none of them are. Head-based sampling is employed by the majority of existing
industrial-grade tracing systems today.
When the sampling decision must be made at the root of the trace, there is relatively
little information available to the tracer on which to base that decision. Nonetheless,
there are many algorithms that are used in today's tracing systems to help with
making that decision, which we will discuss in the following sections.
Probabilistic sampling
In probabilistic sampling, the sampling decision is made based on a coin toss
with a certain probability, for example (using a pseudo-language):
class ProbabilisticSampler(probability: Double) {
  def isSampled: Boolean = {
    if (Math.random() < probability) {
      return true
    } else {
      return false
    }
  }
}
Some tracer implementations make use of the fact that in many tracing systems,
the trace ID itself is a randomly generated number and can be used to avoid making
a second call to the random number generator, in order to reduce the overhead:
class ProbabilisticSampler(probability: Double) {
  val boundary: Double = Long.MaxValue * probability
  // assumes the trace ID is a random, non-negative number
  def isSampled(traceId: Long): Boolean = traceId < boundary
}
Probabilistic samplers are by far the most popular in the tracing systems using
head-based sampling. For example, all Jaeger tracers and the Spring Cloud Sleuth
[2] tracer default to probabilistic samplers. In Jaeger, the default sampling probability
is 0.001, that is, tracing one in a thousand requests.
Rate limiting sampling
The following code shows a sample implementation of a rate limiter used in Jaeger.
Instead of the leaky bucket algorithm terminology, it is implemented as a virtual
bank account that has a fixed rate of credits being added to it, up to a maximum
balance, and the sampling decision is allowed if we have enough credits to withdraw
a certain fixed amount of credits (typically 1.0). Every time a call is made to check
the balance, the current credit balance is recalculated based on the elapsed time,
and then compared to the withdrawal amount:
class RateLimiter(creditsPerSecond: Double, maxBalance: Double) {
val creditsPerNanosecond = creditsPerSecond / 1e9
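A fuller sketch of the same credit-bank logic, written here in Java for concreteness; the field and method names, such as checkCredit, are illustrative and simplified rather than Jaeger's exact implementation:

class RateLimiter {
    private final double creditsPerNanosecond;
    private final double maxBalance;
    private double balance;
    private long lastTick;

    RateLimiter(double creditsPerSecond, double maxBalance) {
        this.creditsPerNanosecond = creditsPerSecond / 1e9;
        this.maxBalance = maxBalance;
        this.balance = maxBalance;          // start with a full balance
        this.lastTick = System.nanoTime();
    }

    synchronized boolean checkCredit(double itemCost) {
        long now = System.nanoTime();
        // add credits accrued since the last call, capped at maxBalance
        balance = Math.min(maxBalance,
                balance + (now - lastTick) * creditsPerNanosecond);
        lastTick = now;
        if (balance >= itemCost) {
            balance -= itemCost;            // withdraw the cost of one decision
            return true;
        }
        return false;
    }
}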
Given this RateLimiter class, the rate limiting sampler can be implemented by
attempting to withdraw the amount of 1.0 units for every call to isSampled. Note
how this sampler supports rates of sampling less than 1 (for example, one trace in
10 seconds), by setting maxBalance to 1.0:
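Continuing the Java sketch above (again illustrative, not Jaeger's actual code):

class RateLimitingSampler {
    private final RateLimiter rateLimiter;

    RateLimitingSampler(double maxTracesPerSecond) {
        // a maxBalance of at least 1.0 supports rates below one trace per second,
        // for example 0.1 means at most one trace every 10 seconds
        this.rateLimiter = new RateLimiter(maxTracesPerSecond,
                Math.max(maxTracesPerSecond, 1.0));
    }

    boolean isSampled() {
        return rateLimiter.checkCredit(1.0); // withdraw 1.0 credit per decision
    }
}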
The rate of sampled traces allowed by the rate limiting sampler usually has
no correlation with the actual traffic going through the application, thus the
extrapolation calculations possible with the probabilistic sampler are not possible
here. For this reason, rate limiting samplers may not be very useful in isolation.
Guaranteed-throughput probabilistic
sampling
To partially address the issue of rate limiting for services with spiky traffic,
Jaeger tracers implement a guaranteed-throughput sampler that is a combination
of a probabilistic sampler for normal operations and an additional rate limiter for
low-traffic periods. The rate limiter is only consulted when the probabilistic sampler
decides not to sample. This ensures that a given trace point is sampled with at least
a certain minimal rate, hence the name "guaranteed throughput." Here is the basic
algorithm of this sampler:
class GuaranteedThroughputSampler(
    probability: Double,
    minTracesPerSecond: Double
) {
  val probabilistic = new ProbabilisticSampler(probability)
  val lowerBound = new RateLimitingSampler(minTracesPerSecond)

  def isSampled: Boolean = {
    // consult the rate limiter first, so that its budget is also
    // consumed by the traces sampled probabilistically
    val lowerBoundDecision = lowerBound.isSampled
    return probabilistic.isSampled || lowerBoundDecision
  }
}
The reason the lower bound sampler is invoked before checking the result of the
probabilistic sampler is to ensure that probabilistic decisions still count toward
the rate limits. The real implementation you can find in Jaeger code is a little bit more
involved, as it also captures the description of the sampler that made the decision as
sampler.type and sampler.param tags on the root span. Thus, if a trace is sampled
because of the probabilistic sampler, the backend can still perform extrapolation
calculations.
Adaptive sampling
One of the primary reasons we sample traces is to avoid overloading tracing
backends with too much data, which they may not be able to handle. It is possible
to use simple probabilistic sampling and to tune the probabilities to ensure a steady
rate of trace data coming to the tracing backend. However, that assumes the business
traffic stays at roughly stable levels, which is rarely the case in practice. For example,
most online services handle more traffic during the day than during the night.
There are several ways in which tracing backends can deal with such fluctuating traffic.
Another common problem with using simple samplers is that they do not
distinguish between workloads with different traffic volumes. For example, a Gmail
server may have an endpoint get_mail that is called a thousand times more often
than a manage_account endpoint.
If every endpoint were sampled with the same probabilistic sampler, then its
probability would have to be small enough to ensure a low overhead and trace
volume from the high-traffic get_mail endpoint. It would be too small to get
enough traces from the manage_account endpoint, even though that endpoint
has enough overhead budget to tolerate higher sampling rates.
There are other differences between services and endpoints that a one-size-fits-all
sampling strategy cannot accommodate:
• Some services may be more important than others and we may want more
traces from them
• Some services may only produce small and shallow traces, while others
may produce traces with thousands of spans
• Some endpoints in the same service may be very important (for example,
StartTrip in the Uber app), while others are not that interesting when
traced (for example, a ping with the car location every few seconds)
• Some services may be running only a few instances, while others may run
hundreds or thousands of instances, resulting in vastly different volumes
of traces
As we will see in Chapter 14, Under the Hood of a Distributed Tracing System, Jaeger
client libraries were intentionally designed with a feedback loop from the Jaeger
tracing backend that allows the backend to push configuration changes back to
the clients. This design allows us to build more intelligent adaptive sampling
controlled from the backend. Unlike local adaptive sampling, which has only
limited information from a single service instance for making a sampling decision,
the backend is able to observe all traces post-collection and calculate adjustments
to sampling parameters based on a global view of the traffic patterns.
Goals
There are several different goals we may be trying to achieve with adaptive
sampling:
• The most basic objective is to achieve a stable rate of sampled traces
per second (TPS) flowing to the tracing backend from each service
and endpoint.
• Since traces originating from different services may differ in the number
of spans by orders of magnitude, another possible objective is to achieve
a stable number of spans per second flowing to the tracing backend in the
sampled traces.
• Different spans can vary significantly in their byte size, depending on how
many tags and events are recorded in them. To account for that factor, the
target measure can be bytes per second, which includes the total byte size
of all spans in all traces sampled by a service.
Some of these goals may be more important than the others; it really depends on
the distributed system and the verbosity level of tracing instrumentation. Most of
them cannot be addressed via local adaptive sampling, since the information is not
available at the time of making the sampling decision. The global adaptive sampling
algorithm described in the following section can work for all three goals, although,
in practice, Jaeger currently only implements the TPS optimization.
Theory
The adaptive sampling used in Jaeger is conceptually similar to the classic
proportional-integral-derivative (PID) controller used in a variety of applications
requiring continuously modulated control, such as cruise control on a vehicle.
Imagine that we are driving a car on a highway and want to keep it at a steady
speed of 60 mph. We can think of the car as the process that we want to control,
by observing its current speed as the measured process value, y(t), and affecting it
by changing the power output of the vehicle's engine as the correction signal, u(t),
in order to minimize the error, e(t) = r(t) − y(t), between the desired process value
(the setpoint), r(t), of 60 mph and the current speed, y(t). A PID controller
(Figure 8.1) calculates the correction signal, u(t), as a weighted sum of proportional,
integral, and derivative terms (denoted by P, I, and D respectively), which give the
controller its name.
Figure 8.1: A traditional PID controller, whose behavior is defined by the coefficients K_p, K_I, and K_D
As we will see in the following section, the proportional, integral, and derivative
terms of the standard PID controller are not well suited for the problem of adaptive
sampling directly because of the distributed nature of the implementation.
Architecture
The architecture of Jaeger adaptive sampling is shown in Figure 8.2. On the left, we
see a number of microservices, each running the Jaeger tracing library that collects
tracing data and sends it to the Jaeger collectors (solid lines). The adaptive sampling
infrastructure in the collectors calculates the desired sampling probabilities for all
services and provides them back to the tracing libraries that periodically poll for that
information (dashed lines). We do not show the complete collection pipeline here,
as it is not relevant. The adaptive sampling is implemented by four cooperating
components that are running inside each Jaeger collector.
The Counter listens to all spans received by the collector and keeps track of the
number of root spans received from each microservice, and each operation. If
the tags in the root span indicate that it was sampled via a probabilistic sampler,
it counts as a new sampled trace.
The Counter aggregates the counts for a given observation period, τ (usually one
minute), and at the end of the period, saves the accumulated counts to the Counts
table in the database. Since each collector runs a Counter component, there are K
summaries saved to the database every period, where K is the number of collectors.
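For concreteness, the counting logic of such a component could be sketched in Java as follows; the names (Counter, countRootSpan, flush) are illustrative and do not correspond to the actual Jaeger code:

class Counter {
    static final class Counts { long traces; long sampled; }

    private final java.util.Map<String, Counts> perOperation = new java.util.HashMap<>();

    // called for every root span received by this collector
    synchronized void countRootSpan(String service, String operation,
                                    boolean sampledByProbabilistic) {
        Counts c = perOperation.computeIfAbsent(service + "/" + operation,
                k -> new Counts());
        c.traces++;
        if (sampledByProbabilistic) {
            c.sampled++; // traces sampled via the probabilistic sampler
        }
    }

    // called at the end of each observation period (tau), e.g. once a minute;
    // the returned summary is what gets written to the Counts table
    synchronized java.util.Map<String, Counts> flush() {
        java.util.Map<String, Counts> summary = new java.util.HashMap<>(perOperation);
        perOperation.clear();
        return summary;
    }
}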
Similar to the PID controller, the new sampling probability is calculated from the
previous probability u(t−1), the desired rate of sampled traces r(t), and the rate
y(t) observed from the counts:

u'(t) = u(t−1) × q, where q = r(t) / y(t)
However, we do not always want to output the exact value u'(t) calculated in the
preceding equation. Let's consider a couple of examples. Let's assume that our target
rate r(t) = 10 TPS, and we observed the current rate as y(t) = 20 TPS. That means
that the current sampling probability used by the service is too high, and we want
to reduce it by half:

q = r(t) / y(t) = 10 / 20 = 1/2. So:

u'(t) = u(t−1) × q = u(t−1) / 2
This change to the sampling probability is safe to apply (u(t) ⇐ u'(t)), since it
will result in less data sent to the tracing backend. In fact, we want to apply that
probability as soon as possible, since we are clearly oversampling this service and
may be overloading our tracing backend. Now consider the reverse situation, where
r(t) = 20 and y(t) = 10:

u'(t) = u(t−1) × r(t) / y(t) = 2 × u(t−1)
In other words, our target is to sample twice as many traces as we are actually
sampling. The intuitive solution would be to double the current sampling
probability. However, when we tried this in practice, we observed a lot of
volatility in both the levels of sampling probability and the volume of sampled
traces, due to the following traffic patterns:
• Some services have periodic spikes in traffic. For example, every 30 minutes
some cron job wakes up and starts querying the service. During the quiet
periods, the tracing backend receives hardly any traces from this service,
so it tries to increase the sampling rate by raising the sampling probability,
possibly all the way to 100%. Then, once the cron job runs, every single
request to the service is sampled and it slams the tracing backend with
tracing data. This can last for several minutes, since adaptive sampling
takes time to react to the traffic and propagate the new sampling strategies
back to the clients.
• Another similar pattern occurs with services that may have their traffic
drained to another availability zone for a certain period of time. For
example, site reliability engineers (SREs) at Uber have a standing
operating procedure to failover the traffic for a certain city during a severe
outage, as a way to minimize time to mitigation, while other engineers are
investigating the root cause of the outage. During these failover periods, no
traces are being received from the service, which again misleads the adaptive
sampling into believing it should raise the sampling probability.
To dampen these fluctuations, the algorithm restricts how fast the sampling
probability is allowed to grow, using a damping function that caps the relative
increase at a threshold θ:

β(ρ_new, ρ_old, θ) = ρ_old × (1 + θ),   if (ρ_new − ρ_old) / ρ_old > θ
                     ρ_new,             otherwise
Here ρ_old and ρ_new correspond to the old u(t−1) and new u'(t) probabilities,
respectively. The following table shows a couple of examples of the impact of
this function. In the first scenario, a relatively large probability increase from 0.1
to 0.5 is suppressed, and only allowed to increase to 0.15. In the second scenario,
a relatively small probability increase from 0.4 to 0.5 is allowed unchallenged:

                            Scenario 1    Scenario 2
ρ_old                       0.1           0.4
ρ_new                       0.5           0.5
θ                           0.5 (50%)     0.5 (50%)
(ρ_new − ρ_old) / ρ_old     4.0           0.25
β(ρ_new, ρ_old, θ)          0.15          0.5
With the damping function, the final calculation of the control output looks like this:

u'(t) = u(t−1) × r(t) / y(t)

u(t) = u'(t),                          if u'(t) < u(t−1)
       min(1, β(u'(t), u(t−1), θ)),    otherwise
In some cases, this is acceptable. For example, if the capacity of the tracing backend
is limited and we would rather have guaranteed representation of all microservices
in the architecture than allocate most of the tracing capacity to the high-throughput
services.
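To make the calculation above concrete, here is a small sketch in Java of the damped probability update; the class and method names are illustrative and do not correspond to Jaeger's actual code.

class ProbabilityCalculator {
    private final double theta; // maximum relative increase allowed per adjustment

    ProbabilityCalculator(double theta) {
        this.theta = theta;
    }

    // the damping function beta(rhoNew, rhoOld, theta) from the formula above
    private double dampen(double rhoNew, double rhoOld) {
        if ((rhoNew - rhoOld) / rhoOld > theta) {
            return rhoOld * (1 + theta);
        }
        return rhoNew;
    }

    // computes u(t) from the previous probability u(t-1), the target rate r(t),
    // and the observed rate y(t)
    double nextProbability(double prev, double targetRate, double observedRate) {
        double proposed = prev * (targetRate / observedRate); // u'(t)
        if (proposed < prev) {
            return proposed; // decreases are applied immediately
        }
        return Math.min(1.0, dampen(proposed, prev));
    }
}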
Extensions
How do we apply the adaptive sampling algorithm to the other two optimization
goals: spans per second or bytes per second? In the design described earlier, no
collector ever sees the full trace, since they only operate on individual spans, and
services in the main application are free to send their tracing data to any collector.
If the new service is not getting a lot of traffic, it might never get sampled with this
probability. The lower-bound rate limiter comes to the rescue and samples at least
some small amount of traces, enough to kick-start the recalculations in the adaptive
sampler. However, this lower-bound rate is yet another parameter to control our
system, and we are back to the question: which value is appropriate for this parameter?
Suppose we set it to one trace every minute. It seems reasonable, so why not?
Unfortunately, that does not work too well if we suddenly deploy a thousand instances
of the service. The lower-bound rate limiter is local to each instance of the service; each
of them is now sampling one trace every minute, or 1000 ÷ 60 = 16.7 traces per second.
If this happens across hundreds of microservices (and endpoints!), we suddenly have a
lot of traces being sampled by the lower-bound rate limiter, instead of the probabilistic
sampler that we want. We need to get creative and come up with a scheme of
assigning the lower-bound sampling rate that takes into account how many distinct
instances of each service and endpoint we are running. One solution is to integrate
with a deployment system that can hopefully tell us that number. Another solution is
to use the same adaptive sampling algorithm we used to compute probabilities, and
apply it to calculate the lower-bound rates appropriate for each service.
Context-sensitive sampling
All sampling algorithms we have discussed so far use very little information about
the request executed by the system. There are situations when it is useful to focus
the attention on a specific subset of all production traffic and sample it with higher
frequency. For example, our monitoring system might alert us that only users using
a certain version of the Android app are experiencing problems. It would make sense
to increase sampling rates for requests from this app version, in order to collect more
complete tracing data and diagnose the root cause.
At the same time, if our sampling happens in the backend services, and not in
the mobile app, we do not want to deploy new versions of those services that
contain the code evaluating this specific condition for sampling. Ideally, the tracing
infrastructure would have a flexible mechanism of describing the context-sensitive
selection criteria and pushing them to the tracing libraries in the microservices,
to alter the sampling profiles.
Facebook's Canopy [3] is a notable example of a tracing system that supports such
infrastructure. It provides a domain-specific language (DSL) that allows engineers
to describe the profile of requests they want to sample at higher rates. The predicates
described in this DSL are automatically propagated to the tracing code running in the
microservices and executed against the inbound requests for a predefined period of
time. Canopy even has the ability to isolate traces sampled via this mechanism into
a separate namespace in the trace storage, so that they can be analyzed as a group,
independently of the rest of the traces sampled via normal algorithms.
If an inbound request carries the jaeger-debug-id header, for example
jaeger-debug-id: foo-bar, Jaeger will ensure that the trace created for this request
is sampled, and will also mark it as a debug trace, which tells the backend to exclude
this request from any additional down-sampling. The value foo-bar of the header is
stored as a tag on the root span, so that the trace can be located in the Jaeger UI
via tag search.
Let's try this with the HotROD application. If you do not have it running, please refer
back to Chapter 2, Take Tracing for a HotROD Ride, for instructions on how to start it
and the Jaeger standalone backend. When ready, execute the following command:
$ curl -H 'jaeger-debug-id: find-me' 'http://0.0.0.0:8080/dispatch'
Missing required 'customer' parameter
The service returns an error, which is expected. Let's pass a parameter that it needs:
$ curl -H 'jaeger-debug-id: find-me' \
'http://0.0.0.0:8080/dispatch?customer=123'
{"Driver":"T744909C","ETA":120000000000}%
The exact values you receive might be different, but you should get back a JSON
output that indicates a successful response. Now let's go to the Jaeger UI. If you have
it open with previous results, click on the Jaeger UI text in the top-left corner to go to
the blank search page. Select the frontend service from the Services dropdown and
enter the query jaeger-debug-id=find-me in the Tags field (Figure 8.3):
Now click on the Find Traces button at the bottom, and Jaeger should find two traces
(Figure 8.4):
Figure 8.4: Traces found in the Jaeger UI via debug id tag search
Note that the smaller trace is not marked as having errors, even though it returned
an error message to us, and if we inspect it, we will find the HTTP status code 400
(Bad Request). This may be a bug in the opentracing-contrib/go-stdlib library we used
to instrument the HTTP server. On the other hand, some people may argue that only
status codes in the 500-599 range (server faults) should be marked as errors, but not
client faults (400-499).
If we click into either of these traces and expand their top-level spans, we will find
that they both have a tag jaeger-debug-id with the value find-me. The trace view
page has a View Options dropdown in the top-right corner. If we select the Trace
JSON option, we can see that the flags field of the root span is set to value 3, or
00000011 in binary, which is a bitmask where the right-most (least significant) bit
indicates that the trace was sampled, and the second right-most bit indicates that it
was a debug trace.
{
"traceID": "1a6d42887025072a",
"spanID": "1a6d42887025072a",
"flags": 3,
"operationName": "HTTP GET /dispatch",
"references": [
],
"startTime": 1527458271189830,
"duration": 106,
"tags": [
{
"key": "jaeger-debug-id",
"type": "string",
"value": "find-me"
},
...
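For reference, the two flag bits described above can be expressed as constants (a sketch; the Jaeger Go client defines similar flags, but treat the exact names as illustrative):

// Bit flags of the span's "flags" field.
const (
    flagSampled byte = 1 << 0 // least significant bit: the trace was sampled
    flagDebug   byte = 1 << 1 // second bit: the trace was forced as a debug trace
)

// A root span with flags == 3 has both bits set: flagSampled|flagDebug == 0b00000011.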
In some situations, the tracing team does not even have full control over the
sampling rates set by the users. For example, in the Jaeger libraries, the default
configuration instantiates a special sampler that constantly consults the tracing
backend about which sampling strategies it should use in a given microservice.
However, nothing prevents the engineer developing that microservice from
turning off the default configuration and instantiating a different sampler, such as
a probabilistic sampler with a higher sampling probability. Even if there is no ill
intent, it is often useful to run the service with 100% sampling during development,
and sometimes people forget to change the setting in the production configuration.
The tracing team would want to protect itself from these accidents.
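For example, with the Jaeger Go client's config package (github.com/uber/jaeger-client-go/config, imported here as jaegercfg), overriding the default remotely-controlled sampler takes only a few lines; treat the details below as an approximate sketch:

// The default configuration uses a "remote" sampler that periodically asks
// the Jaeger backend which sampling strategy to use for this service.
// An engineer can override it, intentionally or not, like this:
cfg := jaegercfg.Configuration{
    ServiceName: "my-service", // hypothetical
    Sampler: &jaegercfg.SamplerConfig{
        Type:  "probabilistic",
        Param: 1.0, // 100% sampling: handy in development, risky in production
    },
}
tracer, closer, err := cfg.NewTracer()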
Post-collection down-sampling
One way to protect the tracing backend from overload is with a second round
of sampling, after the traces arrive at the collection tier. This approach was
described in the Dapper paper [1], and it is also implemented in Jaeger.
This down-sampling technique provides the tracing team with an additional knob
they can use to adjust the average global sampling rate and to control how much
data is being stored in the trace storage. If a misbehaving service deploys a bad
sampling configuration and starts flooding the tracing backend, the tracing team
can increase the down-sampling rate to bring the trace volume back within the
backend capacity, while contacting the service owners and asking them to fix it.
The down-sampling ratio can also be adjusted automatically using an approach
similar to adaptive sampling, which we discussed earlier.
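A minimal sketch of such a down-sampler follows; Jaeger's implementation is, to my knowledge, conceptually similar, hashing the trace ID so that the decision is deterministic and all spans of a trace share the same fate (uses hash/fnv and math from the standard library):

// Keep a trace only if the hash of its trace ID, mapped onto [0, 1),
// falls below the configured down-sampling ratio.
func keepTrace(traceID string, ratio float64) bool {
    h := fnv.New64a()
    h.Write([]byte(traceID))
    return float64(h.Sum64())/float64(math.MaxUint64) < ratio
}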
Throttling
Throttling attempts to solve the problem of oversampling at the source, at the
moment the sampling decision is made by a tracer on the first span of the trace.
Irrespective of how that decision was made, even if it was forced by the user
by sending the jaeger-debug-id header, throttling uses another rate limiter
to overrule that decision if it deems that the service is starting too many traces
(or too many debug traces).
Canopy has been reported to use such throttling in the tracing libraries. At the time
of writing, Uber's internal build of Jaeger implements throttling for debug traces, but
the functionality is not yet released in the open source version of Jaeger. Selecting
the appropriate throttling rate has the same challenges as selecting the lower-bound
sampling rate described previously: a single value of the throttling rate may not be
suitable for services that serve vastly different traffic volumes.
Most likely, we will be looking to extend the adaptive sampling algorithm to support
the calculation of the throttling rate, since in the end, adaptive sampling is just
another version of the distributed rate limiting problem.
Tail-based consistent sampling
Suppose that the interesting anomalies we would like to capture occur in roughly one
request out of a thousand, and that our probabilistic sampler keeps one trace out of
a thousand; then an interesting trace is captured only about once in a million requests.
Although we could say that one in a million is not that low, given how much traffic
goes through modern cloud-native applications, and we probably will capture
some of those interesting traces, it also means that the remaining 999 traces out of
each thousand that we do capture and store in the tracing backend are perhaps not
that interesting.
It would be nice if we could delay the sampling decision until we see something
unusual recorded in the trace, such as an abnormal request latency, or an error, or
a call graph branch we have not seen before. Unfortunately, by the time we detect the
anomaly, it is too late, since the earlier fragments of the trace that may have caused
the abnormal behavior will not be recorded in a tracing system that uses head-based
sampling.
Tail-based sampling addresses this problem by making the sampling call at the
end of the request execution, when we have the complete trace and can make a
more intelligent decision about whether it should be captured for storage or not.
What constitutes an interesting trace is in fact an active area of academic research;
the anomalies mentioned above (unusual latency, errors, previously unseen call graph
branches) are just some examples of potential selection criteria. However, tail-based
sampling does not come for free; it has two significant implications:
• Trace data collection must be enabled for 100% of traffic. This introduces a
much larger performance overhead compared to head-based sampling, where
the calls to tracing instrumentation are short-circuited to be effectively a no-op
when the request is unsampled. To recap, without sampling, the Dapper
tracer was reported to impose a 1.5% throughput and a 16% latency overhead.
• Until the request execution is completed, and the full trace is assembled and
passed through a sampling decision, the collected trace data must be kept
somewhere.
It is interesting to note that tail-based sampling changes the reason for using
sampling. We are no longer trying to reduce the performance overhead on the
application. We are still trying to restrict the amount of trace data we store in the
tracing backend, but we just said that "all data must be kept somewhere," so how
does that work?
The important thing to realize is that we only need to keep all data somewhere while
the request execution is in flight. Since most requests in modern applications execute
very fast, we only need to hold onto the data for a given trace for mere seconds. After
that, we can make a sampling decision, and in most cases, discard that data, since
we are still going to be storing the same overall ratio of traces in the tracing backend
as we would with head-based sampling. So, because we want to avoid introducing
additional performance overhead, keeping the trace data in memory for the duration
of the request seems like a reasonable approach.
There are some existing solutions in the market today that successfully employ tail-
based sampling. To my knowledge, LightStep, Inc. was the pioneer in this area, with
its proprietary LightStep [x]PM technology [5]. In December 2018, SignalFx, Inc.
announced that they also rolled out an application performance monitoring (APM)
solution with tail-based sampling called NoSample™ Architecture [6]. At the same
time, a stealth-mode start-up Omnition (https://omnition.io/) has been working
with the OpenCensus project on building an open source version of OpenCensus
collector that supports tail-based sampling [7].
Let's consider how a tail-based sampling architecture could work. In Figure 8.5, we
see two microservices and two request executions recorded as traces T1 and T2. At
the bottom, we see two instances of trace collectors. The objective of the collectors is
to temporarily store in-flight traces in memory until all spans for a trace are received
and invoke some sampling algorithm to decide if the trace is interesting enough to be
captured in the persistent trace storage on the right.
In contrast with the previous architecture diagram, where collectors were shown as
a uniform cluster because they were stateless, these new collectors must use a data
partitioning scheme based on the trace id, so that all spans for a given trace are sent
to the same collector instance.
In the diagram, we show how all spans for traces T1 and T2 are collected in the first
and the second collectors respectively. In this example, only the second trace, T2, is
considered interesting and sent to the storage, while trace T1 is simply discarded to
free up space for newer traces.
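A minimal sketch of the trace ID-based routing, assuming a fixed list of collector addresses (a real deployment would use consistent hashing and the coordination machinery discussed next):

// Send every span to the collector responsible for its trace ID, so that
// all spans of one trace accumulate on the same collector instance.
func collectorFor(traceID string, collectors []string) string {
    h := fnv.New32a() // hash/fnv
    h.Write([]byte(traceID))
    return collectors[int(h.Sum32())%len(collectors)]
}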
What are some of the challenges of implementing this architecture? The data
partitioning is certainly an unpleasant addition, as it requires coordination
between the collectors to decide which of them owns which range of the partition
keys. Usually, this would be achieved by running a separate service, like Apache
Zookeeper or an etcd cluster. We would also have to deal with the reshuffling of
the partition ring should one of the collectors crash, maybe running each collector
with one or more replicas for data resiliency, and so on. Rather than implementing
all this new functionality in the tracing system, it would probably be easier to use
some existing scalable in-memory storage solution.
Another, perhaps even more interesting, challenge is the cost of sending the tracing
data out of the application process to the collectors. In the systems with head-based
sampling, this cost is usually negligible due to low sampling rates, but in a tail-based
sampling system, we have to send every single span to another server and that is
potentially many spans for each business request processed by the service. The cost
itself consists of two parts: serializing the trace data into a binary representation
for transporting over the wire, and the network transmission. There are a few ways
these costs can be reduced.
Finally, before we can make a sampling decision, we need to be sure that the trace
is finished, that is, the collector received all spans generated for this trace. This is
a harder problem than it seems, because spans can arrive from different hosts in
the architecture, even from mobile applications outside of our data center, with
unpredictable delays. We will discuss some solutions to this problem in Chapter 12,
Gathering Insights with Data Mining. The most naive implementation is to simply wait
for a certain time interval during which no more spans are received for a trace and
then declare the trace finished and ready for sampling.
Partial sampling
Let's conclude the overview of the sampling techniques with an approach used
in some tracing systems where the sampling decision does not guarantee consistent
collection of all spans of a trace. It does not mean the sampling decision is made
completely randomly at every node of the call graph, but rather that only a portion
of the call graph is sampled. Specifically, the sampling decision can be made after
detecting an anomaly in the trace, such as unusual latency or an error code. The
tracing library is changed slightly to keep all the spans for currently active traces
in memory until the entry span is finished. Once the sampling decision is made,
we send all those spans to the tracing backend. Even though we will miss the spans
from any downstream calls we already made from this service, as they finished the
execution without being sampled, at least the inner workings of the current service
will be represented in the trace, and we will be able to see which downstream
systems were called and possibly caused the error. In addition, we can propagate
the fact that we triggered sampling of the current trace backwards in the call graph, to
the caller, which would also have all its spans in memory waiting for the completion
of the request. By following this procedure, we can sample a subtree of the call graph
above the current node, and possibly other future subtrees that the upstream service
might execute in response to the error from the service that detected the issue.
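A minimal sketch of this buffering behavior inside a tracing library (the types and callbacks are illustrative; a real library would also cap the memory used by the buffer):

// Buffer finished spans per trace until the entry span finishes, then either
// flush them to the backend (if an anomaly triggered sampling) or drop them.
type traceBuffer struct {
    mu    sync.Mutex
    spans map[string][]Span // keyed by trace ID; Span is a placeholder type
}

func (b *traceBuffer) onSpanFinished(traceID string, s Span) {
    b.mu.Lock()
    defer b.mu.Unlock()
    b.spans[traceID] = append(b.spans[traceID], s)
}

func (b *traceBuffer) onEntrySpanFinished(traceID string, sampled bool, report func([]Span)) {
    b.mu.Lock()
    buffered := b.spans[traceID]
    delete(b.spans, traceID)
    b.mu.Unlock()
    if sampled { // for example, an error or unusual latency was detected
        report(buffered)
    }
}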
Summary
Sampling is used by tracing systems to reduce the performance overhead on the
traced applications, and to control the amount of data that needs to be stored in the
tracing backends. There are two important sampling techniques: head-based consistent
sampling, which makes the sampling decision at the beginning of the request execution,
and tail-based sampling, which makes the sampling decision after the execution.
This concludes the part of the book dedicated to the data gathering problem in
distributed tracing. In Part III, we will look into some practical applications and
use cases for end-to-end tracing, beyond those that we reviewed in Chapter 2,
Take Tracing for a HotROD Ride.
References
1. Benjamin H. Sigelman, Luiz A. Barroso, Michael Burrows, Pat Stephenson,
Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag,
Dapper, a large-scale distributed system tracing infrastructure, Technical
Report dapper-2010-1, Google, April 2010.
2. Spring Cloud Sleuth, a distributed tracing solution for Spring Cloud: https://
cloud.spring.io/spring-cloud-sleuth/.
3. Jonathan Kaldor, Jonathan Mace, Michał Bejda, Edison Gao, Wiktor
Kuropatwa, Joe O'Neill, Kian Win Ong, Bill Schaller, Pingjia Shan, Brendan
Viscomi, Vinod Venkataraman, Kaushik Veeraraghavan, and Yee Jiun Song,
Canopy: An End-to-End Performance Tracing and Analysis System, Symposium
on Operating Systems Principles.
4. Pedro Las-Casas, Jonathan Mace, Dorgival Guedes, and Rodrigo Fonseca,
Weighted Sampling of Execution Traces: Capturing More Needles and Less
Hay. In Proceedings of the 9th ACM Symposium on Cloud Computing,
October 2018.
5. Parker Edwards. LightStep [x]PM Architecture Explained: https://
lightstep.com/blog/lightstep-xpm-architecture-explained/.
6. Ami Sharma and Maxime Petazzoni. Reimagining APM for the Cloud-Native
World: Introducing SignalFx Microservices APM™: https://www.signalfx.
com/blog/announcing-signalfx-microservices-apm/.
7. OpenCensus Service: https://github.com/census-instrumentation/
opencensus-service
III
Getting Value from Tracing
Turning the Lights On
In the second part of the book, we reviewed various techniques for instrumenting
our applications for end-to-end tracing and getting the tracing data out. This next
part is all about what we can do with that data, as well as the tracing infrastructure
in general.
We have already seen some glimpses of what is possible in Chapter 2, Take Tracing for
a HotROD Ride, when we ran the HotROD demo application. In this chapter, we will
do a more thorough review of the benefits provided by end-to-end tracing, and ways
of using the tracing data to help engineers with day-to-day tasks. Some of the ideas
presented here are theoretical, meaning that while they are feasible, not all of them
are implemented in the existing tracing systems because there is so much to do when
you start exploring the possibilities of end-to-end tracing. I hope that you can use
some of these ideas as inspiration for what to do with the data generated by your
own tracing infrastructure.
Imagine you are a new team member who is not familiar with the system you are
hired to work on. How would you learn about the system architecture, deployment
modes, or bottlenecks? Documentation is rarely up to date enough to be a reliable
source of information. Most engineers would rather build new features than spend
time on documenting how the system works. An example from my own experience:
I am working on a project to build a new distributed monitoring system for
compliance purposes, which started just two months ago. Already the system
architecture described in the original Request for Comments (RFC) document is
out of date because better ways were discovered during development and testing.
This situation is typical for many teams, especially those that make heavy use of
microservices and agile development practices.
If documentation is not reliable, then the only way new team members learn
about the system is by asking the existing team members to explain it. This
is a slow process that relies on a lot of tribal knowledge, which may also be
inaccurate in fast-moving organizations.
Service graphs
You have seen examples of the service graphs in previous chapters, some generated
by Jaeger, and others generated by the service mesh Istio. There are other open source
tools, such as Weaveworks Scope [2] and Kiali [3], that can provide similar service
graphs by integrating with other infrastructure components, like service meshes, or
simply by sniffing network connections and traffic. These graphs can be extremely
useful for quickly grasping the architecture of the system. They are often annotated
with additional monitoring signals, such as the throughput of the individual edges
(requests per second), latency percentiles, error counts, and other golden signals
indicating the health or performance of a service. Various visualization techniques
can be used to enhance the presentation, such as representing relative throughput of
the edges with line thickness or even animation (for example, Netflix Vizceral [4]), or
color-coding healthy and less healthy nodes. When the application is small enough to
fit all of its services in a single service graph, the graph can be used as an entry point
to the rest of the monitoring stack, as it allows for a quick overview of the application
state, with the ability to drill down to the individual components:
Figure 9.1: Examples of service graphs. Left: architecture of the HotROD application from Chapter 2 discovered
by Jaeger. Right: architecture of the Hello application from Chapter 7 discovered by the service mesh Istio.
Basic service graphs have an important limitation: they are built from pair-wise
caller-callee relationships, so they cannot tell us which end-to-end paths requests
actually take through the architecture. Despite this limitation, pair-wise dependency
graphs are the de facto standard among most distributed tracing tools and
commercial vendors today.
A more advanced, path-aware tool, such as the one built at Uber and shown later
in this section, is able to accurately filter out the irrelevant nodes because it builds
the graph from the tracing data by accounting for the actual paths through
the architecture observed in the traces. Consider the call graph involving five
microservices and their endpoints in Figure 9.2, part (A). The algorithm collects
all unique branches in the tree, starting from the root service A and ending at
a leaf node. The process is repeated for all traces being aggregated for a certain
time window. Then, the final dependency graph for a selected focal service B is
reconstructed from the accumulated branches by filtering out all paths that do
not pass through service B, as shown in part (C).
Figure 9.2: (A) sample call graph between services and endpoints. (B) a collection of
paths extracted by the algorithm. (C) dependency graph reconstructed from the collected
paths after selecting service B as the focal point.
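A minimal sketch of the aggregation step described above, assuming a simple in-memory representation of the call tree (the types are illustrative):

// Illustrative only: collect all root-to-leaf paths from a trace's call tree,
// then keep the paths that pass through a chosen focal service.
type node struct {
    service  string
    children []*node
}

func collectPaths(n *node, prefix []string, out *[][]string) {
    path := append(append([]string{}, prefix...), n.service)
    if len(n.children) == 0 {
        *out = append(*out, path)
        return
    }
    for _, c := range n.children {
        collectPaths(c, path, out)
    }
}

func pathsThrough(paths [][]string, focal string) [][]string {
    var result [][]string
    for _, p := range paths {
        for _, s := range p {
            if s == focal {
                result = append(result, p)
                break
            }
        }
    }
    return result
}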
How can we use this technique in practice? Figure 9.3 shows an example
constructed from production data at Uber, with service names obfuscated. Most
services shown in the graph have many more immediate neighbors in practice,
and if those neighbors were included in the diagram, it would quickly become too
busy and unusable. Instead, the diagram is built only from traces that are passing
through a selected focal service, in this case, service shrimp. By doing that, we can
get a much clearer picture of the true dependencies of this service, both upstream
and downstream, and not only immediate neighbors.
Figure 9.4: Investigating whether the dingo service depends on the dog service using path-aware filtering,
by only showing the paths passing through the dog service (left) and through the dingo service (right).
All paths are passing through the focal service, shrimp, shown in the thicker line.
The tool applies the filter by graying out the services in the diagram that are not
encountered in any traces passing through the selected service. It makes it obvious
that only requests originating from the seagull service ever reach the dog service,
while the requests originating from the dingo service have a very different
call graph.
The same tool can be switched to display the dependency graphs at the
endpoint level, rather than at the service level shown in the preceding figures,
by creating standalone nodes for every service and endpoint combination. These graphs
are typically much larger, such that their visualization again becomes more
difficult and usually requires horizontal scrolling. However, they provide
a great level of detail when one needs to fully understand the interactions
between the services.
Lastly, these graphs can be annotated with performance metrics, such as requests
per second going through a particular path or latency percentiles. Performance
annotations can quickly highlight problematic areas that require investigation.
Service graphs, even the basic pair-wise graphs, can illustrate an excessively high
degree of connectivity between the services, the nearly fully connected "distributed
monolith" anti-pattern shown in Figure 9.5, and can be very effective in highlighting
these architectural issues and pointing toward ways of fixing them.
For example, we may want to ensure higher affinity between services that belong to
the same business domains, such as payments or order fulfilment, while at the same
time reducing the connectivity between services in different domains and instead
proxying all requests through well-defined API gateways (Figure 9.5).
Figure 9.5: (A) Application starts as a monolith. (B) Application evolves into microservices,
but without clear boundaries: an almost fully connected "distributed monolith". (C) Application
is organized into business domains with clear APIs and boundaries.
Performance analysis
Using tracing data for application performance analysis is the classic use case
of distributed tracing. Many different aspects of application performance can be
investigated via tracing. One of the most useful concepts is the critical path of
a request, defined in The Mystery Machine paper [5] as follows:
"The critical path is defined to be the set of segments for which a differential
increase in segment execution time would result in the same differential increase
in end-to-end latency."
To put it differently, if we can increase the duration of a certain span in the trace
without affecting the overall duration of the transaction, then this span is not on
the critical path. When analyzing a trace, we are interested in the spans that are on
the critical path because by optimizing those, we can reduce the overall end-to-end
latency, while optimizing the spans off the critical path is not as useful.
Figure 9.6: Example of a critical path, in red bars striking through the spans,
in an imaginary trace for a social media web site.
Figure 9.6 shows how a critical path might look for an imaginary trace for a social
media web site. Notice that the critical path is a function not only of the overall
trace, but also of the current visualization zoom level. For example, when looking
at the complete trace, only small portions of the api-server span are shown on the
critical path, since the rest of the time is spent in the downstream calls. However, if
we collapse all the details underneath that span, in order to focus on the critical path
through the top three services, then the whole api-server span becomes a part of
the critical path (Figure 9.7).
Figure 9.7: Change in the critical path when all the details below the "api-server" span are collapsed
How can we use the critical path visualization to improve the end-to-end latency
of the request?
• We can ignore all spans that are off the critical path, as optimizing them
will not reduce the latency.
• We can look for the longest span on the critical path. In the preceding
example, if we could reduce the duration of any given span by 50%, then
doing that for the first mysql.Query span will be the most impactful in
reducing the end-to-end latency.
• Finally, we can analyze critical paths across multiple similar traces and
focus our attention on the spans that represent the largest average percentage
of the critical path, or the spans that are found on the critical path more often
than others. Doing that reduces the chances that we spend time optimizing
a particular span in a trace that turns out to be an outlier.
Systematic analysis of the critical paths is important for the long-term health
of the system. At a keynote talk at the Velocity NYC 2018 conference [6],
Jaana B. Dogan, an engineer from Google who works on tracing and performance
profiling problems, coined the term critical path driven development (CPDD).
She observed that the availability of every single service in a large architecture is
not a goal in itself.
It is more important to see the system from the perspective of the end users, and
to ensure that services on the critical path for the end user requests are available
and performant. In CPDD, the engineering practices are centered on identifying the
critical path of the end user requests and ensuring its availability and performance.
Tools like distributed tracing play a major role in achieving these practices.
Figure 9.8: Error markers on the spans usually point to a problem in the execution
Figure 9.9: Look for the longest span on the critical path as the first candidate for optimization
Figure 9.10: A gap between the span in the execution often indicates missing instrumentation
The approaches to avoid the "staircase" pattern (a sequence of spans executed one
after another, each starting only after the previous one finishes) depend on the
specific scenario. In the case of the HotROD example, there was no restriction in
the business logic that required us to load driver records from Redis one at a time,
so all the requests could have been parallelized. Alternatively, many queries can be replaced with
bulk queries or joins on the database side, to avoid the individual sub-queries. It
is really surprising how often this simple issue occurs in practice, how easy it is
to spot with tracing, and how great the resulting performance improvements are,
going from seconds or tens of seconds to sub-second latency, for example, on screens
like menus or catalogues.
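A minimal sketch of the parallelization fix in Go, using golang.org/x/sync/errgroup; the Driver type and the redisClient.GetDriver call are hypothetical stand-ins for the HotROD code:

// Instead of fetching driver records one at a time (the "staircase"),
// issue the lookups concurrently and wait for all of them.
func getDrivers(ctx context.Context, ids []string) ([]Driver, error) {
    drivers := make([]Driver, len(ids))
    g, ctx := errgroup.WithContext(ctx)
    for i, id := range ids {
        i, id := i, id // capture loop variables
        g.Go(func() error {
            d, err := redisClient.GetDriver(ctx, id) // hypothetical client
            if err != nil {
                return err
            }
            drivers[i] = d
            return nil
        })
    }
    if err := g.Wait(); err != nil {
        return nil, err
    }
    return drivers, nil
}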
What could cause a series of spans to finish at exactly the same time? One
possible explanation is when the system supports timeouts with cancellations. In
Figure 9.12, the top-level span may have been waiting for the four tasks to finish, but
since they did not complete in the allotted timeframe, it canceled them and aborted
the whole request. In this scenario, we may want to tune the timeout parameter, or
to investigate why the individual work units were taking longer than anticipated.
Figure 9.12: Spans finishing at exactly the same time; color us suspicious.
Another example where we can observe this pattern is when there is a resource
contention and all the requests are waiting on some lock, such as a long-running
database transaction from another request that locked the table. Once the lock is
released, our units of work are able to complete quickly. We may want to investigate
what it is that is blocking all these spans, by adding additional instrumentation.
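The timeout-with-cancellation scenario is easy to express with Go's context package (a sketch; doTask is a placeholder that is expected to return promptly once ctx is canceled):

// All four tasks share a parent context with a 500ms deadline. If any of them
// takes longer, they are all canceled at the same instant, so their spans
// finish at (almost) exactly the same time.
ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
defer cancel()

var wg sync.WaitGroup
for i := 0; i < 4; i++ {
    wg.Add(1)
    go func(i int) {
        defer wg.Done()
        doTask(ctx, i) // placeholder unit of work
    }(i)
}
wg.Wait()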
Exemplars
The techniques we discussed in the previous section typically assume that we
are able to obtain a trace that is representative of the problem we are trying to
solve. Let's talk about how we can even find those traces among the millions
that a tracing infrastructure like Jaeger at Uber collects in just a few hours.
If we know that we are looking for requests taking longer than one second, and we
know during which time that abnormal latency was observed, for example, from
our metrics dashboard, then we can query the tracing system for representative
traces, assuming our sampling rates were high enough to capture them. For example,
the search panel in the Jaeger UI allows us to specify the exact time range and the
duration of the spans. However, doing so manually is a tedious process and not
something we want to be attempting during an outage when every minute counts.
Combining time series graphs with trace exemplars not only facilitates discovering
relevant traces that are representative of a performance degradation, but it is also
a great way of educating engineers about the capabilities of tracing systems, by
surfacing the information in their on-call workflows and dashboards.
Latency histograms
As the systems grow more complex, even monitoring high percentiles of latency
is not enough. It is not uncommon to see requests to the same endpoint of a service
exhibit radically different performance profiles depending, for example, on the caller
service or other dimensions we may want to associate with the request (such as
a customer account). These performance profiles represent different behaviors in
the distributed systems, and we should not be measuring the performance with
a single number (even if it's a high-percentile number); we should be looking at
the distribution of that number. Latency histograms are a practical approximation
of the real performance distribution that we can plot based on the tracing data.
Different behaviors of the system often manifest themselves in multi-modal
distribution, where instead of the classic bell shape of the normal distribution,
we get multiple humps.
We can see that the distribution in the histogram is multi-modal, with a couple
of humps on the short end (1-3ms), a large hump in the middle (70-100ms), and
a long tail going all the way up to 15 seconds. The tool supported the ability to
shift time by adjusting the sliders on the timeline at the top. Most importantly,
the tool allowed selecting a portion of the distribution and getting a breakdown
of the number of calls by the upstream caller, as well as the ability to view sample
traces exhibiting the selected latency.
The tool was an early prototype, but it allowed us to visualize and investigate
some abnormal behaviors in the system, such as one specific caller often causing
slow responses because it was querying for rarely used data that was causing
cache misses.
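A histogram like the one described can be computed directly from tracing data. A minimal sketch, assuming the root span durations for the selected service and endpoint have already been fetched from the trace storage (the bucket boundaries are illustrative):

// Bucket root-span durations into a simple latency histogram; the final
// bucket counts the overflow, that is, the long tail.
func latencyHistogram(durations []time.Duration, buckets []time.Duration) []int {
    counts := make([]int, len(buckets)+1)
    for _, d := range durations {
        placed := false
        for i, b := range buckets {
            if d <= b {
                counts[i]++
                placed = true
                break
            }
        }
        if !placed {
            counts[len(buckets)]++
        }
    }
    return counts
}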
Figure 9.15: Analyzing latency histograms using Live View from Lightstep.
The filter box at the top allows slicing and dicing the data by many dimensions.
Reproduced with permission from Lightstep, Inc.
Long-term profiling
Performance optimization is a job that is never done. Applications are constantly
evolving, developing new features, and adapting to new business requirements.
This adds complexity and introduces new, often unexpected behaviors into the
system. Optimizations done previously may no longer apply, or be offset by
performance degradation elsewhere, due to new interactions. The discipline
of long-term profiling allows us to keep the degradations in check, detect
them early, and fix them before they become a real problem.
Monitoring trends via aggregate tracing data has a distinct advantage over
traditional measurements because in addition to informing us about degradations,
the aggregates also provide insights into the root cause of those changes. Naoman
Abbas, an engineer at Pinterest, gave a talk at Velocity NYC 2018 where he
presented two analysis techniques that the company applies to understand
trends in performance [9].
One of them is an offline analyzer that takes as input a certain description of the trace
shape and two timeframes, and runs aggregation over each timeframe to calculate
certain cumulative metrics, such as number of traces observed, number of services
involved, average overall latency, average self-latency of individual operations,
and so on. By comparing the resulting numbers, the engineers can make hypotheses
about what changes in the architecture may be responsible for the degradation, then
they can test those hypotheses by rerunning the analyzer with more detailed filtering
criteria.
The second approach involves real-time extraction of features from fully assembled
traces, such as cumulative time spent in the backend spans versus cumulative
time spent waiting on the network, or extracting additional dimensions from the
traces that can be later used for filtering and analysis, such as the type of client
application executing the request (Android or iOS app), or the country where the
request originated from. We will discuss this technique in more detail in Chapter 12,
Gathering Insights with Data Mining.
Summary
In this chapter, I just scratched the surface of all the possibilities of using tracing
data for the analysis and understanding of complex distributed systems. With
end-to-end tracing finally getting mainstream adoption, I am certain that many
more exciting and innovative techniques and applications will be developed
by software engineers in the industry and computer scientists in academia. In the
following chapters, I will cover more ideas; some based on the tracing data itself,
and others made possible by the tracing infrastructure.
References
1. Ben Sigelman. OpenTracing: Turning the Lights on for Microservices.
Cloud Native Computing Foundation Blog: https://www.cncf.io/
blog/2016/10/20/opentracing-turning-the-lights-on-for-
microservices/.
2. Weaveworks Scope: https://github.com/weaveworks/scope.
3. Kiali: observability for the Istio service mesh: https://kiali.io.
4. Vizceral: animated traffic graphs: https://github.com/Netflix/vizceral.
5. Michael Chow, David Meisner, Jason Flinn, Daniel Peek, Thomas F. Wenisch.
The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet
Services. Proceedings of the 11th USENIX Symposium on Operating Systems
Design and Implementation. October 6–8, 2014.
6. Jaana B. Dogan. Critical path driven development. Velocity NYC 2018: https://
conferences.oreilly.com/velocity/vl-ny/public/schedule/
detail/71060.
7. Bryan Boreham. How We Used Jaeger and Prometheus to Deliver Lightning-Fast
User Queries. KubeCon EU 2018: https://youtu.be/qg0ENOdP1Lo?t=1094.
8. Ben Sigelman. Performance is a Shape, Not a Number. Lightstep Blog, May 8,
2018: https://lightstep.com/blog/performance-is-a-shape-not-a-
number/.
9. Naoman Abbas. Using distributed trace data to solve performance and operational
challenges. Velocity NYC 2018: https://conferences.oreilly.com/
velocity/vl-ny/public/schedule/detail/70035.
Distributed Context Propagation
As it turns out, there are many other useful applications of context propagation
that can be implemented on top of tracing infrastructure. In this chapter, we will
look at several examples. However, first I want to address one question you may
already have: if end-to-end tracing, as well as other functionality, must be built
"on top of" context propagation, shouldn't context propagation be a separate,
underlying instrumentation layer, instead of being bundled with tracing APIs?
As we will see in the following section, the answer is yes, theoretically it can be
separate, but in practice it is more nuanced and there are good reasons for having
it bundled with tracing.
Despite their massive potential and usefulness, cross-cutting tools like these are
challenging to deploy in existing distributed systems, especially those based on microservices, because
they often require modifications to the application source code, similar to tracing
instrumentation. The changes can often be logically broken into two parts: the
code used to propagate metadata, and the logic of a specific cross-cutting tool.
The context propagation is usually independent of the tool; it only depends on the
structure and concurrency model of the application, for example, threads, queues,
RPC, and messaging frameworks. The cross-cutting tool logic is usually concerned
with the exact semantics of the metadata the tool needs to propagate, but not with
the propagation mechanism itself.
The Tracing Plane separates the metadata propagation from the cross-cutting tool
instrumentation by providing a layered architecture. At the top of the architecture
is the Cross-Cutting Layer, which represents the instrumentation of the actual
tools, such as end-to-end tracing. Each tool has its own metadata that needs to be
propagated, for example, trace and span IDs for tracing. The tools define a schema
for their metadata using Baggage Definition Language (BDL), which resembles
the protocol buffers definition language. A tracing tool like Jaeger may define
the baggage schema as follows:
bag TracingTool {
int64 traceID = 0;
int64 spanID = 1;
bool sampled = 2;
}
The Tracing Plane project provides a compiler from BDL to different programming
languages. The compiler creates interfaces for accessing and manipulating the data
in the baggage, for example:
tt := TracingTool.ReadFrom(baggageCtx)
tt.SetSpanID(123)
The Baggage Layer is the top layer of the Tracing Plane. It provides the cross-cutting
layer with access to the metadata in a structured way, performs encodings for data
types into cross-platform binary formats, handles nested data types, and allows
multiple cross-cutting tools to keep their own metadata in separate namespaces
within a single baggage context (multiplexing). The Baggage Layer is optional
because the top-level tool is free to access the lower Atom Layer directly, but
then the tool would have to operate on low-level binary data.
The atom layer is the core context propagation layer. It has no knowledge of the
semantics of the metadata stored in the baggage context, which it treats as opaque
binary data. The atom layer exposes five operations:
Serialize(BaggageContext): Bytes
Deserialize(Bytes): BaggageContext
Branch(BaggageContext): BaggageContext
Merge(BaggageContext, BaggageContext): BaggageContext
Trim(BaggageContext): BaggageContext
The first two operations are used for encoding and decoding baggage when
execution jumps between processes, for example, as part of an RPC request via
HTTP headers. The Branch and Merge operations are used when execution splits
into multiple branches (local forks or outgoing RPC requests) and then joins back.
The Merge operation has specific semantics of merging baggage items from two
contexts, which I will not discuss here (please refer to the paper).
Trim is used when there are constraints on the size of the baggage context, for
example, when communicating over a protocol that restricts the size of the request
metadata (a common occurrence in legacy systems and proprietary protocols).
The operations of the atom layer are used by the Transit Layer. This layer itself is
not part of the Tracing Plane framework. Instead, it is the actual instrumentation
that is written by the application and framework developers to manipulate the
baggage context. These developers know the ins and outs of their application or
framework; the concurrency semantics of threading and queueing models; details
of RPC implementations, and so on. Yet they do not need to know anything about
the contents of the baggage context, which allows different cross-cutting tools to be
built on top of the instrumentation in the transit layer.
If you have read Chapter 4, Instrumentation Basics with OpenTracing, of this book,
you will probably see some similarities with OpenTracing:
• The Inject() and Extract() operations in the Tracer interface are similar
to Tracing Plane's Serialize() and Deserialize() operations in the Atom
Layer (see the sketch after this list). To be completely agnostic to the transport
protocols, the Tracing Plane uses only binary encoding of the baggage, while
OpenTracing allows text-based representations.
• Starting a new child span in OpenTracing is somewhat equivalent to the
Branch() operation because the new span receives its own copy of the
baggage that is propagated independently of the parent. OpenTracing
does not support reverse propagation (for example, via RPC response
headers), and does not define clear semantics of baggage merging when
a span is created with more than one parent reference, so there is no
equivalent of the Merge() operation. The Trim() operation is also not
defined explicitly in OpenTracing.
• The span API maps to the cross-cutting layer specialized to the domain
of end-to-end tracing.
• The OpenTracing instrumentation inside the applications corresponds to the
transit layer. It uses the combination of inject and extract methods to encode
or decode the context to and from the wire formats, and the scope and scope
manager APIs to propagate the context in-process (please refer to Chapter 4,
Instrumentation Basics with OpenTracing).
• The SetBaggageItem() and GetBaggageItem() methods on
the span roughly correspond to the rest of the atom layer. Since
OpenTracing baggage only supports strings, there is no equivalent
of the baggage layer with complex data types and namespaces for
different metadata.
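To make the first analogy concrete, this is roughly what the Inject() and Extract() calls look like with the OpenTracing Go API (github.com/opentracing/opentracing-go and its ext package) for an HTTP request; error handling is omitted:

// Client side: serialize the current span context into the outgoing HTTP headers.
carrier := opentracing.HTTPHeadersCarrier(httpReq.Header)
_ = tracer.Inject(span.Context(), opentracing.HTTPHeaders, carrier)

// Server side: deserialize the context from the incoming request and use it
// as the parent of the server-side span.
spanCtx, _ := tracer.Extract(
    opentracing.HTTPHeaders,
    opentracing.HTTPHeadersCarrier(req.Header),
)
serverSpan := tracer.StartSpan("handle-request", ext.RPCServerOption(spanCtx))
defer serverSpan.Finish()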
A reasonable question at this point would be, why was OpenTracing (as well
as other tracing APIs) not implemented with a similar layered architecture as the
Tracing Plane? Ironically, even though it was not using this architecture (which was
not invented at the time), the very first iteration of what later became OpenTracing
was, in fact, called distributed context propagation (DCP). Later, it was renamed
to OpenTracing, as the authors realized that the main reason for the project was to
make "distributed tracing", first and foremost, more accessible to software developers
by reducing the barrier to entry, which is typically the complexity of the white-box
instrumentation.
The tracing practitioners working on the OpenTracing API felt that such a general-purpose,
layered context propagation API would further complicate an already non-trivial tracing instrumentation. On the other
hand, focusing on the tracing instrumentation alone, while still providing the option
for baggage propagation, was a much easier sell to organizations looking to adopt
distributed tracing. As we will see later in this chapter, a lot of cross-cutting tools
are still possible to implement, even with the limited version of baggage support
available in OpenTracing and other tracing APIs.
Pivot tracing
Pivot Tracing [2] is another fascinating project from Brown University that
won the Best Paper Award at SOSP 2015 (Symposium on Operating Systems
Principles). It provides dynamic causal monitoring for distributed systems by
allowing users to define, at runtime, arbitrary measurements at one point in the
system, and then select, filter, and group these measurements by events in another
part of the system for reporting.
The query from the paper's motivating example is applied to HDFS data nodes that are processing requests from different
clients running HBase, Map-Reduce, and direct HDFS clients. The query involves
the data from two trace points: one high in the stack, called ClientProtocols,
which captures the type and process name of the client, and the other running on
the data nodes at the bottom of the stack, called DataNodeMetrics, which collects
various statistics, including the number of bytes read from disk for a given request
(incrBytesRead).
The query groups all requests by the client.procName and calculates the total
amount of disk usage per client. It looks pretty similar to the HotROD example, but
in HotROD, we had to not only hardcode the calculation of the time spent, but also
manually attribute it to two parameters in the metadata: session ID and customer
name. If we wanted to do the aggregation by a different parameter, for example, by
the User-Agent header from the browser, we would have to change the code of the
route service. In Pivot Tracing, we only need to change the query!
The happened-before join allows the joining of only those events that are causally
related, in this case client requests causing disk reads. In Pivot Tracing implementation,
the happened-before joins were restricted to events occurring in the same distributed
transaction, but in theory, they could be extended to a wider definition of causality,
such as events from one execution influencing another execution.
The paper demonstrates how Pivot Tracing was used to diagnose a performance
issue in HDFS and discover a bug in the implementation. Due to the dynamic nature
of Pivot Tracing queries, the authors were able to iteratively issue more and more
specific queries to the system, grouping the results, and throughput metrics by
various dimensions, until they were able to find the root cause of the issue and
demonstrate the software bug. I recommend reading the paper for a detailed
walkthrough of that investigation.
Let's discuss how Pivot Tracing is able to achieve all that. The implementation was
targeted at systems implemented in Java, which allows dynamic instrumentation of
the applications via byte code manipulation and the injection of trace points. Trace
points emit events that contain certain attributes, such as values of the in-scope
variables. Pivot Tracing can also work with the existing (permanent) trace points
in the code. A query is evaluated through the following workflow:
1. The events of the trace points define a vocabulary for the queries, for
example, events ClientProtocols and DataNodeMetrics.incrBytesRead,
and their attributes, procName and delta, used in the preceding query.
2. The operator constructs a query that they want to evaluate in the system
using the supported vocabulary.
3. The Pivot Tracing frontend analyzes the query and compiles it to an
intermediate representation called "advice" that is distributed to Pivot
Tracing agents embedded in the applications.
4. The agent maps the instructions in advice to code that it installs dynamically
at relevant trace points.
5. When the execution passes through trace points, they execute the code
from advice. Certain instructions in advice tell the trace points to pack
certain attributes of the observed events into the metadata context and
propagate it with the execution via baggage, for example, in the query the
procName attribute is packed by the first invocation of the ClientProtocols
trace point and later accessed (unpacked) by the incrBytesRead trace point
to produce a data tuple.
6. Other instructions in advice may tell the trace points to emit data tuples.
7. Data tuples are aggregated locally and streamed over the message bus to
the Pivot Tracing backend.
8. The frontend performs final aggregations and produces reports.
It is interesting to note that in general, implementing a happened-before join
could be very expensive, requiring all tuples to be aggregated globally in the cluster
prior to evaluating the join. Pivot Tracing greatly simplifies this process by relying
on baggage to capture and propagate relevant group-by attributes (at the cost of
increased request size), so that the actual join is performed implicitly, when the
attributes are extracted from baggage and included in the emitted tuples.
Pivot Tracing does not implement its own baggage propagation. Instead, it relies
on the Tracing Plane functionality that we discussed earlier. However, from the
description of the algorithm, it is easy to see that its requirements on the baggage
mechanism are pretty minimal and can be easily satisfied by the OpenTracing
implementation, for example, by encoding a single baggage item with the key
pivot-tracing and a JSON string as the value. There are some edge cases that
the Tracing Plane handles via the merging of baggage contexts, such as parallel
executions that both increment the same counter, but in many systems these merges
are not required (as long as we can represent the execution via the span model).
Since Pivot Tracing is able to install instrumentation dynamically, it can co-exist
with an OpenTracing instrumentation and make use of the OpenTracing baggage
for its query evaluation.
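A minimal sketch of that idea with OpenTracing baggage, packing the attributes as a JSON string under a single (hypothetical) key; emitTuple and bytesReadDelta are placeholders:

// At the upstream trace point: pack the attribute into the baggage.
attrs := map[string]string{"procName": procName}
encoded, _ := json.Marshal(attrs)
span.SetBaggageItem("pivot-tracing", string(encoded))

// At the downstream trace point: unpack it and emit the data tuple.
var unpacked map[string]string
_ = json.Unmarshal([]byte(span.BaggageItem("pivot-tracing")), &unpacked)
emitTuple(unpacked["procName"], bytesReadDelta)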
Chaos engineering
To address the massive impact of system downtime on business revenues, many
organizations are adopting Chaos Engineering in order to gain confidence that their
systems are fault-tolerant, that is, built to anticipate and mitigate a variety of software
and hardware failures. Many organizations are implementing internal "failure as
a service" systems, such as Failure Injection Testing (FIT) [6], Simian Army [7] at
Netflix, uDestroy at Uber, and even commercial offerings like https://gremlin.com.
Unfortunately, having the infrastructure for fault injection is only half the
battle. The more difficult part is coming up with adequate failure scenarios:
combinations of faults across a distributed system. We can view this as a general
search problem: funding a set of fault injection scenarios (preferably a minimal set)
that exercise all failure modes that exist in the application. The number of distinct
scenarios is exponential in the number of potential faults, and therefore is intractable
for exhaustive search.
The most common approaches for creating failure scenarios are random search
and programmers-guided search. Random search has the advantage of simplicity
and generality, but it is unlikely to discover deep or cascading failures involving
a combination of rare conditions, and it often wastes a lot of resources by testing
scenarios that are redundant or could be proven to not affect the end user.
Leveraging the intuition of domain experts is difficult to scale to large architectures.
In organizations like Uber, with over 3,000 individual microservices, there are hardly
any engineers, even the most senior ones, who can keep the full complexity of the
system in their head, and anticipate non-trivial failures.
The full description of the lineage-driven fault injection (LDFI) technique is beyond the scope of this book.
However, the existing end-to-end tracing infrastructure at Netflix played a critical
role in enabling it. The LDFI service was continuously monitoring distributed traces
collected in production and using them to build long-lived models of dependencies
within individual executions and redundancy across them.
The models were used to construct lineage graphs used by LDFI to reason backwards
(from effects to causes) about the impact that some combination of faults could
have on a successful outcome. Those combinations that could be proven to
have the potential to cause user-visible failure were then tested using the FIT
infrastructure [7]. The requests matching the criteria for the failure scenarios were
decorated with metadata describing the faults under test, such as adding latency
to specific service calls or failing those calls completely. These fault instructions
encoded in the metadata were passed through the call graph using, you guessed
it, distributed context propagation.
The use of collected end-to-end traces to guide the exploration of the failure search
space is very interesting. However, the use of metadata propagation for delivering
fault instructions to the services is of particular interest in the context of this chapter.
Many "failure as a service" systems support execution scripts that can target specific
services, service instances, hosts, and so on, for fault injection. However, many non-
trivial fault scenarios cannot be expressed in these terms because they need the faults
to be scoped to specific transactions, not just specific components in the system.
The time sequence diagram in Figure 10.3 illustrates a replication bug in Kafka
that was reproduced by the LDFI technique. Three Kafka replicas are configured
to handle a single partition, but when they send their membership message (M) to
the Zookeeper, a temporary network partition causes the messages from Replicas
B and C to be lost, and Replica A becomes the leader, while believing that it is the
sole surviving replica. The client (C) is also informed via a message (L)
that Replica A is the leader. When the client writes (W) the data to Replica A, the
write is acknowledged (A) as successful. Immediately after that, the Replica A
crashes, thus violating the message durability guarantee.
It is easy to see that even with fault instructions propagated via metadata, this
particular bug is very hard to reproduce because a number of faults must be
carefully orchestrated in time to simulate the failure scenario and there are at least
four different RPC chains involved. However, using metadata propagation is still
a very valuable technique for delivering targeted fault instructions to the system
components in the context of specific requests.
Traffic labeling
At a high level, adding metadata to request context and propagating it through the call
graph is a way of partitioning the overall traffic to the application along a number of
dimensions. As an example, if we label each external request with the type of company
product it represents (for Google it could be Gmail, Docs, YouTube, and so on, and for
Uber it could be Ridesharing, Uber Eats, Uber Bikes, and so on) and propagate it in
the metadata, then we can get a pretty accurate picture of how much traffic served by
a data center is attributed to each product line. Strictly speaking, the Pivot Tracing and
LDFI techniques I discussed earlier can also be considered as partitioning of the traffic,
but the values they pass through metadata are very complex and high cardinality.
In this section, I will talk about traffic labeling that uses low-cardinality dimensions.
Testing in production
Testing in production is a common practice today because given the complexity
of the internet-scale distributed systems, it is often impossible to provide the same
level of coverage of different edge cases and the variety of user behaviors that we
might observe in production by simulating them in the staging environments. Some
testing requests may only read the data, while others make changes to the system
state, for example, a simulated rider taking a simulated Uber ride.
Some services in the system may identify the requests as test traffic by looking at the
data involved, for example, a simulated Uber rider account is likely to have a special
marker in the database. However, many services may not have that knowledge. For
example, a generic storage layer, while having access to the test account marker,
would not know to look for it in the data. Therefore, it is useful to label the traffic
generated by test accounts from the root of the call graph by propagating a label
like tenancy, indicating whether the traffic is synthetic or real production. There
are multiple ways that the system components may use such a label (a small code
sketch follows this list):
• Let's assume you own some downstream service and you provision it to
handle a certain level of traffic. You will probably set up the monitoring of
the traffic volume coming to your service and define alerts when that volume
exceeds a certain threshold (even if you use auto-scaling you may want to do
that for cost reasons).
Now imagine that some upstream service a couple of levels above you in
the stack is doing a capacity or resiliency test by generating a lot of synthetic
test traffic. Without using something like tenancy metadata to recognize
that traffic, your service will not be able to tell if the increase in traffic is due
to real production growth or just synthetic tests. Your alerts might start firing
for no reason! On the other hand, if you know that the synthetic traffic will
be labeled accordingly, you can define your alerts to only fire on the growth
of real production traffic and ignore the synthetic spikes. In Chapter 11,
Integration with Metrics and Logs, we will look at concrete ways this can
be achieved very easily.
• Some components may be configured to recognize the traffic with test
tenancy and automatically switch to read-only mode or to direct the writes
to a different database.
• Some components may not want to serve synthetic traffic from production
clusters at all (that is, they are not quite ready for testing in production).
The test tenancy metadata can be used by the routing layers to redirect
the requests to a staging cluster.
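A minimal sketch of a service reacting to the tenancy label, assuming it is propagated as a baggage item named tenancy; the database handles, the metrics counter, and the business logic are hypothetical:

// Route writes for synthetic (test) traffic to a separate database, and tag
// the request metric with the tenancy so alerts can ignore synthetic spikes.
tenancy := span.BaggageItem("tenancy") // for example, "production" or "synthetic"
if tenancy == "" {
    tenancy = "production" // default when the label was not propagated
}
db := productionDB
if tenancy == "synthetic" {
    db = testDB
}
requestCounter.WithLabelValues(tenancy).Inc() // e.g., a Prometheus counter
return db.SaveOrder(ctx, order)               // placeholder business logic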
Debugging in production
If we embrace the philosophy of testing in production, in part because we cannot
construct staging environments that reproduce production, then we should also talk
about debugging in production, and especially debugging microservices. No matter
how powerful end-to-end tracing is, it is still limited to the information collected
by the preprogrammed trace points. Sometimes we need to inspect the full state
of the application, step through code, and change some variables, to understand
the issue or find a bug. This is where the traditional debuggers shine. However, in
a microservices-based application, the state is spread out across many processes, and
each process has many states for all of the concurrent requests it is processing at any
given time. If we want to debug the behavior of a particular request, we may need
to attach debuggers to instances of different services, possibly in different languages,
and we also need to figure out which instance of a given service is going to receive
our request.
One of the very interesting projects that helps to solve this problem is the Squash
debugger [9], developed by Solo.io, Inc. Squash consists of three components:
1. The Squash user interface is just a plugin to popular IDEs, like Visual Studio
Code or IntelliJ. You can set breakpoints in your microservices as if you're
developing locally, and the Squash plugin will coordinate with the Squash
server to install those breakpoints in the services running in the production
cluster.
2. The Squash server holds information about the breakpoints and orchestrates
Squash clients.
3. The Squash client is a daemon process that runs alongside the applications,
contains the binaries for the debuggers, and allows the attaching of the
debuggers to the running microservice process.
Squash integrates with service mesh Istio by providing a filter for the Envoy proxy
that can initiate a debugging session for requests that contain certain markers, for
example, a special header. To isolate the instance from other production requests,
Squash can clone the full state of the process into another instance and attach
a debugger to that private instance, without affecting the rest of the production
traffic. I recommend watching some of the talks about Squash for more information.
How is this related to traffic labeling and context propagation? We do not want to
start hitting the breakpoints on random production requests, possibly affecting real
users. At the same time, we may need to hit breakpoints in multiple microservices
processing our hand-crafted request. This coordination becomes much easier with
metadata propagation. We can define the breakpoints to be only enabled when an
HTTP request has a particular baggage item encoded in the request headers and
issue the top-level request to the system with that baggage item set, for example,
using Jaeger baggage syntax:
$ curl -H 'jaeger-baggage: squash=yes' http://host:port/api/do-it
If the services are instrumented with OpenTracing and Jaeger, this baggage item will
be propagated through the call graph automatically and trigger the breakpoints. To
allow multiple developers to debug in production, the baggage item may be set to
some token or username, to ensure that each developer only gets their breakpoints
triggered by their hand-crafted request.
Developing in production
If we do not have a staging environment approximating production, then developing
microservices that interact with other services is also a challenge. Even if we do have
a staging cluster, the process of deploying the code there is usually not very snappy
(building a container, uploading it to the registry, registering a new service version,
and so on). A much faster way is if we can run the service instance locally and proxy
the production traffic. For the downstream services, it is easier: we can just set up
the tunnels. But what if our service needs upstream services from production to get
a sensible request? What if we want to use the production mobile app to execute
a high-level user workflow and have our local instance serve a portion of that
execution? Fortunately, there are solutions for this problem as well. For example,
Telepresence [10] integrates with Kubernetes and can replace a production service
with a proxy that forwards all requests to another service instance that we may be
running on a laptop, from our favorite IDE, with our favorite debugger attached.
The diagram in Figure 10.4 illustrates this approach. The developer starts a local
instance of the service they want to debug (Service X), possibly from an IDE and
with a debugger attached. The IDE plugin communicates with a control server in
production, which could be a Service Mesh Control Plane (for example, Istio) or
a dedicated debugging server that the routing proxy recognizes.
The IDE sends instructions about which requests should be intercepted, for example,
only those that have the user=X label propagated via metadata. The user then makes
a regular production request, even from a mobile app. The API server authenticates
the user and stores user=X in the baggage, which is then used by the routing proxy
(or by a library embedded in the application) to intercept those specific requests
and forward them to the service instance on the developer's laptop.
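The same mechanics can be sketched roughly as follows, assuming the user=X label is carried as an OpenTracing baggage item; the interceptRules and forward() names are illustrative and do not refer to any specific product:

// hypothetical: the API server stores the authenticated user in the baggage
tracer.activeSpan().setBaggageItem("user", authenticatedUserName);

// hypothetical: the routing layer (or an embedded library) checks the baggage
// against the intercept rules registered from the developer's IDE
String user = tracer.activeSpan().getBaggageItem("user");
String target = interceptRules.developerInstanceFor(user);
if (target != null) {
    forward(request, target);      // redirect to the instance on the laptop
} else {
    forward(request, production);  // keep the request in production
}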
Similar to the Squash example, the key to this approach is knowing which traffic
should be redirected to the local instance, and which traffic should remain in
production. Using traffic labeling via distributed context propagation provides
an elegant solution to this problem.
Summary
In this chapter, we discussed how the metadata propagation mechanism can
be separated from the tracing instrumentation that depends on it, for example,
using the approach from the Tracing Plane, and why this is not always done in
practice. We reviewed a number of cross-cutting techniques and tools for solving
the problems of monitoring, debugging, and testing distributed systems that are
not dependent on tracing directly, but depend on distributed context propagation.
Having tracing instrumentation in the distributed systems makes these additional
tools much easier to implement.
We briefly touched upon using traffic labeling to affect application metrics and alerts.
In the next chapter, we will cover that in more detail, as well as other integrations
between tracing, metrics, and logging systems.
References
1. X-Trace: http://brownsys.github.io/tracing-framework/xtrace/.
2. Pivot Tracing: http://pivottracing.io/.
3. Brown Tracing Plane: http://brownsys.github.io/tracing-framework/
tracingplane/.
4. Jonathan Mace, Rodrigo Fonseca. Universal Context Propagation for Distributed
System Instrumentation. Proceedings of the 13th ACM European Conference on
Computer Systems (EuroSys '18).
5. Jonathan Mace. A Universal Architecture for Cross-Cutting Tools in Distributed
Systems. Ph.D. Thesis, Brown University, May 2018.
6. Kolton Andrus, Naresh Gopalani, Ben Schmaus. FIT: Failure Injection Testing.
The Netflix Tech Blog: https://medium.com/netflix-techblog/fit-
failure-injection-testing-35d8e2a9bb2.
7. Yury Izrailevsky, Ariel Tseitlin. The Netflix Simian Army. The Netflix Tech
Blog: https://medium.com/netflix-techblog/the-netflix-simian-
army-16e57fbab116.
8. Peter Alvaro, Kolton Andrus, Chris Sanden, Casey Rosenthal, Ali Basiri,
Lorin Hochstein. Automating Failure Testing at Internet Scale. ACM
Symposium on Cloud Computing 2016 (SoCC'16).
9. Squash: Debugger for microservices: https://github.com/solo-io/
squash.
10. Telepresence: Fast, local development for Kubernetes and OpenShift
microservices: https://www.telepresence.io/.
Integration with Metrics
and Logs
In this chapter, we will look at many integration points between the three monitoring
tools (metrics, logs, and traces), using a version of our favorite, the Hello application.
We will see how metrics
and logs can be enriched with request metadata, how tracing instrumentation can
be used to replace explicit metrics instrumentation, and how logs and traces can
be bidirectionally integrated with each other.
Metrics are often the cheapest to collect, with the smallest impact on the application
performance, because metrics typically deal with simple numeric measurements
that are heavily aggregated to reduce the data volume; for example, to measure
the throughput of a REST service, we just need an atomic counter that reports
a single int64 value once a second. Very few applications would be adversely
impacted by the cost of reporting such a measurement. As a result, metrics are often
used as highly accurate "monitoring signals" to keep track of the application health
and performance, while at the same time, they are highly ineffective at explaining
performance problems, due to the very same aggregation and lack of context.
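To make the earlier point about cheap collection concrete, here is a minimal sketch, not taken from the book's code, of such a counter in Java; a real service would hand the value to a metrics client rather than print it:

import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ThroughputCounter {
    private final AtomicLong count = new AtomicLong();

    public ThroughputCounter() {
        // report (and reset) the counter once a second
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(
            () -> System.out.println("requests/sec: " + count.getAndSet(0)),
            1, 1, TimeUnit.SECONDS);
    }

    public void onRequest() {
        count.incrementAndGet();
    }
}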
Metrics as a monitoring tool are generally useful for monitoring individual entities,
such as a process, a host, or an RPC endpoint. Since metrics can be easily aggregated,
they can be used to monitor higher-level entities by combining individual time
series, for example, observing throughput or latency of a NoSQL database
cluster by aggregating stats from individual nodes via averages, min or max, and
percentiles. It is a common practice to partition a single logical metric, for example,
an endpoint error count, into multiple time series by adding extra dimensions, such
as the host name, availability zone, or data center name, and so on.
Older metrics protocols, such as StatsD and Graphite, supported time series partitioning
by encoding the dimensions into the metric name as positional arguments, for
example, the host name host123 in the second position: servers.host123.disk.
bytes_free. The Graphite query language allows aggregation via wildcards:
averageSeries(servers.*.disk.bytes_free)
Capturing metrics with extra dimensions provides more investigative power to the
operators, who can narrow down the time series aggregates to specific infrastructure
components. Unfortunately, since most metrics APIs are not context-aware, the
dimensions typically represent static metadata available at the process level, such
as host name, build version, and so on.
Logging frameworks perform no aggregation and report the events as-is, ideally in
the so-called "structured format" that is machine friendly and can be automatically
parsed, indexed, and processed by the centralized logging infrastructure. You will
see an example of this log format in this chapter.
Most logging frameworks record events as a stream of records that may be tagged
with the name of the execution thread (for example, this is a standard practice in
Java), which helps slightly with inferring causal relationships between the events;
however, that practice becomes less useful with the proliferation of frameworks
for asynchronous programming. In general, event correlation is a problem not
solved particularly well by logs as a toolset. The verbosity of the logs can be another
challenge for high-throughput services; to combat it most logging frameworks
support leveled logging where messages are explicitly classified by the developer
as debug, info, warning, error, and so on. It is a common practice to disable any
debug-level logs in production and to keep even the higher levels to an absolute
minimum, especially for successful requests.
Prerequisites
Since we are talking about integrating tracing with logs and metrics, we will run the
backends for all three of these monitoring tools: Jaeger for traces, Prometheus for
metrics, and the Elasticsearch-Logstash-Kibana stack for logs. This section provides
instructions on setting up the environment to run the Hello application.
The other top-level directories specify configurations for the monitoring tools. The
docker-compose.yml file is used to spin up all of them as a group, including the
two microservices from the Hello application. The Hello clients are run separately,
outside of Docker containers.
We start all of them with docker-compose up -d, where the -d flag runs everything
in the background. To check that everything
was started correctly, use the ps command:
$ docker-compose ps
Name Command State
----------------------------------------------------------------------
chapter-11_elasticsearch_1 /usr/local/bin/docker-entr ... Up
chapter-11_formatter-1_1 /bin/sh -c java -DJAEG ... Up
chapter-11_hello-1_1 /bin/sh -c java -DJAEG ... Up
chapter-11_jaeger_1 /go/bin/standalone-linux - ... Up
chapter-11_kibana_1 /bin/bash /usr/local/bin/k ... Up
chapter-11_logstash_1 /usr/local/bin/docker-entr ... Up
chapter-11_prom_1 /bin/prometheus --config.f ... Up
Sometimes Elasticsearch takes a long time to complete its startup process, even
though the preceding ps command will report it as running. The easiest way
to check that it's running is to grep for Kibana logs:
$ docker-compose logs | grep kibana_1 | tail -3
kibana_1 | {"type":"log","@timestamp":"2018-11-
25T19:10:37Z","tags":["warning","elasticsearch","admin"],"pid":1,"mes
sage":"Unable to revive connection: http://elasticsearch:9200/"}
kibana_1 | {"type":"log","@timestamp":"2018-11-
25T19:10:37Z","tags":["warning","elasticsearch","admin"],"pid":1,"mes
sage":"No living connections"}
kibana_1 | {"type":"log","@timestamp":"2018-11-
25T19:10:42Z","tags":["status","plugin:[email protected]","info"],"pid"
:1,"state":"green","message":"Status changed from red to green -
Ready","prevState":"red","prevMsg":"Unable to connect to
Elasticsearch at http://elasticsearch:9200."}
We can see that the top two logs indicate Elasticsearch not being ready, while the last
log reports the status as green. Another setup step creates a Kibana index pattern for
the logstash-* indices; the response confirms that the pattern was created:
{"id":"5ab0adc0-f0e7-11e8-b54c-f5a1b6bdc876","type":"index-pattern","updated_at":"…","version":1,"attributes":{"title":"logstash-*","timeFieldName":"@timestamp"}}
From this point, if we execute some requests against the Hello application,
for example:
$ curl http://localhost:8080/sayHello/Jennifer
Hello, Jennifer!
We should then be able to find the corresponding log records in Kibana:
Figure 11.1: Example of a log message from the Hello application, as displayed in Kibana
When we run the Hello clients, we can see that they repeatedly execute the same HTTP request against
the Hello application, and some of those requests are successful, while others fail.
We will discuss later the meaning of the parameters accepted by the clients.
Figure 11.2: Architecture of the Hello application and its monitoring components and backends
All components of the Hello application are configured with a Jaeger client, a Prometheus
client, and the Logback logging framework with a LogstashTcpSocketAppender
plugin that sends the logs directly to Logstash, which saves them to Elasticsearch.
Kibana is the web UI used to query logs from storage. The Prometheus client
accumulates the metrics in memory, until the Prometheus server pulls them
via an HTTP endpoint. Since the Prometheus server runs inside the networking
namespace created by docker-compose, it is not configured to scrape metrics
from the two clients that run on the host network.
The tracing instrumentation already captures all three signals of the RED method (Rate, Errors, Duration):
• We start a server span for every inbound request, therefore we can count
how many requests our service receives, that is, its throughput or request
rate (R in RED).
• If the request encounters an error, the tracing instrumentation sets the
error=true tag on the span, which allows us to count errors (E in RED).
• When we start and finish the server span, we signal the trace points to
capture the start and end timestamps of the request, which allows us
to calculate the duration (latency) of the request (D in RED).
For example, the following Prometheus query plots the rate of server spans, that is, the request rate:
sum(rate(span_count{span_kind="server"}[1m]))
by (service,operation,error)
Since it would be meaningless to aggregate all the requests across both services
and all their endpoints, we group the results by service name, operation name
(that is, the endpoint), and the error flag (true or false). Since each microservice is
only exposing a single endpoint, grouping by service and operation is equivalent
to grouping by either one alone, but when we include both in the query, the legend
under the graph is more descriptive, as can be seen in Figure 11.3. We can also plot
the 95th percentile of request latency with the following query (using the span_
bucket metric):
histogram_quantile(0.95,
sum(rate(span_bucket{span_kind="server"}[1m]))
by (service,operation,error,le))
Figure 11.3: Four time series representing the successful (top two lines)
and failed (bottom two lines) request rates from the Hello application
This library (the java-metrics library from the opentracing-contrib organization)
implements the OpenTracing API as a decorator around another Tracer
implementation and uses the callbacks from the trace points to calculate span
metrics. Our services first instantiate the normal Jaeger tracer, then wrap it in the
metrics decorator, as can be seen in the TracingConfig class from the lib module:
@Bean
public io.opentracing.Tracer tracer(CollectorRegistry collector) {
Configuration configuration = Configuration.fromEnv(app.name);
Tracer jaegerTracer = configuration.getTracerBuilder()
.withSampler(new ConstSampler(true))
.withScopeManager(new MDCScopeManager())
.build();
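The listing stops before the wrapping step. Assuming the decorator comes from the opentracing-contrib java-metrics library, the rest of the method might look roughly like this sketch, where reporter refers to the PrometheusMetricsReporter configured as shown later in this chapter (this is not the exact code from the book's repository):

    // wrap the Jaeger tracer so that span lifecycle callbacks also update metrics
    return io.opentracing.contrib.metrics.Metrics.decorate(jaegerTracer, reporter);
}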
You may ask, if we can get similar RED metrics for a service using traditional
metrics libraries, why bother doing it via tracing instrumentation? One of the main
benefits is the standardization of the metrics names that we can get from different
services, potentially built on top of different application frameworks; for example,
if we enable metrics directly from the Spring Boot framework, they are very likely
to be named very differently from the metrics emitted from another application
built on top of another framework, such as Dropwizard.
However, metrics emitted by the decorator we used in this exercise are going to
be the same across all frameworks. As a bonus, the service and operation names
used as labels in the metrics will have an exact match to the service and operation
names collected in the actual traces. By correlating the time series with traces in this
way, we can enable bi-directional navigation between traces and time series in the
monitoring UIs, and we can annotate spans in a given trace with accurate statistical
measures, for example, automatically calculating which percentile of latency
distribution a given span represents.
The decorator approach is not the only way to get metrics emitted by the tracing
instrumentation. Although we did not discuss it in Chapter 2, Take Tracing for
a HotROD Ride, the Jaeger tracer in Go has an optional module that can emit
metrics using an observer pattern, rather than a decorator pattern. If we run
the HotROD demo application with the --metrics=prometheus flag and execute
a few car orders from the UI, we can pull the metrics generated for the HTTP
request by the RPC metrics plugin:
$ curl -s http://localhost:8083/metrics | grep frontend_http_requests
# HELP hotrod_frontend_http_requests hotrod_frontend_http_requests
# TYPE hotrod_frontend_http_requests counter
hotrod_frontend_http_requests{endpoint="HTTP-GET-
/",status_code="2xx"} 1
hotrod_frontend_http_requests{endpoint="HTTP-GET-
/",status_code="3xx"} 0
hotrod_frontend_http_requests{endpoint="HTTP-GET-
/",status_code="4xx"} 1
hotrod_frontend_http_requests{endpoint="HTTP-GET-
/",status_code="5xx"} 0
hotrod_frontend_http_requests{endpoint="HTTP-GET-
/dispatch",status_code="2xx"} 3
hotrod_frontend_http_requests{endpoint="HTTP-GET-
/dispatch",status_code="3xx"} 0
hotrod_frontend_http_requests{endpoint="HTTP-GET-
/dispatch",status_code="4xx"} 0
hotrod_frontend_http_requests{endpoint="HTTP-GET-
/dispatch",status_code="5xx"} 0
You saw in Chapter 7, Tracing with Service Mesh, that using a service mesh is another
way of getting standardized metrics out of our microservices, which has a similar
benefit of the consistent labeling of metrics and traces. However, in situations
where deploying a service mesh is not an option, emitting metrics from tracing
instrumentation can be a viable alternative.
Figure 11.4: Four time series representing the successful (top two lines) and failed
(bottom two lines) request rates from the Hello application. The y-axis is the count
of spans in a certain group and the x-axis is the time (minute of the hour).
The top two lines represent the successful requests at each of the services:
span_count{error="false",operation="sayHello",service="hello-1"}
span_count{error="false",operation="formatGreeting",service="forma
tter-1"}
The bottom two lines represent errors (the rate of errors from the hello service
is higher):
span_count{error="true",operation="sayHello",service="hello-1"}
span_count{error="true",operation="formatGreeting",service="formatt
er-1"}
From the chart in Figure 11.4, we know that a certain percentage of requests are
failing, but we do not know why, and the metrics provide little context to understand
the why. In particular, we know that our services are being accessed by two clients
that have different configurations, so it is possible that the clients cause different
behaviors in the application. It would be nice to see that in the chart. However, to do
that we need to label our metrics with metadata that represents the client version,
which is not known to the two microservices from the HTTP requests. This is where
our old friend baggage comes into play. In the lib module, we have a helper service
CallPath, with a single method, append():
public void append() {
io.opentracing.Span span = tracer.activeSpan();
String currentPath = span.getBaggageItem("callpath");
if (currentPath == null) {
currentPath = app.name;
} else {
currentPath += "->" + app.name;
}
span.setBaggageItem("callpath", currentPath);
}
This method reads the baggage item called callpath and modifies it by appending
the current service name. The controllers in both microservices call this method from
their HTTP handlers, for example, in the formatter service:
@GetMapping("/formatGreeting")
public String formatGreeting(@RequestParam String name) {
logger.info("Name: {}", name);
callPath.append();
...
The clients also call this method, which puts the client name, for example,
client-v1, as the first segment of the call path. Note that we have to start the root
span in the clients to have a place to store the baggage item, as otherwise the span
will only be created by the RestTemplate when making the outbound call.
public void run(String... args) {
    while (true) {
        Span span = tracer.buildSpan("client").start();
        try (Scope scope = tracer.scopeManager().activate(span, false)) {
            callPath.append();
            ...
            runQuery(restTemplate);
        }
        span.finish();
        sleep();
    }
}
Finally, this callpath baggage item is being added to the metrics label by the
decorator we discussed in the previous section because we configure the reporter
with the withBaggageLabel() option:
PrometheusMetricsReporter reporter = PrometheusMetricsReporter
.newMetricsReporter()
.withCollectorRegistry(collector)
.withConstLabel("service", app.name)
.withBaggageLabel("callpath", "")
.build();
To see this in action, all we need to do is to add the callpath to the group-by clause
of the Prometheus query:
sum(rate(span_count{span_kind="server"}[1m]) > 0)
by (service,operation,callpath,error)
Figure 11.5: Additional time series shown when we add the "callpath" label to the group-by clause.
The x-axis is the time (minute of the hour) and the y-axis is the count of spans in a certain group.
Unfortunately, without the interactivity and mouse-over popups available in the real
dashboard, it is hard to see what is going on here, so let's look at the raw data that
is available in Prometheus in the Console tab, next to the Graph tab (formatted as
a table for better readability, where the service, operation, error, and callpath
labels are pulled into columns):
#  Callpath                         Service (that emits metric)  Operation       Error  Value
1  client-v2->hello-1               hello-1                      sayHello        false  7.045
2  client-v2->hello-1->formatter-1  formatter-1                  formatGreeting  false  7.058
3  client-v2->hello-1               hello-1                      sayHello        true   0.691
4  client-v2->hello-1->formatter-1  formatter-1                  formatGreeting  true   0.673
5  client-v1->hello-1               hello-1                      sayHello        false  6.299
6  client-v1->hello-1->formatter-1  formatter-1                  formatGreeting  false  6.294
7  client-v1->hello-1               hello-1                      sayHello        true   1.620
We can now see interesting patterns emerging. Let's focus only on rows 3, 4, and
7, which represent failing requests according to the label error=true, and ignore
successful requests in rows 1-2 and 5-6. From the callpath label, we can see that the
requests in rows 3 and 4 originate from client-v2, and in row 7 from client-v1.
The failing requests from client-v1 never reach the formatter service, and they occur
at a higher rate (1.62 failures per second) than the failures from client-v2 (about 0.7
failures per second, rows 3-4).
Requests from client-v2 all appear to fail in the formatter service because the
failure rates in rows 3-4 are almost the same. This fact may not be apparent, but
we know from the architecture of the application that if the formatter service fails,
then the hello service will fail as well. Therefore, if the hello service was failing
independently of the formatter service, it would not have made the call down
the chain and its failure rate would have been higher. In summary, the failing requests
from client-v1 fail inside the hello service itself, while the failing requests from
client-v2 fail in the formatter service, and client-v1 causes failures at roughly twice
the rate of client-v2.
Now that we have deduced that the error patterns are caused by the version of the
client, let's confirm the hypothesis by looking at the code. The failures in the Hello
application are simulated with the help of the ChaosMonkey class in the lib module.
During initialization, it reads two parameters from Java system properties:
public ChaosMonkey() {
this.failureLocation = System.getProperty("failure.location", "");
this.failureRate = Double.parseDouble(System.getProperty
("failure.rate", "0"));
}
The field failureLocation contains the name of the microservice where we want
to simulate the failure. The field failureRate contains the desired probability of
this failure occurring. Before making an HTTP request, the clients are calling the
maybeInjectFault() method on ChaosMonkey that probabilistically stores the
desired failure location in the fail baggage item:
public void maybeInjectFault() {
if (Math.random() < this.failureRate) {
io.opentracing.Span span = tracer.activeSpan();
span.setBaggageItem("fail", this.failureLocation);
}
}
Each of the two microservices then calls the following method while handling a request:
chaosMonkey.maybeFail();
The maybeFail() method compares the current service name with the value
of the fail baggage item, and if there's a match, throws an exception:
public void maybeFail() {
io.opentracing.Span span = tracer.activeSpan();
String fail = span.getBaggageItem("fail");
if (app.name.equals(fail)) {
logger.warn("simulating failure");
throw new RuntimeException(
"simulated failure in " + app.name);
}
}
Finally, the Makefile defines the configuration of the two versions of the client that
control the failure injection mechanism and explain the metrics pattern we observed
in Prometheus:
CLIENT_V1 := $(CLIENT_SVC) \
-Dclient.version=v1 \
-Dfailure.location=hello-1 \
-Dfailure.rate=0.2
CLIENT_V2 := $(CLIENT_SVC) \
-Dclient.version=v2 \
-Dfailure.location=formatter-1 \
-Dfailure.rate=0.1
We see that client-v1 is instructed to induce failures in the hello service for 20%
of the requests, which explains why we never saw the error call path reaching
the formatter service. client-v2 is instructed to induce failures in the formatter
service for only 10% of the requests, which explains the difference in the error rates
we observed.
When the metrics we need cannot be derived from the tracing instrumentation,
the application falls back onto traditional metrics APIs,
for example, by directly calling the client library of the metrics system (for example,
the Prometheus client) or by using an abstraction layer, such as Micrometer in
Java (https://micrometer.io). Most traditional metrics APIs are created without
the support for distributed context, making it harder to annotate the metrics with
additional request-scoped metadata, like the callpath label in our earlier example.
In the languages where the request context is passed around through thread-local
variables or similar mechanisms, it is still possible to enhance the traditional metrics
APIs to extract the extra labels from the context without altering the API. In other
languages, including Go, the metrics APIs need to be enhanced to accept the context
as one of the arguments to the functions that take the measurements; for example, the
popular microservices framework Go kit [6] defines the Counter interface like this:
type Counter interface {
With(labelValues ...string) Counter
Add(delta float64)
}
The Add() function gathers the actual measurements, but it does not accept the
Context object. We can work around that by creating a helper that would extract
the context-scoped label values from the context, for example:
type Helper struct {
    // Labels lists the names of the context-scoped labels to copy into the metric.
    Labels []string
}
This could be a viable approach, but it has the downside that the application
developer must remember to invoke this helper, instead of calling the Add() function
directly on the Counter object and silently losing the context-scoped labels.
Fortunately, newer frameworks, such as
OpenCensus, are being developed as fully context-aware, so that "forgetting" to use
the right function is not an option:
// Record records one or multiple measurements with the same context
// at once. If there are any tags in the context, measurements will be
// tagged with them.
func Record(ctx context.Context, ms ...Measurement) {}
Structured logging
Before we go further, let us briefly talk about structured logging. Traditionally,
the logging frameworks generate the log lines as plain strings, an example of which
you can see in the output of the Hello application clients:
25-11-2018 18:26:37.354 [main] ERROR client.Runner.runQuery - error
from server
25-11-2018 18:26:37.468 [main] INFO client.Runner.runQuery -
executing http://localhost:8080/sayHello/Bender
25-11-2018 18:26:37.531 [main] ERROR client.Runner.runQuery - error
from server
25-11-2018 18:26:37.643 [main] INFO client.Runner.runQuery -
executing http://localhost:8080/sayHello/Bender
While these strings do have a certain structure that can be parsed by the log aggregation
pipeline, the actual messages are unstructured, making them more expensive to index
in the log storage. For example, if we wanted to find all logs with a specific URL, we
would need to express the question as a substring or regex query, as opposed to a much
simpler url="..." query.
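As an illustration (a sketch, not the exact output of our application), the same event captured as a structured record might look like this, with the URL stored in its own field:

{"@timestamp":"2018-11-25T18:26:37.468Z","level":"INFO","logger_name":"client.Runner","message":"executing request","url":"http://localhost:8080/sayHello/Bender"}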
When expressed this way, the logs can be indexed much more efficiently and
provide various aggregation and visualization capabilities, such as those available
in Kibana.
In this chapter's exercise, we will use the standard SLF4J API, which does not
support the log messages as structured data. However, we do configure the
structured formatter for the logs when we send them to Logstash. It can be found
in resources/logstash-spring.xml files in every module, for example, in client:
<appender name="logstash"
class="net.logstash.logback.appender.LogstashTcpSocketAppender">
<destination>${logstash.host}:5000</destination>
<encoder class="net.logstash.logback.encoder.LogstashEncoder">
<customFields>
{"application":"hello-app","service":"client-1"}
</customFields>
</encoder>
</appender>
The Hello application in this chapter includes this form of integration: every log
record is automatically tagged with the IDs of the trace and span that are active
when the record is written. As you may recall from Chapter 4, Instrumentation Basics
with OpenTracing, the OpenTracing API for Java defines the concept of scope managers,
which are responsible for keeping track of the currently active span. The application
installs a custom MDCScopeManager that, whenever a span is activated, copies its
identifiers into the logging MDC:
replace("trace_id", traceId);
replace("span_id", spanId);
replace("trace_sampled", sampled);
}
As you can see, it casts the current span context to the Jaeger implementation
and retrieves the trace ID, span ID, and the sampling flag, which are then stored
in the MDC using the replace() method. The lookup() method is used to retrieve
previous values of these attributes, which are restored once the scope is deactivated:
@Override
public void close() {
this.scope.close();
replace("trace_id", previousTraceId);
replace("span_id", previousSpanId);
replace("trace_sampled", previousSampled);
}
When we instantiate the Jaeger tracer in the TracingConfig, we pass it the custom
scope manager:
Configuration configuration = Configuration.fromEnv(app.name);
Tracer jaegerTracer = configuration.getTracerBuilder() //
.withSampler(new ConstSampler(true)) //
.withScopeManager(new MDCScopeManager()) //
.build();
Let's see how all this integration works in Kibana. We have already seen the logs
found in Kibana in the Prerequisites section. To the left of the log display, there is
a vertical section listing all field names that Elasticsearch has discovered in the log
stream produced by our services, including fields such as application, service,
trace_id, and so on. When you mouse-over those fields, the add button appears,
which allows you to add the fields to the Selected Fields section at the top of the
sidebar. Let's select at least three fields: service, level, and message, in this order.
We do not have to add the timestamp, as Kibana will automatically display it:
Once the fields are selected, we will see a display of the logs that is much easier
to read:
Figure 11.7: Logs displayed by Kibana after selecting only a subset of the fields
As we could have expected, the logs are not very usable, since the logs from
different concurrent requests are all mixed up. How would we use that to investigate
a problem, such as the error rates that we were seeing with the metrics? If we scroll
the results, we will see, at some point, the logs with the message "simulating failure".
Let's focus on those by adding a filter in the query textbox at the top of the UI:
message:"simulating failure"
Hit the search button and we might get results like in Figure 11.8.
Now expand one of the records using the triangle on the left (for example, from the
formatter service). We can see, as an example, that the log message was added by
the lib.ChaosMonkey class, as expected. Near the bottom of the list of fields, we find
trace_id, span_id, and trace_sampled fields added by our MDCScopeManager:
Figure 11.9: Once a single log record is expanded, it shows a list of fields and values
We can now search for all logs for that specific request by replacing the query for the
message text with a search by trace ID:
trace_id:610d71be913ffe7f
This final search result represents the full execution of a request, starting at the client
and ending with the client logging the error message "error from server." Granted,
this is not particularly exciting in this simple application, but it would be much
more useful when we deal with a real production system that involves dozens
of microservices executing a single request.
In the languages where thread-locals are not available, we are back to the same
overall problem of having access to the context when using the off-the-shelf logging
APIs not designed with context propagation in mind. Ideally, we would like to see
logging APIs in Go that require the context as the first argument to all the logging
methods, so that implementations of that API could pull the necessary metadata
out of the context into the log fields. This is currently an active area of design,
with the OpenCensus project looking to introduce context-aware logging APIs.
One extra benefit of OpenTracing–Spring Boot integration is that all log records
generated by the application using the logging API are automatically attached to
the currently active span (in the version of integration we used, it is only available
for the Logback logging framework and can be turned on and off via the application
properties). If we expand the root "client" span in the trace, we will see four logs;
two generated by the Jaeger client when the baggage is updated in the span, and
the bottom two appended by the OpenTracing–Spring Boot integration, in a similar
structured format (Figure 11.12).
Figure 11.12: Four span logs in the root span of the trace
If you are interested, the code for this integration is in the SpanLogsAppender class
found in the io.opentracing.contrib.spring.cloud.log package in the GitHub
repository https://github.com/opentracing-contrib/java-spring-cloud/.
Having all the log statements displayed in the right place in the trace view can be
extremely informative and a much better experience in troubleshooting than looking
at the same logs in Kibana, even if filtered by the trace ID. It does not mean that all
logs must always be sent to the tracing backend: as long as the log records capture
the span ID, they can be lazily pulled from the logging backend by the tracing UI.
To summarize, the integration between traces and logs gives us several capabilities:
• We can use the log storage to search for logs for a single request, via trace ID.
• We can use the rich infrastructure that already exists in many organizations
for processing and aggregating logs to gain operational insights in
aggregate, and then drill down to individual traces, since all logs are
tagged with a trace ID.
• We can build integration between tracing UIs and the logging storage to pull
the logs by trace and span ID into trace visualizations, and display them in
the right contextual place in the trace.
• In some cases, as in the example of OpenTracing–Spring Boot integration,
the log messages can be redirected to be stored as span logs in the tracing
backend.
All these integrations beg a question: should the logging and tracing backends be
separate in the first place? After all, traces are just a more specialized and structured
form of log events. When we discussed Pivot Tracing in Chapter 10, Distributed
Context Propagation, we saw the following query:
FROM bytesRead IN DataNodeMetrics.incrBytesRead
JOIN client IN FIRST(ClientProtocols) ON client -> bytesRead
GROUP BY client.procName
SELECT client.procName, SUM(bytesRead.delta)
This query treats the instrumentation events as a stream of structured records that
can be joined (using the causality carried in the baggage) and aggregated on demand,
blurring the line between tracing, logging, and metrics.
Another example is the service from honeycomb.io that is built around the idea
of collecting raw, rich, structured events, and providing querying, aggregation, and
charting capabilities on top of that. As long as the events capture some causality
(which can be automatically provided, as you have seen in this chapter), the data
can be used to build both time series and traces. At some point of system scale, the
events need to be sampled, so the Honeycomb offering does not completely remove
the need for metrics if we want highly accurate measurements for monitoring,
but as far as troubleshooting and debugging complex systems goes, it makes little
distinction between logging and tracing events.
Of course, there are scenarios where tracing ideas seem not really applicable to
some events, for example, for the application's bootstrap logs, which are not tied to
any distributed transaction. However, even in that example, someone had to press
the button somewhere to kick off the deployment, or another instance of this service
died somewhere else causing this instance to start, so there is still a notion of some
distributed orchestration workflow that could be modeled as a trace to understand
its causality. In any case, if we treat all logs as equal events, some of which may or
may not have the context-based causality links, then the only difference between
them is in how we analyze those events.
These are not particularly prescriptive answers, but this is a topic that people are
starting to seriously think and talk about, so we can hope for better answers to
come out soon.
Summary
Metrics, logs, and traces are often called the "three pillars of observability", a
term that does not do justice to each tool individually, or in combination, and makes
many software organizations inclined to check all three boxes, sometimes by using
three different vendors, without getting any better observability for their systems.
In this chapter, we discussed how metrics and logs lack the investigative and
debugging power when applied to distributed systems because in their standard
forms, they are not aware of the distributed request context and cannot provide
a narrative for a single execution. I showed how combining these tools with
context propagation and tracing enhances their ability to explain system behavior.
The chapter touched upon the general lack of, and the need for, context-aware APIs
for metrics and logging in the industry. While using context propagation provided
by tracing is a workable solution, developing context-aware monitoring APIs would
be much easier if there were general-purpose, tool-agnostic context propagation
APIs, similar to the Tracing Plane we reviewed in Chapter 10, Distributed Context
Propagation.
In the next chapter, we will return our attention exclusively to end-to-end tracing
and discuss some data mining techniques that can be used to gain application
insights from the large corpus of traces.
References
1. Charity Majors. There are no three pillars of observability. On Twitter:
https://twitter.com/mipsytipsy/status/1003862212685938688.
2. Ben Sigelman. Three Pillars, Zero Answers: We Need to Rethink Observability.
KubeCon + CloudNativeCon North America 2018: https://bit.
ly/2DDpWgt.
Gathering Insights with
Data Mining
Let us finish Part III of the book with a discussion of perhaps the most exciting
and promising area for future exploration for the practitioners of end-to-end
tracing. Distributed tracing data provides a treasure trove of information about
our distributed systems. I have already shown that even inspecting a single trace
can be an exceptionally insightful exercise that often helps engineering teams
understand performance issues and identify the root causes. However, even
with low probability of sampling, software systems operating at internet-scale
can record millions or even billions of traces per day. Even if every engineer in
the company looks at a few traces each day, it is going to add up to a tiny fraction
of all the data collected by the end-to-end tracing backend. It is a shame to let the
rest of this data go to waste, when we can build data mining tools to process all
of it; create useful aggregations; discover patterns, anomalies, and correlations; and
extract insights that otherwise will not be apparent by looking at individual traces.
A very important reason to work with aggregations is to avoid being misled by one-
off traces that could be outliers; for example, if we find a trace where the latency of
some span is very high, is it worth investigating and debugging? What if its latency
was 15 seconds, but the overall 99.9 percentile of latency for this service is still under
a second? We could waste hours of engineering time chasing after a random outlier
that may have no impact on the system service level objectives (SLOs). Starting
the investigation from the aggregates, like the latency histograms we discussed in
Chapter 9, Turning the Lights On, and narrowing it to a few traces that are known
representatives of a certain class, is a much better workflow.
We discussed some examples of aggregate data in Chapter 9, Turning the Lights On,
such as deep dependency graphs, latency histograms, trace feature extractions, and so
on. Bulk analysis of tracing data is a relatively new field, so we can expect many more
examples to surface in blog posts and conference talks in the future. Very often, the
process of finding a root cause of a pathological behavior in the system is an iterative
process that involves defining a hypothesis, collecting data to prove or disprove it,
and moving on to another hypothesis. In this process, the prefabricated aggregations
can be useful as a starting point, but generally, the data analysis framework needs
to be more flexible to allow the exploration of patterns and hypotheses that may
be very unique to a given situation or a person doing the analysis.
In this chapter, rather than focusing on any specific aggregations or reports produced
with data mining, we will discuss the principles of building a flexible data analysis
platform itself. We will build a simple aggregation using the Apache Flink streaming
framework that will cover several architectural aspects needed for any trace analysis
system. We will also discuss some approaches that the other companies with mature
tracing infrastructure have taken.
Feature extraction
The number of possible aggregations and data mining approaches is probably
only limited by engineers' ingenuity. One very common and relatively easy-to-
implement approach is "feature extraction." It refers to a process that takes a full trace
and calculates one or more values, called features, that are otherwise not possible
to compute from a single span. Feature extraction represents a significant reduction
in the complexity of the data because instead of dealing with a large directed
acyclic graph (DAG) of spans, we reduce it to a single sparse record per trace, with
columns representing different features. Here are some examples of the trace features:
• the total number of spans in the trace;
• the number of spans emitted by each service participating in the trace (the
feature we will compute in this chapter's exercise);
• the collection of (path, count) pairs needed to build the path-aware service
graphs discussed in Chapter 9, Turning the Lights On.
Such a data mining pipeline typically consists of a tracing backend, a trace completion
trigger, a feature extractor, and an aggregator. In the following sections, I will go into
detail about the responsibilities of each of these components.
Tracing backend
The data pipeline needs the source of tracing data to process, and the tracing
backend acts as that source. A distinct characteristic of the tracing backend in
this scenario is that it receives tracing spans from many services in a distributed
application asynchronously, often out of order, and sometimes not at all due to
network failures. A backend like Jaeger simply puts all spans it receives into storage,
one by one, and does not attempt to reason about which spans belong to which trace.
It can only reassemble the full trace at query time.
Trace completion trigger
Before the pipeline can extract features from a trace, it needs to decide that the trace
is complete; this is the job of the trace completion trigger. Probably the most commonly
used and the simplest approach is to wait for a predetermined time interval after
receiving the first span for a previously unseen trace. As an example, we might have
domain knowledge that most of the requests
to our application are serviced within a few seconds. We can pick an interval that
will fit nearly all requests, such as 30 seconds or one minute, and declare the trace
complete after that time elapses. Despite its simplicity, this approach has a few
obvious downsides:
• The pipeline must buffer the spans of all in-flight traces for the duration of
the window, and every trace is delayed by the full wait interval, even if all
of its spans arrived within the first second.
• Any trace whose spans keep arriving after the window expires will be
processed incompletely.
• It is very common for very large distributed systems to have workflows that
operate on vastly different time scales. For example, while many RPC-based
workflows are very quick (a few seconds in the worst case), some operational
workflows, like deploying a new version of a service across many nodes,
may take minutes or even hours. A single time window threshold is not
suitable in these cases.
There are other heuristics that can improve the accuracy of the time-window-based
trace completion trigger. Some of them need to be implemented in the tracing
libraries that receive callbacks from the trace points. As an example, keeping track
of how many child spans were created for a given parent span (inside the same
process) allows the trigger to run basic sanity checks on whether all of those child
spans have been received. The trace completion trigger can use a larger time window
for slower workflows, yet detect that a trace for a short workflow is complete based
on these sanity checks of the DAG.
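For example, if the tracing library reported how many children each span created (an assumption that requires explicit tracer support), the trigger could run a check along these lines, using the simplified Span and Trace types shown later in this chapter; the parentSpanId and reportedChildCount field names are illustrative:

// returns true if every span has received as many children as its parent reported
boolean looksComplete(Trace trace) {
    Map<String, Long> receivedChildren = new HashMap<>();
    for (Span span : trace.spans) {
        if (span.parentSpanId != null) {
            receivedChildren.merge(span.parentSpanId, 1L, Long::sum);
        }
    }
    for (Span span : trace.spans) {
        if (span.reportedChildCount > receivedChildren.getOrDefault(span.spanId, 0L)) {
            return false; // some child spans have not arrived yet
        }
    }
    return true;
}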
The trace completion trigger can also be given some statistics about previously
observed behaviors of the system, collected through another round of data mining
over historical data. The statistics can show approximate distribution of latencies for
each endpoint of each service, which can help the trigger to build estimates of when
each in-flight trace should complete. Perhaps the trace completion trigger can even
be built using machine learning.
Feature extractor
The feature extractor receives a complete trace and runs the business logic to
calculate useful features from it. The features can vary from simple numerical values,
such as the total span count, to more complex structures. As an example, if we want
to build a path-aware service graph that we discussed in Chapter 9, Turning the Lights
On, then for each trace we may want to produce a collection of (path, count) pairs.
The code exercise we will do in this chapter will be generating the former type of
features: count of spans by services found in the trace graph.
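A sketch of that computation, using the simplified Span and Trace types that appear later in this chapter (the serviceName and operationName field names are assumptions about that model):

Map<String, Integer> spanCounts = new HashMap<>();
for (Span span : trace.spans) {
    // keys follow the {serviceName}::{operationName} convention used in the trace summaries
    spanCounts.merge(span.serviceName + "::" + span.operationName, 1, Integer::sum);
}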
The feature extractor is where most of the custom logic resides, and it should
be extensible to allow adding more calculations. The common part across those
individual extractors is the data model representing the trace as a DAG. Canopy
authors described a specially designed higher-level trace model that is constructed
from the raw tracing data and makes it easier to write aggregations and business logic.
The model ideally should also provide some API to allow writing graph queries
against the trace DAG. In some cases, the users may only want to extract features
from a subset of all traces passing through the system, so the graph queries can also
be used as a filtering mechanism.
Depending on how the trace completion trigger is implemented, it is possible that the
trigger may fire more than once, if some stray spans arrive at the backend after the
first trigger has fired. The feature extractor may need to deal with these situations.
Aggregator
The aggregator is an optional component. Since we reduce the rich span graph
to a small set of features, it is often not expensive to store the computed set of
features as a single record per each trace, especially if the underlying storage
directly supports aggregate queries. The aggregator in this case is a no-op; it
simply passes each record to the storage.
In other cases, storing each record may be too expensive or unnecessary. Consider
the pair-wise service graphs we saw in the Jaeger UI. The underlying data structure
for that graph is a collection of DependencyLink records; for example, in Go:
type DependencyLink struct {
Parent string // parent service name (caller)
Child string // child service name (callee)
CallCount uint64 // # of calls made via this link
}
Many of these link records can be generated for each trace. It is not important to keep
each of them in the storage, since the final service graph is an aggregation of many
traces over a period of time. This is typically the job of the aggregator component.
For the service graph use case, it would group all DependencyLink records from
many traces by the (parent, child) pairs and aggregate them by adding up the
callCount values. The output is a collection of DependencyLink records where
each (parent, child) pair occurs only once. The aggregator is called for a set of
features extracted from a trace (after the trace completion trigger), accumulates the
data in memory, and flushes it to permanent storage after a certain time window,
for example, every 15 minutes or whichever interval is appropriate.
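As an illustration, a minimal in-memory version of such an aggregator might look like the following sketch, written in Java for consistency with this chapter's exercise code; the DependencyLink fields mirror the Go struct above, and LinkStorage is an assumed storage interface:

import java.util.HashMap;
import java.util.Map;

public class DependencyAggregator {
    private final Map<String, Long> callCounts = new HashMap<>();

    // called once per completed trace with the links extracted from it
    public synchronized void aggregate(Iterable<DependencyLink> links) {
        for (DependencyLink link : links) {
            callCounts.merge(link.parent + "->" + link.child, link.callCount, Long::sum);
        }
    }

    // called on a timer, for example every 15 minutes
    public synchronized void flush(LinkStorage storage) {
        storage.save(callCounts);
        callCounts.clear();
    }
}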
Since version 1.8, the Jaeger backend supports Kafka as an intermediate transport
for spans received by the collectors. The jaeger-ingester component reads the
spans from a Kafka stream and writes them to the storage backend, in our case
Elasticsearch. Figure 12.2 shows the overall architecture of the exercise. By using
this deployment mode of Jaeger, we are getting traces fed into Elasticsearch so that
they can be viewed individually using the Jaeger UI, and they are also processed
by Apache Flink for feature extraction. The feature records are stored in the same
Elasticsearch and we can use the Kibana UI to look at them and to build graphs.
We will need a steady source of tracing data. Any application that continuously
produces traces will do, for example, the Hello application from Chapter 11,
Integration with Metrics and Logs, with repeatedly running clients, or even HotROD
from Chapter 2, Take Tracing for a HotROD Ride, with a curl command running
in a loop. Instead, we will use a microservices simulator (https://github.com/
yurishkuro/microsim) that can simulate sufficiently complex architectures that
can be easily changed to produce a different shape of the traces.
The feature extraction job will count the number of spans per service in each trace
it receives and write a trace summary record to Elasticsearch. Figure 12.3 shows
how one of these records might look in Kibana. It includes the traceId field,
a timestamp, and a group of columns nested under spanCounts in the format
{serviceName}::{endpointName}: {number of spans}.
While this feature extractor is very simple, it serves as an illustration of the data
mining pipeline and includes all of the components we discussed previously,
except for the aggregator, which is not needed in this example.
Prerequisites
As you can see in Figure 12.2, there are quite a few components that we need to run
in order to bring up the exercise architecture. Fortunately, most of the components
can be run in Docker containers. The only two components that we will need to
run directly are the microservices simulator (in Go) and the Apache Flink feature
extraction job (Java).
There is only one Java artifact built by this project, therefore all the source code
is located in the top src/ directory. The docker-compose.yml file is used to
spin up other components: Jaeger backend, Apache Kafka, Elasticsearch, and
Kibana. elasticsearch.yml and kibana.yml are the configuration files for
Elasticsearch and Kibana respectively. The hotrod*.json files are the profiles
for the microservices simulator.
We pass the -d flag to run everything in the background. To check that everything
was started correctly, use the ps command:
$ docker-compose ps
Name Command State
---------------------------------------------------------------------
chapter-12_elasticsearch_1 /usr/local/bin/docker-entr ... Up
chapter-12_jaeger-collector_1 /go/bin/collector-linux Up
chapter-12_jaeger-ingester_1 /go/bin/ingester-linux Up
chapter-12_jaeger_1 /go/bin/query-linux Up
chapter-12_kafka_1 /etc/confluent/docker/run Up
chapter-12_kibana_1 /bin/bash /usr/local/bin/k ... Up
chapter-12_zookeeper_1 /etc/confluent/docker/run Up
Sometimes, Elasticsearch and Kafka take a long time to complete their startup
process, even though the ps command will report them as running. The easiest
way to check is to grep the logs:
$ docker-compose logs | grep kibana_1 | tail -3
kibana_1 | {"type":"log","@timestamp":"2018-11-
25T19:10:37Z","tags":["warning","elasticsearch","admin"],"pid":1,"mes
sage":"Unable to revive connection: http://elasticsearch:9200/"}
kibana_1 | {"type":"log","@timestamp":"2018-11-
25T19:10:37Z","tags":["warning","elasticsearch","admin"],"pid":1,"mes
sage":"No living connections"}
kibana_1 | {"type":"log","@timestamp":"2018-11-
25T19:10:42Z","tags":["status","plugin:[email protected]","info"],"
pid":1,"state":"green","message":"Status changed from red to green -
Ready","prevState":"red","prevMsg":"Unable to connect to
Elasticsearch at http://elasticsearch:9200."}
We can see that the top two logs indicate Elasticsearch not being ready, while
the last log reports the status as green. We can also check the Kibana UI at http://
localhost:5601/.
The health check status ready indicates that the collector is ready to write to Kafka.
The last two lines (your order may be different) indicate that the job has connected
to the Kafka broker and to Elasticsearch. We can leave the job running from this
point and create some traces to give it data to process.
Microservices simulator
Any application instrumented with Jaeger can be used to feed the data to the
Flink job. To demonstrate how this job can be used to monitor trends, we want
an application that can generate continuous load and can change the shape of the
traces so that we can see the differences in the trace summaries. We will be using
a microservices simulator microsim, version 0.2.0, from https://github.com/
yurishkuro/microsim/. The source code for this chapter includes two JSON files
describing the simulation profiles that model the HotROD demo application from
Jaeger. Let's run the simulator to generate a single trace. We can run it either as
a Docker image (recommended), or from source.
In the docker run command, we ask to run the program on the host network,
so that it can locate the jaeger-collector via the localhost hostname, which is the default
setting in microsim. We also mount the chapter's source code directory as /ch12
inside the container, so that we can access the simulation profile configuration files.
The microsim project uses dep as its dependency manager, which needs to be installed.
Please see the instructions at https://github.com/golang/dep#installation.
On macOS, it can be installed via brew:
$ brew install dep
$ brew upgrade dep
This will build the microsim binary and install it under $GOPATH/bin. If you
have that directory added to your $PATH, you should be able to run the binary
from anywhere:
$ microsim -h
Usage of microsim:
-O if present, print the config with defaults and exit
-c string
name of the simulation config or path to a JSON config file
[ . . . ]
Verify
If we now look at the terminal where the Flink job is running, we should see that
it generated a single trace summary, indicated by a log line like this:
3> tracefeatures.TraceSummary@639c06fa
In Kibana, navigate to Management | Index Patterns and click on Create index pattern.
Type trace-summaries into the Index pattern textbox and click Next step. On
the next screen, open the dropdown menu for Time Filter field name and select the
@timestamp field. Then click on the Create index pattern button. Kibana will create
the index pattern and display a table of all the fields it discovered in the single trace summary
record we saved, including the span count fields, like spanCounts.frontend::/dispatch
or spanCounts.frontend::HTTP GET.
Once the index pattern is created, you can look at the trace summary records on
the Discover tab; you should see an entry similar to the one shown in Figure 12.3.
The class model.ProtoUnmarshaler is used to convert the spans from the Protobuf
model to the simplified Span type. These spans are then aggregated into a Trace type:
public class Trace {
    public String traceId;
    public Collection<Span> spans;
}
As we discussed previously, this requires waiting for all spans of the given trace
to arrive at the tracing backend from all the participating microservices. So, the first
part of the job implements the trace completion trigger using a simple time window
strategy of waiting for five seconds (in the local simulation we usually don't need to
wait longer than that). Then the job performs feature extraction and generates a trace
summary:
public class TraceSummary implements Serializable {
    public String traceId;
    public long startTimeMillis;
    public Map<String, Integer> spanCounts;
    public String testName;
}
Let's look into the code of the SpanCountJob itself. It starts by defining the data
source as a Kafka consumer:
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("group.id", "tracefeatures");
Here we are providing the address of the Kafka broker via properties to
FlinkKafkaConsumer, a standard component of the Flink distribution. We tell it
to read from the Kafka topic jaeger-spans, and pass ProtoUnmarshaler, which
converts the data from Protobuf to the model.Span type. For testing purposes, we
also instruct it to start consuming data from the beginning of the topic on every run.
In a production setting, you will want to remove that instruction.
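Putting those pieces together, the source definition might look roughly like the following sketch. The variable env stands for the job's StreamExecutionEnvironment, ProtoUnmarshaler and Span come from the chapter's source code, and the exact connector class name (FlinkKafkaConsumer versus a versioned variant such as FlinkKafkaConsumer011) depends on the Flink release in use:

// Sketch: wiring the Kafka source for the span count job.
FlinkKafkaConsumer<Span> consumer = new FlinkKafkaConsumer<>(
        "jaeger-spans",          // the topic that jaeger-collector writes to
        new ProtoUnmarshaler(),  // converts Protobuf bytes into model.Span
        properties);
// Testing only: replay the topic from the beginning on every run.
consumer.setStartFromEarliest();

DataStream<Span> spans = env.addSource(consumer, "Kafka: jaeger-spans");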
The stream of trace summaries produced at the end of the pipeline is printed to the console and written to Elasticsearch:
spanCounts.print();
spanCounts.addSink(ESSink.build());
The full pipeline definition looks a bit busy (a sketch of it appears at the end of this section), but in fact the main magic is in the first two stream operators.
keyBy() tells Flink to group all incoming records by a designated key; in our
case by trace ID. The output of this operator is a stream of span groups where
every group has spans with the same trace ID. The window() operator forces the
span accumulation in each group to continue for a certain period of time. Flink
has different notions of time (processing time, event time, and ingestion time),
each having different implications for the job's behavior.
Here we are using processing time, which means that all time-based functions,
such as windowing operators, use the current system clock on the machine executing
the respective operation as the timestamp of the record. The alternative is to use
event time, defined as the time that each individual event occurred on its producing
device. Since spans have a start and end time, there is already ambiguity of which
timestamp can be used as the event time. More importantly, the machines and
devices that produce the original spans can have various clock skews between
them, especially among mobile devices. We generally do not want our stream
processing to be dependent on those timestamps, so we're using the processing time.
We could have also used the ingestion time, which means each record is assigned
a timestamp equal to the current system clock of the machine executing the source
operator (that is, where records enter the Flink job). In our example, we are running
a job with a single windowing operator on a single-node cluster, thus using the
ingestion time would be equivalent to the processing time.
One operational downside of using the processing time for the trace completion trigger
shows up when the job experiences a lag, for example, if it was down for a couple of hours for
whatever reason. When it starts again and resumes processing the Kafka topic,
there will be a large volume of spans in Kafka that it needs to work through. Those spans
would have arrived in Kafka at a normal rate, but while the job is catching up,
it reads them much faster, and as a result their processing times end up
much closer to each other than during normal operation. The job may need
to keep a lot more data in memory, because all record timestamps congregate
at the beginning of the time window, yet the job still has to wait for a full
window interval while processing a much higher data volume from the backlog.
The second aspect of the windowing strategy we use is "session windows." Unlike
the other types of windows supported in Flink, session windows are not of a fixed
length. Instead, they are sensitive to activity, that is, the arrival of new items with
the same key. The session window closes after it receives no new events during the
specified time interval, in the selected notion of time. In our case, since we are using
processing time, it would be five seconds after no new spans arrive for a given trace,
according to the system clock. This approach alleviates some of the problems of the
fixed size windows we discussed earlier in this chapter.
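To make the description above concrete, the keyed, session-windowed part of the pipeline might look roughly like this sketch. The aggregator class name SpanToTraceAggregator is made up for illustration, while ProcessingTimeSessionWindows and Time are standard Flink classes:

// Sketch: group spans by trace ID and close each group five seconds after
// the last span for that trace ID has been seen (in processing time).
DataStream<Trace> traces = spans
        .keyBy(span -> span.traceId)   // assuming Span exposes its trace ID
        .window(ProcessingTimeSessionWindows.withGap(Time.seconds(5)))
        .aggregate(new SpanToTraceAggregator());  // collects a window into a Trace

// The summaries are then produced, printed, and written to Elasticsearch,
// as shown earlier in this section.
DataStream<TraceSummary> spanCounts = countSpansByService(traces);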
Feature extractor
The second function, countSpansByService(), contains the actual feature
extraction logic.
private static DataStream<TraceSummary> countSpansByService(
        DataStream<Trace> traces
) {
    return traces.map(SpanCountJob::traceToSummary);
}
The function itself is very simple, since it delegates to another function to convert
a Trace to a TraceSummary:
private static TraceSummary traceToSummary(Trace trace) throws Exception {
    Map<String, Integer> counts = new HashMap<>();
    long startTime = 0;
    String testName = null;
    for (Span span : trace.spans) {
    [ . . . ]
Here we see that feature extraction can be rather simple. Counting spans does not
require building a DAG of spans; we only need to iterate through all spans and build
a map of service or operation to counts. We also compute the smallest timestamp
across all spans and designate it as the trace timestamp, which allows us to visualize
the span counts by service as time series.
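Since the listing above is elided, here is a sketch of how the rest of the function might look, based on the description in this section. The field names on the Span class (serviceName, operationName, startTimeMillis, tags) and the tag key used for the test name are assumptions for illustration, not the actual code from the chapter's repository:

// Sketch: a possible body for traceToSummary(); field and tag names are assumed.
for (Span span : trace.spans) {
    // count spans per "service::operation" key
    String key = span.serviceName + "::" + span.operationName;
    counts.merge(key, 1, Integer::sum);

    // the smallest start time across all spans becomes the trace timestamp
    if (startTime == 0 || span.startTimeMillis < startTime) {
        startTime = span.startTimeMillis;
    }

    // remember the test name if the simulator attached it as a tag
    String test = span.tags == null ? null : span.tags.get("test_name");
    if (test != null) {
        testName = test;
    }
}

TraceSummary summary = new TraceSummary();
summary.traceId = trace.traceId;
summary.startTimeMillis = startTime;
summary.spanCounts = counts;
summary.testName = testName;
return summary;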
Observing trends
Now that we have our job running and we understand what it is doing, let's run
some experiments. The two JSON files with profiles for the microservices simulator
model the architecture of the HotROD demo application we covered in Chapter 2,
Take Tracing for a HotROD Ride. The second profile file, hotrod-reduced.json, is
nearly identical to hotrod-original.json, except that the simulator is instructed to
make only five calls to the route service instead of the usual 10 calls. This difference
would affect the SpanCountJob. To do the experiment, let the simulator run with the
original profile for a few minutes:
$ make microsim-run-original
docker run -v /Users/.../Chapter12:/ch12:ro --net host \
yurishkuro/microsim:0.2.0 \
-c /ch12/hotrod-original.json \
-w 1 -s 500ms -d 5m
[ . . . ]
2018/12/23 20:34:07 services started
2018/12/23 20:34:10 started 1 test executors
2018/12/23 20:34:10 running for 5m0s
Here, we tell the simulator to run a single worker (-w 1) for five minutes (-d 5m)
and to sleep for half a second (-s 500ms) after each request executed against the
architecture. If you look at the terminal where the Flink job is running, you should
be seeing the job printing lines about the trace summaries it generates:
3> tracefeatures.TraceSummary@4a606618
3> tracefeatures.TraceSummary@5e8133b
4> tracefeatures.TraceSummary@1fb010c3
1> tracefeatures.TraceSummary@147c488
3> tracefeatures.TraceSummary@41e0234e
2> tracefeatures.TraceSummary@5bfbadd2
3> tracefeatures.TraceSummary@4d7bb0a4
If we go to the Discover screen in Kibana, click on the time range in the top-right
corner and select Quick | Last 15 minutes, we should see the examples of the trace
summaries stored in the index (Figure 12.5).
After the first run of the simulator is finished, do a second run with the
reduced profile:
$ make microsim-run-reduced
docker run -v /Users/.../chapter-12:/ch12:ro --net host \
yurishkuro/microsim:0.2.0 \
-c /ch12/hotrod-reduced.json \
-w 1 -s 500ms -d 5m
[ . . . ]
2018/12/23 20:34:07 services started
2018/12/23 20:34:10 started 1 test executors
2018/12/23 20:34:10 running for 5m0s
After this second run, we can plot the trend of changes in the average span count
for the route service. To save some time, I have included a file called kibana-
dashboard.json that contains a pre-made dashboard configuration. To import
it into a blank Kibana, follow this process:
1. Make sure that you created the trace-summaries index pattern as described
in Prerequisites by running make kibana-create-index-pattern.
2. Open a fresh copy of Kibana at http://localhost:5601/.
3. Go to Management, then Saved Objects. You should see three tabs:
Dashboards, Searches, and Visualizations, all with zero count.
4. Click on the Import button in the top-right corner. Choose the kibana-
dashboard.json file. Confirm: Yes, overwrite all objects.
5. You may now get an Index Patterns Conflict popup that offers to associate
the objects being imported with the trace-summaries index pattern. Press
Confirm all changes.
6. Now the three tabs, Dashboards, Searches, and Visualizations, should
refresh and each should show the count of one.
7. Select the Dashboard screen from the left-side menu and pick the newly
imported dashboard called Trends. You should see a dashboard with two
panels: a plot on the left and a list of trace summaries on the right. You may
need to adjust the time interval in the top-right corner to see the data.
If the preceding import process does not work for you, do not despair; it is easy to
reconstruct manually. To do that, click on the Visualize item in the side bar menu.
You should see an empty screen with Create a visualization button. Click on it and
on the next screen, select the Line chart type from the Basic Charts.
Kibana will ask if you want to create the visualization from a new search or a saved
search. Under the new search option, select the trace-summaries index pattern. Kibana will open an
empty chart view and a side panel on the left where you can specify parameters. Make
sure you are on the Data tab in that panel, and in the Metrics/Y-Axis section specify:
• Aggregation: Average
• Field: spanCounts.route::/GetShortestRoute
Then in the next section, Buckets, select X-Axis from the table as the bucket type,
and for Aggregation select Date Histogram. Field and Interval should automatically
populate with @timestamp and Auto (Figure 12.6). Hit the run button (blue square
with white triangle at the top of the sidebar) and you should see a plot similar to the
one shown in Figure 12.7. If you do not see any data points, make sure your time range
selector in the top-right corner corresponds to the time when you ran the simulations.
Figure 12.7: Plot of a trend of average span count per trace for service "route", endpoint "/GetShortestRoute"
Of course, the chart itself is not especially exciting and very much expected: we
changed the simulation profile to call the route service five times per trace instead
of 10 times, and that is exactly what we see in the chart. However, if we use this
technique in production, it can provide powerful regression detection capabilities.
I picked a very simple-to-implement feature for this exercise; there are many more
interesting features that can be extracted from traces.
Figure 12.8: Name of the test included as a tag on the root span
The feature extractor, part of the SpanCountJob class, already extracts this tag into
a testName field of TraceSummary.
After the simulations finish, refresh the chart we used previously (you may have to
adjust the time frame). The average span count is now oscillating around the value
of 7.5, which is an average of 10 and 5 from the two streams of traces processed at
roughly similar rates, with some randomness introduced by imperfect alignment
with the time buckets (Figure 12.9).
Figure 12.9: Average count of GetShortestRoute spans per trace when running both simulations in parallel
Since we expect the normal average value to be 10, this time series indicates some
issue in the system. However, it does not point to the root cause, that is, that there
is another simulation running with a different shape of the call graph. Fortunately,
because our trace summaries include the testName feature, we can use it to group
the trace summaries and visualize them as two different time series, one per
simulation.
To do that in Kibana, navigate to the Visualize page from the left-sidebar menu. If
you loaded the dashboard from the provided JSON file, select the GetShortestRoute
chart from the list. If you created it manually, you should already have the edit
options for the graph on the screen. Under Buckets/X-Axis, add a Split Series sub-bucket, choose the Terms sub-aggregation, and select the testName field (depending on the Elasticsearch mapping, it may appear as testName.keyword).
Apply the changes by clicking on the blue square with the white triangle. Kibana
should show two perfectly horizontal lines at levels 10 and 5, for the original and
reduced simulation configurations respectively (Figure 12.10). The root cause for the
drop of the average GetShortestRoute span count is now obvious.
Figure 12.10: Average count of GetShortestRoute spans per trace partitioned by the testName attribute
This last example is very similar to the capabilities touted by many companies in the
monitoring space, which allow for partitioning of monitoring time series by multiple
dimensions, from various metadata sources. Here we do the same, but with the time
series built from trace features, some of which (the testName attribute) are used
as group-by dimensions. Thus, by storing the trace summaries in the raw form in
a storage capable of answering analytical queries, we open up many exploration
possibilities for hypotheses formulated by engineers.
If you use the dashboard, the representative trace summaries will show in the right
panel, otherwise switch to the Discover tab to find them. Copy one of the trace IDs
and use it to find the trace in the Jaeger UI, for example, http://localhost:16686/
trace/942cfb8e139a847 (you will need to remove the leading zeroes from the
traceId included in the trace summary).
As you can see in Figure 12.11, the trace only has five calls from the frontend service
to the route service, instead of the usual 10. If we had a more integrated UI, instead
of off-the-shelf Kibana, we could make this navigation from the chart to the trace
view completely seamless. With Elasticsearch as the backing store, the query used
to generate the chart is an aggregation query that computes an average per time
bucket, and I am not sure if there is a way to instruct Elasticsearch to return sample
document IDs (in our case, trace IDs) as exemplars for the aggregation buckets.
Figure 12.11: Sample trace in Jaeger UI showing an anomalous number of calls (five) to the "route" service
Beware of extrapolations
There is one potential problem with drawing conclusions based on the techniques
we just discussed. It is very common for high-scale systems to use distributed
tracing with a low probability of sampling. The Dapper paper mentioned that Google
sampled traces with a probability of 0.1%, and judging by recent conversations with Google
developers, it may be sampling even less: 0.01% of traces.
How do we know that the data we derive via data mining is not complete garbage,
statistically speaking? Unfortunately, there is no simple formula here because the
statistical significance of the results depends not only on the sample size, but also on
the question we are trying to answer with the data, that is, the hypothesis. We may
not need highly accurate data to investigate the hypothesis, but we should know the
margin of error and decide if it is acceptable. My recommendation is to seek help
from your data scientists for the specific use cases.
Fortunately, most companies do not operate at the scale of Google or Facebook
and might afford a much higher rate of trace sampling. They may also tolerate the
performance overhead of the tail-based sampling approach that we discussed in
Chapter 8, All About Sampling. Tail-based sampling opens up new possibilities for
data mining because it needs to keep full traces in the memory of the collectors
before sampling them. It is possible to build an infrastructure into those collectors
to run feature extractors in line with the collection, on the full population of requests,
guaranteeing very accurate results.
Historical analysis
So far, we have only talked about real-time analysis of tracing data. Occasionally,
it may be useful to run the same analysis over historical trace data, assuming it is
within your data store's retention periods. As an example, if we come up with a new
type of aggregation, the streaming job we discussed earlier will only start generating
it for new data, so we would have no basis for comparison.
Fortunately, the big data frameworks are very flexible and provide a lot of ways to
source the data for analysis, including reading it from databases, or HDFS, or other
types of warm and cold storage. In particular, Flink's documentation says it is fully
compatible with Hadoop MapReduce APIs and can use Hadoop input formats as
a data source. So, we can potentially use the same job we implemented here and
just give it a different data source in order to process historical datasets.
While these integrations are possible, as of the time of writing, there are not very
many open source implementations of trace analysis algorithms. The Jaeger team
at Uber is actively working on building such tools, as well as teams from other
open source projects, like Expedia's Haystack.
Ad hoc analysis
In October 2018, the members of the tracing team from Facebook gave a presentation
at the Distributed Tracing – NYC meetup [2], where they talked about a new
direction that they are taking with their tracing system, Canopy. While not based
on open source technologies like Apache Flink, the feature extraction framework
in Canopy was conceptually similar to the approach we presented in this chapter.
The API for building new feature extractions was open to all Facebook engineers,
but it often had a steep learning curve and required fairly deep familiarity with the
overall tracing infrastructure and its data models. More importantly, new feature
extractors had to be deployed in production as part of Canopy itself, which meant
the Canopy team still had to be deeply involved in reviewing the code and deploying
the new analysis algorithms. Finally, feature extraction was primarily designed to
work on live data, not on historical data. All of this was creating enough procedural
friction to make feature extraction and data mining not very accessible or attractive
to rank-and-file engineers at Facebook as a platform for performance investigations.
The team realized that they needed to democratize the tooling and remove
themselves from the critical path of developing new data analysis algorithms.
They observed that there are three classes of data analysis:
• Experimentation with small data sets: When someone has an idea for
a new feature, it is not easy to get the exact calculation or algorithm right
on the first try. Running iterative experiments on large data sets is also time
consuming. Ideally, the engineers should have a playground where they can
try out various algorithms on small production data sets, to prove that they
are getting some useful signal from a new feature.
• Experimentation with historical data sets: Once an engineer is happy with
the small-scale experiments, they may want to run them on a larger historical
population of traces to verify that the new algorithm or feature still gives
a strong and useful signal, that is, it was not just an anomaly of a small
sample. Ideally, the historical experiment can be run with the same code
developed in the first step.
• Permanent deployment as a streaming job: If after running the experiment
on a large historical dataset the engineers still observe that they are getting
a useful signal from the new feature extraction, they may want to deploy it
in production to run continuously as a real-time streaming job, to calculate
the feature for all future traces, start observing trends, defining alerts, and
so on. Once again, ideally, they should be able to do it using the same code
as for the first two steps.
The Facebook team decided that supporting Python and Jupyter Notebooks was the
best way to gain wider adoption among engineers and data scientists. The previous
version of Canopy used a custom domain-specific language (DSL) for describing
the feature extraction rules; one such program, for example, calculated how long
the rendering of the page in the browser took.
The DSL was difficult to learn, and the resulting code was difficult to maintain
by the engineering teams, often requiring the involvement of the tracing team.
In contrast, Python is a well-known language, and it is very popular among data
scientists. Feature extraction code expressed in Python is much easier to understand
and maintain. The speakers did not show the Python function equivalent to the
preceding program; however, it might look like this:
def browser_time_to_display(trace):
    browser_thread = trace.execution_units[attr.name == 'client']
    begin = browser_thread.points[0]
    end = browser_thread.points[attr.marker == 'display_done']
    return end.timestamp - begin.timestamp
Here is an example of a program the Facebook team did provide, which counts
the number of expensive (over 10 seconds) database calls in a trace:
def count_expensive_db_calls(trace):
    count = 0
    for execution_unit in trace.execution_units:
        if execution_unit.db_duration_ms > 10000:
            count += 1
    return count
An engineer can use this program to investigate a single trace or a small set of traces,
for example, by running it in a Jupyter Notebook. Then the same program can be run
in a batch mode against a large volume of historical data, to validate the hypothesis
about a performance issue. Facebook has an internal infrastructure similar to AWS
Lambda or serverless compute that makes running a Python program against large
data sets very easy.
Finally, if the engineer decides that this particular feature is worthy of continuous
monitoring and alerting, the same code can be deployed as a streaming job.
The Facebook tracing team said that they learned some important lessons
by developing this data analysis platform:
• Engineers at Facebook have really interesting ideas and tools that they
want to apply to the tracing data.
• However, traces themselves can be hard to understand and manipulate,
especially as more and more of the architecture is instrumented and the
traces become really large, including tens of thousands of data points.
• Traces cover a very wide variety of workflows and heterogeneous
applications, from mobile apps and browsers, to storage and messaging
backends. Building a single "do-it-all" tool is often impossible, or at least
not productive. Such a tool might be so complex that it would take a power
user to be able to understand it.
• By allowing simple programmatic access to the traces, including through
such convenient exploratory and visualization frameworks as Jupyter
Notebook, the infrastructure team removes itself from the critical path of data
analysis and enables the rest of the engineers to use their domain knowledge
and ingenuity to build very specific data analysis tools that solve the right
problems.
Summary
Even though distributed tracing is still a bit of a novelty in the software
engineering industry, the open source world is making great strides in making
free tracing infrastructure available to anyone, from data gathering via projects like
OpenTracing, OpenCensus, and W3C Trace Context, to storing and processing the
data via many open source tracing backends like Jaeger, Zipkin, SkyWalking, and
Haystack. As tracing infrastructures become commodities, data mining and data
analysis are going to become the main focus of research and development.
In this chapter, we covered some basic techniques for building data analysis tools
on top of the tracing data, including looking at some of the challenges, such as trace
completion triggers, which do not yet have perfect solutions.
References
1. Jonathan Kaldor, Jonathan Mace, Michał Bejda, Edison Gao, Wiktor
Kuropatwa, Joe O'Neill, Kian Win Ong, Bill Schaller, Pingjia Shan, Brendan
Viscomi, Vinod Venkataraman, Kaushik Veeraraghavan, Yee Jiun Song.
Canopy: An End-to-End Performance Tracing and Analysis System, Symposium
on Operating Systems Principles, October 2017.
2. Edison Gao and Michael Bevilacqua-Linn. Tracing and Trace Processing
at Facebook, Presented at Distributed Tracing NYC meetup, October
23, 2018: https://www.meetup.com/Distributed-Tracing-NYC/
events/255445325/.
Part VI
Deploying and Operating Tracing Infrastructure
Implementing Tracing in Large Organizations
"These and other clichés will be available to you all for one more day of training
with me."
We have arrived at the last part of this book. Hopefully, by now you are convinced
that end-to-end tracing is an invaluable and must-have tool in your arsenal
for monitoring and managing the performance of complex distributed systems.
Parts II and III of this book are primarily addressed to the users of distributed
tracing, covering topics from how to instrument applications, to how to use
tracing data to gain insights into system behavior and perform root cause analysis.
Even though we discussed many technical aspects of data gathering in Part II, there
are many non-technical organizational challenges that need to be solved, especially
in large engineering organizations. In this chapter, we will discuss these challenges
and some approaches to addressing them. Just like in business, there is no one-size-
fits-all technique that guarantees success, and each organization may be sufficiently
unique to require improvisation and custom solutions. The recommendations in this
chapter are based on the lessons learned from my discussions with colleagues from
other companies, as well as my own experience with rolling out distributed tracing
at Uber. Treat them as good practices and recipes, rather than a manual.
Consider a small tech company that has about a dozen software engineers. If this
company embraces the idea of microservices-based architecture, it can build a system
that contains from 10 to 20 microservices. Most engineers would know what each
service does, if not necessarily the details of how they all interact. If they decide to
add distributed tracing instrumentation to the system, it is generally not a large task,
and it can be accomplished by one or two people in a short period of time.
Contrast this with a large organization that has hundreds or even thousands of
engineers. Melvin Conway observed in his 1967 paper [1] that "organizations which
design systems...are constrained to produce designs which are copies of the communication
structures of these organizations" (an adage also known as Conway's law). Following
this reasoning, the design of a system based on microservices would naturally
mirror the hierarchical structure of the organization, where each team develops
and operates a small number of microservices and has little knowledge of how
the microservices in the other parts of the organization are implemented.
Unless the system has been built from the start on top of a strongly unified
application development infrastructure, for example, a single application and
dependency injection framework, or a single RPC framework, then deploying tracing
instrumentation across the whole system inevitably requires massive amounts of
domain knowledge around the design of each group of microservices developed by
individual teams. It is intractable for a small, centralized tracing team to research all
the different ways the microservices are built by different teams, least of all to go and
add tracing instrumentation to them. If just a couple of motivated people can do this
work in a small company, in a large organization this work must be decentralized.
This human factor is one of the biggest roadblocks to wide-scale adoption
of distributed tracing. While many of the technical challenges can be solved through purely
technical solutions, this problem requires social engineering and cultural changes
in the organization. In the remaining sections, we will discuss both types of solutions.
Standard frameworks
Feature velocity is one of the most valuable properties of software development
to the business supported by that software. Sometimes, the desire to increase feature
velocity leads to situations where engineers are encouraged to use whichever tools
get them to production the fastest:
• You just came from a Ruby on Rails shop? Go implement the next service
in Ruby.
• Spent the last five years working with Node.js? Build the next backend
service in Node.js.
• Familiar with the Spring framework in Java? Go for it. What, not Spring,
but Dropwizard? Go for it.
This approach can lead to a highly fractured ecosystem of the software run by the
company. Even for a small team of a dozen engineers, it seems unsustainable, as
everyone needs to be familiar with every other technology or framework to make
code changes in different services. Yet the situation is not uncommon, if not to such
an extreme degree as shown.
At some point during the organization's growth, the velocity starts to slow down
due to the overhead of context switching. The system becomes less reliable as the
infrastructure teams are not able to provide common infrastructure services and
components, such as the metrics library, for every combination of programming
language and application frameworks used.
It seems obvious that by standardizing on a small set of technologies, the teams
can become more efficient, easily transfer skillsets between teams, and so on.
It is especially important for deploying tracing instrumentation, since every
framework in use by the engineers may require special instrumentation. The
application developers may not have enough understanding of tracing to do it
correctly (tell them to read this book—wink), therefore that task often falls onto
the central tracing team. It is difficult to scale the efforts of the tracing team if it
needs to instrument dozens and dozens of frameworks.
Once the organization converges onto a small set of programming languages and
frameworks, making the tracing instrumentation come for free becomes easier.
The selection of the frameworks to standardize on should include considerations
of how well they are instrumented for observability, which includes distributed
tracing. Just like security, the observability cannot be an afterthought. These days,
there is no excuse for the frameworks used to build microservices to be designed
without observability features or at least extension points that allow improved
observability through middleware and plugins.
Along with the standardization efforts, the infrastructure team can build an internal
adapter of the framework. For example, consider the Spring framework in Java. It is
very flexible and there are probably numerous ways an application can be constructed
on top of it, and wired with additional dependencies for numerous infrastructure
concerns, for example, which metrics or logging library to use, how to find the service
discovery system, or where to find the configuration and secrets in production. This
diversity is actually detrimental to the overall organization; it's the same problem as
we discussed in the previous section, only on the scale of a single framework.
The infrastructure team can provide a library that bundles some of these
configuration components together to allow the application developers to focus on
the business logic instead of the infrastructure wiring. Instead of depending on the
open source Spring framework, they can depend on the internal adapter that brings
Spring as a transitive dependency and forces a standard way to initialize and
configure the application. These adapter libraries are a convenient place to enable
the tracing instrumentation as well, transparently to the application developer. For example, the adapter can decide on behalf of the applications (a minimal sketch follows this list):
• If the applications are coded against OpenTracing, which tracer implementation to use
• Which wire format the tracer will use for context propagation
• How the tracer should export trace point data, for example, encoding,
transport, and so on
• How the sampling should be configured
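As an illustration, such an adapter could hide all of these decisions behind a one-line initialization call. The class below is hypothetical, but Configuration.fromEnv and GlobalTracer.registerIfAbsent are real entry points of the Jaeger and OpenTracing Java libraries (exact method names may differ between versions):

import io.jaegertracing.Configuration;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;

// Hypothetical internal adapter: applications call TracingBootstrap.init("my-service")
// once at startup and never deal with tracer configuration themselves.
public final class TracingBootstrap {
    private TracingBootstrap() {}

    public static void init(String serviceName) {
        // Reads sampling, reporter, and propagation settings from standard
        // JAEGER_* environment variables, so the platform team controls them
        // via deployment configuration rather than application code.
        Tracer tracer = Configuration.fromEnv(serviceName).getTracer();
        GlobalTracer.registerIfAbsent(tracer);
    }
}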
Monorepos
One of the benefits of microservices-based architecture is the autonomy provided
to the teams supporting different microservices. This autonomy sometimes
translates into each service having its own source code repository. In contrast, in
a monorepo (monolithic repository) the code for all services is co-located in the
same source code repository. Monorepos are employed by many tech companies,
including Google, Facebook, Uber, Microsoft, and Twitter [2]. The discussion of
the pros and cons of monorepos is beyond the scope of this chapter, and there
are some challenges with scaling to large repositories. However, if you do work
in a monorepo, it is very beneficial to the infrastructure teams, and can be very
helpful in rolling out end-to-end tracing.
Google was one of the early adopters of distributed tracing on a large scale.
Two factors played an important role in the success of the tracing rollout:
a single standard RPC framework and a monorepo. Although this is probably
a simplification, the Dapper team only needed to add tracing instrumentation
to the RPC framework that was widely adopted across the company, and all
applications received that change automatically due to the monorepo.
The goal here is to be able to collect some form of tracing data with minimal
involvement of the application developers. The ideas of distributed tracing and
context propagation are not new; many legacy applications may already have
some form of instrumentation that can be turned into proper tracing data with
a bit of ingenuity. It is better to have incomplete data than no data at all. Often
it is much easier to implement some central data conversion tools that can adapt
differently formatted data into the format understood by your tracing backend
than to try to replace the existing instrumentation with something new.
Where to start
In the previous sections, we discussed techniques that can help with making the
tracing rollout a "zero touch" process, that is, without requiring any additional
manual work by all the application teams.
However, not every organization is in that position, and a number of questions inevitably come up:
• Where do we start?
• Is it all-or-nothing or can we do an incremental rollout?
• How do we get buy-in from the management and application developers?
Once again, I do not presume to have a rule book that guarantees success.
I have, however, observed some common patterns from discussions with industry
practitioners. The most common advice people give is this: start with the workflows
that are most important to your business. For example, for a ridesharing app it is
more important that the workflow for taking a ride is working than the workflow
for bookmarking a location. Ideally, both should be in great shape, but outages do
happen, and the financial and reputational impact of an outage in the ride-taking
workflow is orders of magnitude larger than the other one. Tracing is a powerful
tool for troubleshooting applications during an outage, so it makes sense to
prioritize the rollout of instrumentation by ranking the workflows according
to their importance to the business.
Start with the workflows that are most important to your business.
The all-or-nothing approach is simply not feasible if we are already in the position
where manual work is required because in a large organization, full rollout may
take months. Once we know the most valuable workflows, we can start working
towards instrumenting the endpoints that serve those workflows. This is the
time for the tracing team to get their hands dirty and really dive into some of the
application code to understand the landscape better. Typically, some API service
is the entry point to any workflow, so we can start there. Incidentally, in a well-
designed API service, instrumentation implemented for one endpoint should
work equally well for all other endpoints, so the impact of the tracing team's work
is actually larger than just the main workflow.
People often do not realize that even for workflows served by dozens of
microservices, the call graph is rarely very deep. In a system that mostly uses
RPCs to communicate between microservices, rather than queues and messaging,
the call graph can be a shallow tree with just a few levels, while the branching factor
accounts for the large number of nodes. It means that if we instrument the top level
(the API service) and then only the services in the second level, that incomplete
instrumentation can already greatly improve the observability of the system and
allow us to narrow down the causes of outages to a small subset of services.
Figure 13.1: A shallow call graph with only the top-two levels instrumented for tracing still allows
us to significantly narrow down the scope of the outage investigation (box with dashed line). Circles
with solid borders represent instrumented services, while a dashed border represents services without
tracing instrumentation. Triangles with lightning indicate errors or performance problems.
As shown in Figure 13.1, we can observe the errors or performance problems in the
requests passing through services A and B, and given adequate instrumentation
in service B, we might also be able to detect that the errors are coming from service
C, not the other dependencies of B. Given the lack of instrumentation in C, we know
nothing about its dependencies in the fourth level of the tree, but we still get a pretty
accurate picture of where to look for the root cause, compared to a state when the
system had no tracing instrumentation.
As an example, let's assume that service D is the one often causing the outages.
If you are a first responder responsible for the overall workflow and you see the
trace graph in Figure 13.1, your natural reaction would be to page the on-call
person for service C, since you can trace the error there, but you cannot see any
further. The on-call person for service C cannot actually fix the problem, so after
some investigation, they realize that service D is responsible and they page the
on-call for D. Eventually the developers of service C may realize that if they just
instrument their service for tracing, then the first responder could page the on-call
for D directly, without waking them up.
The who is to blame? question can be a sufficient motivator for teams and their
managers to do the work to enable tracing instrumentation, or it may not be,
in which case some other approaches are needed, such as changing the culture.
We saw in Chapter 10, Distributed Context Propagation, how Squash debugger uses tracing
instrumentation to deliver breakpoint information to the microservices encountered by
a specific request. While this technique itself does not expose the user to the tracing tools
directly, the fact that it depends on the context propagation of the tracing infrastructure
can serve as a motivation to developers to instrument their services.
When backend engineers work on new features, they often need to send realistic
requests to their microservice, sometimes indirectly through a higher-level service, or
even a mobile app. The interaction between microservices may be fairly complicated
and easy to break during development. Tracing can be very useful in surfacing the
exact execution of the request in these cases. We discussed in Chapter 8, All About
Sampling, that the Jaeger tracers understand a special HTTP header, jaeger-debug-
id, which can be used to force sampling of a given request and to find that request
in the Jaeger UI by a correlation ID. This allows us to integrate the use of tracing
tools into the development workflows.
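For instance, a developer could force-sample a request from a small test client like the sketch below (using Java 11's built-in HTTP client). The URL and the correlation value are placeholders; jaeger-debug-id is the header name recognized by Jaeger tracers:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DebugRequest {
    public static void main(String[] args) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/api/dispatch"))  // placeholder endpoint
                // forces sampling and lets us find this request in the Jaeger UI
                .header("jaeger-debug-id", "my-dispatch-test-123")
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("status: " + response.statusCode());
    }
}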
The integration with on-call and alerting tools is perhaps the most impactful
technique. There are two types of alerting that can be used to include tracing
information:
• The most common case is alerting from metrics, that is, time series. For
example, we can set up a threshold alert that fires if the 99th percentile of
some endpoint latency exceeds some value. The system generating the alert
can query the tracing backend for representative samples of that condition
and include links to the sample traces in the alert text as it is dispatched
via a communication channel like email or PagerDuty. An engineer receiving
this alert can jump right into the tracing tool, which gives them rich context
about the outage.
• In other cases, you may develop a black-box testing system that does not
look at the metrics from the services, but instead acts as a real user and
executes some synthetic test requests against the backend. This system can
generate alerts based on repeated failures of some tests. Since it fully controls
the execution of the requests, it can be aware of the trace IDs corresponding
to those requests (it can even start a trace itself). The link to the specific trace
is included in the alert text, similar to the previous example, but it is much
more precise, as it points to the trace of the exact request that caused an error.
Finally, I want to mention a talk given by Ted Young from Lightstep at KubeCon
2018 in Seattle [3]. The talk was titled Trace Driven Development and proposed
a thought-provoking idea of writing unit tests expressed as expectations over
the trace data collected from the execution of a request. For example, given an
account in a banking application, the test checks that the account cannot allow
withdrawals of amounts larger than the current balance:
model = NewModel()
Check(model, testData)
As we can see, the expectations of the test are expressed as queries over some trace
model (the syntax is a pseudo-language at this point). The talk proposed that the
same exact code can be used not only as a unit test, but also as an integration test
against a staging environment, and even as a continuous test in production that
also acts as a monitoring tool for the correctness of the operations. Ted concluded the
talk with an observation that currently our development and monitoring practices
are often divorced and that if monitoring is not useful during development, then the
quality of the monitoring code suffers, since there is no feedback loop. The proposed
approach puts the monitoring (tracing) code directly into the development process.
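To give a flavor of the idea (the talk's own pseudo-language is only partially reproduced above), a test in that spirit might look like the following sketch. Every name here, such as TraceModel, harness, and the query methods, is invented purely for illustration:

// Entirely hypothetical sketch of a trace-driven test: assertions run against
// the trace collected from executing the request, not against internal state.
@Test
public void withdrawalCannotExceedBalance() {
    // execute a request for more money than the account holds, capture its trace
    TraceModel trace = harness.executeAndTrace(withdrawRequest("account-123", 1_000_00));

    // expectation 1: every "withdraw" span in the trace is marked as an error
    assertTrue(trace.spans("AccountService.withdraw")
                    .allMatch(span -> "true".equals(span.tag("error"))));

    // expectation 2: no span actually updated the stored balance
    assertTrue(trace.spans("BalanceStore.update").isEmpty());
}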
Figure 13.2: Summary of trace quality metrics for a microservice, with current (left) and historical (right) levels
We developed the tracing quality report when we realized that having just a "yes
or no" indicator for tracing instrumentation was not enough to adequately track
adoption of distributed tracing by the applications. Even though a given service
may have some tracing instrumentation in it and collect some tracing data, the
instrumentation may be done incorrectly, or it may be incomplete, such as passing
the tracing context to some downstream calls but not all of them.
We have implemented a streaming job that performs analysis of all collected traces
and looks for common mistakes and omissions, which we call quality metrics. For
each quality metric, the job calculates how many traces satisfied the criteria, and how
many did not. The ratio of failures to the total number of traces gives a simple score
between 0 and 1 (or as a percentage). Figure 13.3 shows an example of a breakdown
of individual metrics for a single service api-gateway. The Metric column lists the
metrics and links to the documentation explaining what the metric means, what
conditions may cause it to fail, and ways to fix it. The columns Num Passes and
Num Failures link to sample traces in the respective category, so that the service
owner can investigate what happened.
Figure 13.3: Breakdown of Tracing Quality Metrics for a single service, "api-gateway"
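The score computation itself is simple arithmetic; as a sketch (not the actual Uber implementation):

// Sketch of the per-metric score described above: the text defines it as the
// ratio of failures to the total number of traces examined; the corresponding
// pass ratio is simply 1 minus this value.
public static double failureRatio(long numPasses, long numFailures) {
    long total = numPasses + numFailures;
    return total == 0 ? 0.0 : (double) numFailures / total;
}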
The individual metrics are grouped into several categories:
• Completeness: Having a low score in this category means the traces are
likely to be broken, for example, one part of the trace is reported with one
trace ID, and another part of the trace with a different trace ID, preventing
full reassembly in the backend. Some examples of completeness metrics
include:
°° HasServerSpans: Let's say we have a trace where service A calls
service B. Service A has good instrumentation and emits a client-side
span, indicating via the peer.service tag that it's calling service B.
Service B, on the other hand, does not emit any server-side span for
this trace. We count this trace ID as a failure in service B.
• Other: Some criteria that indicate a problem that may not be easily
attributable to a service or actionable by the service developers. They
may also indicate some issues that we have not yet elevated to the
quality category because no automated process depends on them.
The breakdown screen calculates the average for the completeness and quality
categories, which are also reported to the overall quality tracking system. The
reporting system allows us to analyze which microservices, teams, or even groups
within the company have low tracing quality scores and work with them and their
management on improving the coverage. There may be many reasons why a given
organization has low cumulative tracing scores: they may be using non-standard
frameworks that do not have tracing instrumentation and need some help, or they
may be using different technologies, for example, some specific forms of queueing
and messaging that are not well supported by the mainstream instrumentation.
It may also be simply a matter of prioritization and the teams need a presentation
explaining the value and benefits of tracing.
The tracing quality report has been our main go-to tool for driving and tracking
the adoption of tracing instrumentation at Uber.
Troubleshooting guide
When we just started a push for wide adoption of tracing at Uber, there were
a lot of questions coming to the tracing team in emails, tickets, and online chat:
Why doesn't this work? I am not seeing spans; how do I investigate? and so on. We have
aggregated many of those questions and distilled them into a step-by-step guide for
troubleshooting tracing instrumentation given the specifics of Uber's development
and production environments. This guide drastically reduced the number of support
tickets and questions, freeing the team up to focus on better integration with existing
popular frameworks and enabling application developers to solve their problems
without waiting for the feedback from the tracing team.
Summary
Deploying tracing instrumentation and infrastructure in large organizations
is a challenging task, given the variety of technologies and frameworks usually
present in mature companies, whether through acquisitions or through build
fast, use whichever tools you want policies. Even with good practices around
standardization and consolidated infrastructure, the sheer number of different
business problems that engineers need to solve dictates a wide variety of tools, from
various databases to numerous machine learning frameworks. The industry has not
reached the point where all these tools are designed with support for distributed
tracing and we can just plug a tracer, and start collecting consistent tracing data
across the whole ecosystem. If you find yourself in such an organization, it can
take many months or even years to gain high levels of adoption.
In this chapter, we discussed various techniques that facilitate that task by attacking
it from different angles, through technical as well as organizational and cultural
solutions. I have previously given two talks on similar topics: Distributed Tracing
at Uber Scale [4] and Would You Like Some Tracing with Your Monitoring? [5], and the
topic clearly seemed to resonate with the audience, indicating that this is indeed
a significant pain point. I am hoping these thoughts will help people navigate
their way in this area. If you have other ideas or stories about adoption, I would
be interested to hear them; send me a direct message on Twitter: @yurishkuro.
In the next and last chapter, we will address more technical aspects of operating
tracing infrastructure, such as deploying and running the tracing backend, dealing
with traffic spikes and sampling abuses, multi-tenancy, and multiple data centers.
References
1. Conway, Melvin E. (April 1968), How do Committees Invent?, Datamation,
14 (5): 28–31: http://www.melconway.com/Home/Committees_Paper.html.
2. Monorepo. Wikipedia: https://en.wikipedia.org/wiki/Monorepo.
3. Ted Young. Trace Driven Development: Unifying Testing and Observability.
KubeCon – CloudNativeCon North America 2018: Seattle: https://
kccna18.sched.com/event/GrRF.
4. Yuri Shkuro. Distributed Tracing at Uber Scale. Monitorama PDX 2017:
https://vimeo.com/221070602.
5. Yuri Shkuro. Would You Like Some Tracing with Your Monitoring? KubeCon
– CloudNativeCon North America 2017, Austin: https://youtu.
be/1NDq86kbvbU.
Under the Hood of a Distributed Tracing System
This last chapter is aimed at engineers or DevOps people tasked with deploying
and operating a distributed tracing backend in their organization. Since my own
experience is mostly related to Jaeger, I will be using it as an example. I will try
to avoid focusing on very specific details of Jaeger configurations, since they may
change after the book is published while the project continues evolving. Instead,
I will use them to illustrate the general principles and the decisions you need to
make when deploying a tracing platform. Many of the topics that we will discuss
apply equally well to any other tracing backend, and even to using hosted solutions
like AWS X-Ray and Google Stackdriver, or offerings from commercial vendors.
Bandwidth cost
As you get serious about tracing your distributed architecture, you may start
producing so much tracing data that the cost of bandwidth for sending all this
data to a hosted solution may become a problem, especially if your company's
infrastructure operates in multiple data centers (DCs). The cost of network traffic
inside a DC is always orders of magnitude lower than sending it to the cloud.
Having said that, this particular consideration may only matter to very large,
internet-scale companies.
If you want to build additional processing and data mining that is specific to
the unique properties of your architecture or business, you need access to the raw
tracing data. While some hosted solutions provide a way to retrieve the tracing data,
it is not very cost effective (the bandwidth costs are doubled), and it is much easier
to do while the data is being collected from your applications. We saw in Chapter 12,
Gathering Insights with Data Mining, how easy it was to add data mining jobs on top
of the already-existing data collection pipeline used by Jaeger.
The Jaeger client libraries, for example, can be configured to use Zipkin's B3 headers for context propagation; similarly, the OpenCensus libraries support B3 headers, as well as the emerging
W3C Trace Context format [4]. In some way, choosing the propagation format is
even more important than choosing the instrumentation API. As we discussed in
Chapter 13, Implementing Tracing in Large Organizations, there are techniques, such as
in-house adapter libraries, that can minimize the exposure of application developers
to the details of the tracing instrumentation.
By upgrading the adapter libraries, which are under your control, you might
think that you can change the propagation format already used in production. Yet
in practice, even if you have a monorepo and you can force all microservices to pick
up a new version of the tracing library, it may still take a long time, in the order of
months, until all microservices in production are re-deployed with the new version.
To migrate from a current format X to a new format Y without breaking traces, a multi-phase rollout is needed:
1. Configure tracing libraries to be able to read both the X and Y format from
the inbound requests, but only send format X in the outbound requests.
2. Once all services are upgraded to understand format Y from the inbound
requests, upgrade the libraries again to start sending format Y in the
outbound requests.
3. Once all services are upgraded for the second time, you can make the
third change to the libraries configuration and instruct them to not parse
the format X.
If each of these steps takes several months to roll out, we can see that we are
potentially looking at a very long migration period. It is possible to shorten it slightly
by combining the first two steps and including both formats X and Y in the outbound
requests, at the expense of increasing the volume of network traffic across all of your
applications.
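To make these steps concrete, the in-house tracing library could route all inject and extract calls through a small composite codec like the sketch below. The Extractor and Injector interfaces here are simplified stand-ins, not any particular library's API; a real implementation would plug into the propagation hooks of whichever tracer is in use:

import io.opentracing.SpanContext;
import java.util.List;
import java.util.Map;

// Simplified stand-in interfaces for illustration only.
interface Extractor { SpanContext extract(Map<String, String> headers); }
interface Injector  { void inject(SpanContext ctx, Map<String, String> headers); }

// Step 1 of the migration: accept both the old format X and the new format Y
// on inbound requests, while still emitting only format X on outbound requests.
// Step 2 is the same class constructed with the format Y injector instead.
public class MigrationCodec implements Extractor, Injector {
    private final List<Extractor> accepted;  // for example: [formatY, formatX]
    private final Injector emitted;          // for example: formatX

    public MigrationCodec(List<Extractor> accepted, Injector emitted) {
        this.accepted = accepted;
        this.emitted = emitted;
    }

    @Override
    public SpanContext extract(Map<String, String> headers) {
        for (Extractor extractor : accepted) {
            SpanContext ctx = extractor.extract(headers);
            if (ctx != null) {
                return ctx;  // the first format that parses successfully wins
            }
        }
        return null;  // no recognizable context in the inbound request
    }

    @Override
    public void inject(SpanContext ctx, Map<String, String> headers) {
        emitted.inject(ctx, headers);
    }
}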
Client
The client library, or the tracing library, or the tracer, is the code that runs inside
the business application. For example, the application that is instrumented with
OpenTracing would be making calls to the OpenTracing API, and the Jaeger client
library that implements that API would be using those calls to extract tracing data
from the application. The client library is responsible for exporting the data to the
tracing backend. The most common implementation is to stash the tracing data
into an internal memory buffer, to move it off the critical path of the request, and
then send it in batches to the tracing backend asynchronously, for example, from
a separate background thread.
Most Jaeger tracers support multiple formats for data exports, as well as multiple
protocols, for example, the spans can be converted to a Thrift message and sent
to a UDP port on the localhost (to be received by the agent), or as a JSON message
to an HTTP port on the collector. Which configuration to choose depends on
the specifics of your deployment environment. Sending data to the agent has
the benefit of requiring minimal configuration of the client, since the agent is
typically configured to be available on the localhost. Using the UDP port means
that the messages are sent as fire-and-forget and could be discarded if the host
is overloaded or the agent is not keeping up with reading them off the UDP port.
Submitting spans over HTTP directly to the collector requires that the clients be given the address of the collector, which tends to complicate the configuration and deployment of the business applications. It may also require running or
configuring a third-party load balancer to avoid creating hot spots in the cluster
of collectors. However, in some cases this is the only possible configuration;
for example, if your application is deployed to the AWS Lambda platform, you
don't have the option of running the agent as a sidecar next to the application.
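For example, with the Go client the choice between the two transports usually comes down to the reporter configuration. The following sketch uses the jaeger-client-go config package; the service name and the collector hostname are made up, and the exact field names may differ between client versions:

package main

import (
    "log"

    "github.com/uber/jaeger-client-go/config"
)

func main() {
    // Option 1: fire-and-forget UDP to the local agent (minimal client config).
    agentCfg := config.Configuration{
        ServiceName: "billing-service",
        Sampler:     &config.SamplerConfig{Type: "const", Param: 1},
        Reporter:    &config.ReporterConfig{LocalAgentHostPort: "localhost:6831"},
    }

    // Option 2: HTTP directly to the collector, for example, from AWS Lambda
    // where a sidecar agent is not an option.
    collectorCfg := config.Configuration{
        ServiceName: "billing-service",
        Sampler:     &config.SamplerConfig{Type: "const", Param: 1},
        Reporter: &config.ReporterConfig{
            CollectorEndpoint: "http://jaeger-collector:14268/api/traces",
        },
    }

    tracer, closer, err := agentCfg.NewTracer()
    if err != nil {
        log.Fatal(err)
    }
    defer closer.Close()
    _ = tracer
    _ = collectorCfg
}

The UDP endpoint (port 6831 by convention) needs almost no configuration, whereas the HTTP endpoint must point at a reachable collector address.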
The diagram in Figure 14.1 shows that there is a feedback loop from the collectors to the clients called the control flow. It is a pull-based mechanism for updating certain configuration settings in the tracers, most importantly the sampling strategies when they are controlled by the adaptive sampling component in the collectors, which we discussed in Chapter 8, All About Sampling. The same
channel can be used to pass other parameters to the clients, such as throttling limits
that control how many debug traces the application is allowed to initiate, or the
baggage restrictions that control which baggage keys the application is allowed
to use.
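The control flow channel can be as simple as the tracer periodically polling an HTTP endpoint for its current sampling strategy. The sketch below is illustrative only; it assumes the agent serves strategies as JSON at /sampling on port 5778, which is the convention used by Jaeger agents, and it models just one strategy type.

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "net/url"
    "time"
)

// samplingStrategy is a simplified view of the response; the real payload
// contains more strategy types (rate limiting, per-operation, and so on).
type samplingStrategy struct {
    StrategyType          string `json:"strategyType"`
    ProbabilisticSampling *struct {
        SamplingRate float64 `json:"samplingRate"`
    } `json:"probabilisticSampling"`
}

func pollStrategy(service string) (*samplingStrategy, error) {
    u := "http://localhost:5778/sampling?service=" + url.QueryEscape(service)
    resp, err := http.Get(u)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    var s samplingStrategy
    if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
        return nil, err
    }
    return &s, nil
}

func main() {
    // Poll once a minute and apply the returned strategy to the sampler.
    for range time.Tick(time.Minute) {
        if s, err := pollStrategy("billing-service"); err == nil {
            fmt.Println("current strategy:", s.StrategyType)
        }
    }
}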
Agent
The Jaeger agent implements the sidecar design pattern, which we discussed in
Chapter 7, Tracing with Service Mesh, by encapsulating the logic of submitting data
to the collectors, including service discovery and load balancing, so that it does
not need to be repeated in the client libraries in each programming language.
It is a pretty simple and mostly pass-through component of the Jaeger backend.
There are two primary modes of deploying the agents:
• As a host-level daemon (for example, a Kubernetes DaemonSet), shared by all application instances running on that host.
• As a sidecar process or container running next to each application instance.
Similar to Jaeger clients, the agents also employ a memory buffer for the tracing
data they receive from the clients. The buffer is treated as a queue that supports
load shedding by discarding the oldest items when there is no more space in the
buffer to add new spans.
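The drop-oldest behavior can be sketched as a small bounded queue in Go; this is an illustration of the load-shedding policy rather than the agent's actual data structure:

package agent

import "sync"

// DropOldestQueue keeps at most capacity items; when full, the oldest item
// is discarded to make room for a new one, so the freshest spans survive.
type DropOldestQueue struct {
    mu       sync.Mutex
    items    []interface{}
    capacity int
}

func NewDropOldestQueue(capacity int) *DropOldestQueue {
    return &DropOldestQueue{capacity: capacity}
}

func (q *DropOldestQueue) Enqueue(item interface{}) {
    q.mu.Lock()
    defer q.mu.Unlock()
    if len(q.items) == q.capacity {
        q.items = q.items[1:] // shed load: drop the oldest item
    }
    q.items = append(q.items, item)
}

func (q *DropOldestQueue) Dequeue() (interface{}, bool) {
    q.mu.Lock()
    defer q.mu.Unlock()
    if len(q.items) == 0 {
        return nil, false
    }
    item := q.items[0]
    q.items = q.items[1:]
    return item, true
}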
Collector
The Jaeger collectors are stateless, horizontally scalable services that perform
a number of functions:
• They convert and normalize span data to a single internal data model.
• They send the normalized spans to a pluggable persistent storage.
• They contain the adaptive sampling logic that observes all inbound
span traffic and generates sampling strategies (discussed in Chapter 8,
All About Sampling).
The collectors also employ a configurable internal memory queue to better tolerate traffic spikes. When the queue is full, the collectors can shed the load
by dropping data. They are also capable of consistent down-sampling of the traffic.
These features are described later in this chapter.
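Consistent down-sampling is typically achieved by hashing the trace ID and comparing the hash against a threshold, so that every collector instance makes the same keep-or-drop decision for all spans of a given trace. A minimal sketch of that idea:

package collector

import "hash/fnv"

// KeepSpan decides whether to keep a span during down-sampling. Because the
// decision depends only on the trace ID, every collector instance reaches the
// same verdict for all spans of a given trace.
func KeepSpan(traceID string, ratio float64) bool {
    h := fnv.New64a()
    h.Write([]byte(traceID))
    // Map the hash onto [0, 1) and keep the span if it falls below the ratio.
    return float64(h.Sum64())/float64(1<<64) < ratio
}

With ratio set to 0.5, roughly half of the traces are kept, and each trace is either kept or dropped in its entirety.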
Streaming architecture
As more and more services were being instrumented with distributed tracing at
Uber, we realized that the simple push architecture we originally deployed had
certain drawbacks. In particular, it was struggling to keep up with the traffic spikes,
especially during routine DC failovers that Uber SREs perform to test capacity and
disaster recovery.
We were using Apache Cassandra as the storage backend for traces, which was
under-provisioned to handle all of the traffic during failovers. Since collectors
are designed to write to the storage directly and have only a limited amount of
memory for the internal buffer for smoothing short bursts of traffic, during failovers
the internal buffers would fill up quickly and the collectors were forced to start
dropping data.
Normally, the tracing data is already sampled, so dropping more of it should not
be an issue; however, since the spans from a single trace can arrive at any one of the stateless collectors, some of the collectors may end up dropping them while
others store the remaining spans, resulting in incomplete and broken traces.
To address this, we moved to a streaming architecture: instead of writing to the storage directly, the collectors push the spans to a Kafka stream, and separate ingester jobs consume that stream and write the data to the storage asynchronously. The result of this architectural change was that we were able to eliminate indiscriminate data loss during traffic spikes, at the expense of increased latency before the traces become available in the persistent storage.
The second significant benefit was that the stream of spans available in Kafka
allowed us to build more efficient, streaming-based data mining jobs. As part of this
work, we switched from Apache Spark to Apache Flink, because Flink provides a true streaming platform for data mining and was easier for us to deploy on Uber's infrastructure; in the end, though, both frameworks are probably equally capable of handling the task of processing tracing data.
Multi-tenancy
Multi-tenancy refers to the ability of a single system to serve the needs of different
customers, or tenants, while providing isolation of their data. This requirement is
very typical for hosted commercial solutions, but many organizations have similar
requirements internally, for example, for regulatory reasons. Each organization may
have a different notion of what constitutes a tenant and what exact requirements
for multi-tenancy are imposed on the tracing backend. Let's consider some of them
separately and discuss what implications they have on the tracing backend, and
how we can implement them.
Cost accounting
Tracing infrastructure incurs certain operational costs for processing and storing
the traces. Many organizations have internal policies where systems are charged
back for the resources they consume from other systems. If you instrument your
service to generate 10 spans for each RPC request, while a neighbor service only
produces two spans per RPC, then it's reasonable to charge your service more for
the resources of the tracing backend. At the same time, when you or your neighbor
developer look at the traces, you can see the same data, across both services. The
data access is not restricted per tenant.
This scenario is easy to implement with Jaeger and most tracing backends. The
tracing spans are already tagged with the name of the service that emitted them,
which can be used for cost accounting. If we need a coarser notion of the tenant,
such as at the level of a division or business domain rather than an individual
microservice, then it can be captured in the span tags.
Figure 14.3: Defined as a tracer-level tag, "tenant" is automatically added to all spans
Since all users of the tracing backend in this scenario are still able to see each other's data, this is not real multi-tenancy. Accordingly, it does not require any
special deployment aside from defining the tenant in the applications' environment
variables.
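For example, with the Jaeger client for Go, a tracer-level tag can be supplied through configuration, or through the JAEGER_TAGS environment variable without any code changes. Treat the field names below as illustrative, since they can vary between client versions:

package main

import (
    "log"

    "github.com/opentracing/opentracing-go"
    "github.com/uber/jaeger-client-go/config"
)

func main() {
    // Alternatively, setting JAEGER_TAGS="tenant=billing" in the application's
    // environment typically achieves the same effect without code changes.
    cfg := config.Configuration{
        ServiceName: "billing-service",
        Tags: []opentracing.Tag{
            {Key: "tenant", Value: "billing"}, // tracer-level tag added to all spans
        },
    }
    tracer, closer, err := cfg.NewTracer()
    if err != nil {
        log.Fatal(err)
    }
    defer closer.Close()
    opentracing.SetGlobalTracer(tracer)
}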
Complete isolation
Similar to hosted solutions or Software as a Service (SaaS), each tenant may want
to have complete isolation of their data from all other tenants. This is not possible
today with Jaeger without deploying isolated installations of the Jaeger backend
for each tenant.
The main reason this use case is more difficult to support is that it requires the storage implementation to be tenant-aware. Multi-tenant storage can
take different forms, with no guarantee that any particular solution will satisfy all use
cases. As an example, if we use Cassandra for storage, there are at least three different
options to support multi-tenancy: isolated clusters, a shared cluster with different
keyspaces, and a shared cluster with a single keyspace where tenancy is an attribute
of the span data. All these options have their strengths and weaknesses.
Figure 14.4: Multi-tenant setup with complete isolation of the tracing backend and a tenancy-aware storage
Aside from multi-tenant storage, there are other implications of providing complete
isolation, especially for the internal deployments. If you use a vendor-hosted
tracing backend to collect the spans, your own software stack is already (hopefully)
isolated, even if you run on a cloud platform. Therefore, the configuration for
reporting tracing data is going to be identical in all microservices. However, if
you are deploying a tracing infrastructure internally and you need isolation by
tenant, it may require running multiple per-tenant stacks of the tracing backend
components. As an example, two internal tenants may be sharing the same compute resources, for example, a Kubernetes cluster. If the Jaeger agent runs as a DaemonSet,
then the spans from different tenants might be mixed up if their applications happen
to be scheduled on the same host. It is best to run the Jaeger agent as a sidecar,
so that it can forward the data to the appropriate tracing backend (Figure 14.4).
Granular access control
A stricter requirement is granular access control, where each tenant can only see the spans emitted by its own services, even when a single trace crosses tenant boundaries. This scenario goes somewhat against the premise of distributed tracing as a tool that provides end-to-end visibility into the execution of distributed requests: if you can only see a portion of the trace, you are not getting end-to-end visibility.
However, the situation might become more common once more cloud services
start implementing distributed tracing and correlating the internal traces with
the external requests. As an example, Google and Amazon are unlikely to expose
all the intricate details of the internal execution in Spanner or DynamoDB to their
customers.
How can a single tracing backend satisfy these data access requirements and
still be useful? One option is that the data aggregations that are performed by
the backend can still operate on the full set of data, and access to the aggregation
results is controlled, similar to the access to the raw traces. This might be quite
difficult to guarantee, as there are various known techniques where the aggregate
data might reveal information that was otherwise not accessible from the raw
data with granular access control. A discussion of this topic is outside the scope
of this book.
To implement granular access controls at the level of raw trace data, the data
(spans) needs to be tagged with tenancy attributes, and the tracing backend and
its data querying components must always be aware of the tenancy information.
To my knowledge, no existing tracing system goes to this extreme today.
Security
Multi-tenancy goes hand in hand with security, both for data access controls by
the users and for securing the data transmission channels from the application to
the tracing backends. Jaeger and other tracing backends have mixed support for
these two types of security. Many components support transport layer security (TLS) between the internal parts of the backend, such as communication between Jaeger agents and collectors using gRPC with TLS enabled, or communication with the storage backends, which can also be configured with TLS certificates.
On the other hand, the Jaeger query service provides no built-in authentication
or authorization for users. The motivation for this gap is to leave this function to
external components that can be deployed alongside the Jaeger query service, such
as Apache httpd [8] or Keycloak [9] proxies.
These tools specialize in securing other components and integrating with additional services, such as single sign-on and other authentication mechanisms. By leaving the security aspects to them, the tracing
backend developers can focus on tracing-related functionality and not reinvent the
wheel. The only downside is that the granular access controls described in the previous
sections are not possible, since they do require domain knowledge about the traces.
Running in multiple DCs
Large architectures are rarely confined to a single data center (DC) or availability zone, and some requests inevitably cross DC boundaries. This can happen for legitimate reasons due to the design of the application: for example, a global company like Uber may store user profiles, favorite locations, trip history, and so on, in DCs close to the user's home location, for example, in
one or more zones of the EU region for a user who lives in Paris. When that user
travels to New York, the mobile app requests are going to be routed to zones in the
U.S. regions, since that's where the fulfillment (ride-matching) services are likely
operating. The services from the U.S. region will need to access user data located
in the EU region. Granted, the data may be replicated on-demand and cached,
but at least one request would be a cross-region request, and that's the request
we probably want to trace, since it will exhibit unusual latency. Therefore, our
tracing infrastructure may need to deal with these requests spanning multiple DCs.
There is, of course, a simple solution of running all of the tracing backend
components in just one region, with some replication for resiliency. However,
as we discussed earlier in this chapter, this may be prohibitive due to the network
bandwidth costs. It is also highly inefficient, since the majority of the requests in
a well-designed system will be local to a single zone. An efficient solution would be
to only incur a cross-zone bandwidth cost for traces that are themselves cross-zone
and handle all other traces locally.
Capturing the origin zone
To route a trace to the zone where its request originated, we first need to know that zone in every process participating in the trace. Baggage could be used to propagate the origin_zone as part of the trace context, but it would only be available in the application at runtime, since baggage items are not stored in the spans.
Enabling this approach in an already deployed tracing infrastructure requires upgrading the tracing libraries in many applications, which can take a long time. Note that we are only talking about capturing and propagating the
original zone; recording in which zone a given span was emitted is much easier,
as we can do that by enriching the span in the agents or collectors, since they know
in which zone they are running.
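Enriching spans with the zone in which they were emitted is a backend-side concern, and can be sketched as a simple processor inside the agent or collector; the Span type here is a simplified, hypothetical stand-in for the internal span model:

package collector

// Span is a simplified stand-in for the collector's span model.
type Span struct {
    Tags map[string]string
}

// ZoneEnricher adds the zone where this collector (or agent) is running as
// a tag on every span passing through, without requiring any changes to the
// instrumentation inside the applications.
type ZoneEnricher struct {
    Zone string // for example, read from the deployment environment at startup
}

func (z *ZoneEnricher) Process(span *Span) {
    if span.Tags == nil {
        span.Tags = map[string]string{}
    }
    if _, ok := span.Tags["zone"]; !ok {
        span.Tags["zone"] = z.Zone
    }
}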
Cross-zone federation
Another problem related to multi-zone deployment is getting a cross-zone view
of the system. For example, if our service has a certain SLO for p99 latency and the
service is deployed over a dozen zones, we don't want to go to a dozen different
URLs to check that number. This is an example from metrics, but we can replace
the latency SLO with any other feature that is only available from traces.
Another example is if we want to query for traces across all zones. Answering
these questions becomes much easier if you have only a single location for all of
your tracing data, but as we already discussed, that approach may not scale well.
The alternative is to build a federation layer that can fan out the requests to multiple
tracing backends and aggregate the results. Jaeger does not have such a component
today, but we will most likely build it in the future.
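Such a federation layer could be as simple as a service that fans the same query out to the per-zone query services and merges the results. Here is a rough Go sketch; it assumes each zone exposes the Jaeger query service's HTTP API at /api/traces and leaves the merging and deduplication of traces to the caller:

package federation

import (
    "io"
    "net/http"
    "sync"
)

// QueryAllZones sends the same trace query to every zone's Jaeger query
// service and collects the raw JSON responses; a real implementation would
// merge and deduplicate the traces before returning them to the caller.
func QueryAllZones(zoneURLs []string, query string) map[string][]byte {
    results := make(map[string][]byte)
    var mu sync.Mutex
    var wg sync.WaitGroup
    for _, base := range zoneURLs {
        wg.Add(1)
        go func(base string) {
            defer wg.Done()
            resp, err := http.Get(base + "/api/traces?" + query)
            if err != nil {
                return
            }
            defer resp.Body.Close()
            body, err := io.ReadAll(resp.Body)
            if err != nil {
                return
            }
            mu.Lock()
            results[base] = body
            mu.Unlock()
        }(base)
    }
    wg.Wait()
    return results
}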
Monitoring and troubleshooting
The Jaeger clients and backend components are instrumented with internal metrics that describe their state, which makes it much easier to monitor and troubleshoot the tracing infrastructure. For example, the tracer inside the HotROD application reports the number of started and finished spans, partitioned by the sampled flag, and the number of traces started or joined. Here is one such group of metrics:
hotrod_frontend_jaeger_reporter_queue_length 0
hotrod_frontend_jaeger_reporter_spans{result="dropped"} 0
hotrod_frontend_jaeger_reporter_spans{result="err"} 0
hotrod_frontend_jaeger_reporter_spans{result="ok"} 24
hotrod_frontend_jaeger_sampler_queries{result="err"} 0
hotrod_frontend_jaeger_sampler_queries{result="ok"} 0
hotrod_frontend_jaeger_sampler_updates{result="err"} 0
hotrod_frontend_jaeger_sampler_updates{result="ok"} 0
hotrod_frontend_jaeger_span_context_decoding_errors 0
hotrod_frontend_jaeger_throttled_debug_spans 0
hotrod_frontend_jaeger_throttler_updates{result="err"} 0
hotrod_frontend_jaeger_throttler_updates{result="ok"} 0
Here we see statistics about the reporter, a sub-component of the tracer that
is responsible for exporting the spans to the agent or collector. It reports the
current length of its internal queue, how many spans it sent out (successfully or not),
and how many spans it dropped because the internal buffer was full. Other Jaeger
backend components are similarly chatty about their internal state. For example,
this group of metrics from the agent describes how many batches and spans in
those batches it forwarded to the collectors:
jaeger_agent_tchannel_reporter_batch_size{format="jaeger"} 1
jaeger_agent_tchannel_reporter_batches_failures{format="jaeger"} 0
jaeger_agent_tchannel_reporter_batches_submitted{format="jaeger"} 42
jaeger_agent_tchannel_reporter_spans_failures{format="jaeger"} 0
jaeger_agent_tchannel_reporter_spans_submitted{format="jaeger"} 139
Here is another, significantly truncated set that describes the behavior of the UDP server that receives spans as packets from the clients: the packet size, how many packets were processed or dropped because the internal queue was full, the current queue size, and how many packets could not be parsed:
thrift_udp_server_packet_size{model="jaeger",protocol="compact"} 375
thrift_udp_server_packets_dropped{model="jaeger",protocol="compact"} 0
thrift_udp_server_packets_processed{model="jaeger",protocol="compact"} 42
thrift_udp_server_queue_size{model="jaeger",protocol="compact"} 0
thrift_udp_server_read_errors{model="jaeger",protocol="compact"} 0
The Jaeger query service is also instrumented with OpenTracing and can be
configured to send traces back to Jaeger. That instrumentation can be especially
useful if the query service experiences latency, because all database access paths
are generously decorated with spans.
Resiliency
I want to finish this chapter with a brief discussion of the importance of designing
a tracing backend that is resilient to potential, often unintentional, abuse. I am not
talking about an under-provisioned cluster, as there is little that can be done there.
While operating Jaeger at Uber, we have experienced a number of tracing service
degradations or even outages due to a few common mistakes.
Over-sampling
During development, I often recommend that engineers configure the Jaeger tracer with 100% sampling. Sometimes, inadvertently, the same configuration is pushed to
production, and if the service is one of those serving high traffic, the tracing backend
gets flooded with tracing data. It does not necessarily kill the backend because, as
I mentioned previously, all Jaeger components are built with in-memory buffers
for temporary storage of spans and handling short traffic spikes, and when those
buffers are full, the components begin shedding their load by discarding some
of the data. Unfortunately, the resulting degradation in the quality of the data is
nearly equivalent to the backend being down completely, since most of the Jaeger
components are stateless and are forced to discard data without any consistency
(unlike the sampling that ensures full traces are sampled and collected).
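One simple safeguard is to avoid hardcoding the sampler settings and instead read them from the environment, so the 100% development sampling never ships to production by accident. A hedged sketch using the jaeger-client-go config package (the environment variable and field names are the ones the Go client conventionally supports, but verify them for your client version):

package main

import (
    "log"
    "os"

    "github.com/uber/jaeger-client-go/config"
)

func main() {
    // Read sampler settings from JAEGER_SAMPLER_TYPE / JAEGER_SAMPLER_PARAM, etc.,
    // so that the development default of 100% sampling never has to be hardcoded.
    cfg, err := config.FromEnv()
    if err != nil {
        log.Fatal(err)
    }
    if cfg.ServiceName == "" {
        cfg.ServiceName = "billing-service" // hypothetical service name
    }
    if cfg.Sampler == nil {
        cfg.Sampler = &config.SamplerConfig{}
    }
    if os.Getenv("ENVIRONMENT") != "production" && cfg.Sampler.Type == "" {
        // Development fallback only: sample everything.
        cfg.Sampler.Type = "const"
        cfg.Sampler.Param = 1
    }
    tracer, closer, err := cfg.NewTracer()
    if err != nil {
        log.Fatal(err)
    }
    defer closer.Close()
    _ = tracer
}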
Debug traces
Debug traces are created when the application explicitly sets a sampling.priority=1 tag on a span. There are certain command-line tools at Uber that
are used mostly for debugging purposes, such as a utility to send Thrift requests,
similar to curl.
The utility automatically forced the debug flag on all traces it originated, so that developers did not have to remember to pass an additional flag. Unfortunately, on many occasions, developers would create ad hoc
scripts, maybe for one-off data migrations, that used such utilities repeatedly with
high frequency. Unlike the regular over-sampling that can be somewhat mitigated
by the down-sampling in the collectors, the debug traces were intentionally excluded
from the scope of the down-sampling.
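In OpenTracing-instrumented code, the debug flag is raised by setting the sampling.priority tag, typically through the standard ext package of opentracing-go. A minimal example; the operation name is hypothetical:

package tools

import (
    "github.com/opentracing/opentracing-go"
    "github.com/opentracing/opentracing-go/ext"
)

// traceDebugRequest starts a span and forces it to be sampled as a debug trace.
func traceDebugRequest(tracer opentracing.Tracer) {
    span := tracer.StartSpan("send-thrift-request")
    defer span.Finish()
    ext.SamplingPriority.Set(span, 1) // marks the trace as a debug trace
}

Any tool that does this automatically for every request should itself be throttled, for the reasons described above.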
Perpetual traces
This is just a peculiar story from production that I want to mention, not a persistent issue. In the early days of rolling out Jaeger at Uber, there was a bug in the instrumentation of one system that implemented a gossip protocol, where all nodes in the cluster periodically informed other nodes of some data changes.
The bug caused the nodes to always reuse the span from the previous round of gossip, meaning that new spans were perpetually generated with the same trace ID, and the trace kept growing in the storage, creating all kinds of issues and out-of-memory errors. Fortunately, the behavior was easy to spot, and we were able to locate the offending service and fix its instrumentation.
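The fix for this class of bugs is conceptually simple: each round of gossip must create a fresh span (and hence a fresh trace) instead of reusing the one from the previous round. A hypothetical sketch of the corrected pattern:

package gossip

import "github.com/opentracing/opentracing-go"

// broadcastRound performs one round of gossip. The important detail is that
// the span is created inside the function, so every round produces a new
// trace instead of growing a single perpetual one.
func broadcastRound(tracer opentracing.Tracer, peers []string, send func(peer string)) {
    span := tracer.StartSpan("gossip-round")
    defer span.Finish()
    for _, peer := range peers {
        send(peer)
    }
}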
Summary
In this chapter, we discussed many aspects of operating the tracing backend,
from architecture and deployment choices to monitoring, troubleshooting,
and resiliency measures. I intentionally kept the discussion at a slightly abstract level because the concrete details about configuring and deploying Jaeger are likely to become outdated quickly and are therefore better left to the Jaeger documentation. Instead, I tried to use Jaeger only as an illustration
of general principles that will be useful to anyone deploying and operating
a tracing infrastructure, whether it is Jaeger or any other competing solution.
References
1. Alberto Gutierrez Juanes. Jaeger integration with Kiali. Kiali project blog: https://medium.com/kialiproject/jaeger-integration-in-kiali-13bfc8b69a9d.
2. Jaeger native trace context format: https://www.jaegertracing.io/docs/1.8/client-libraries/#propagation-format.
Afterword
Congratulations, you have reached the end of the book! Sometimes, when
I finish a book, I think, finally, it's over! Other times, I think, wait, it's over?
I wish there was more! So, which "R" are you filled with: relief, or regret?
We covered a lot of ground in this book. I am confident that you have a much
better understanding of distributed tracing, which is a fairly complex and often
challenging field. I am also confident that you still have many questions. I still
have many questions myself as well! Tracing is still a very new field, and with
many more people getting into it, I expect to see a lot of innovation.
My team at Uber has pretty grandiose plans for the future of tracing. Uber's
architecture is growing more and more complex every day, counting thousands
of microservices and spanning many data centers. It is becoming obvious that
managing this infrastructure in an automated fashion requires new techniques,
and the capabilities of distributed tracing place it at the center of those techniques.
For example, Google engineers wrote the famous SRE book [1], where they advocate
for an SLA-driven approach to reliability. Unfortunately, it sounds much simpler
than it is in practice.
One of the main API gateways at Uber has over 1,000 different endpoints. How do
we even start assigning each of them SLOs, such as latency or availability? An SLO is easier to define when you have a concrete product and can estimate the impact of an SLO violation on the business. However, an API endpoint is not a product; many
of them often work in complex combinations for different products. If we can agree
on the SLO of a product or a workflow, how do we translate that to the SLOs of the
many API endpoints? Even worse, how do we translate that to SLOs for thousands
of microservices sitting below the API? This is where distributed tracing comes in.
It allows us to automatically analyze the dependencies between microservices and
the shapes of the call graphs for different business workflows, and this can be used
to inform the SLOs at multiple levels of the call hierarchy. How exactly? I don't know
yet; stay tuned.
There are more examples like this. Many organizations, including Uber, are forging
ahead with microservices-based architectures, but they have only scratched the
surface of the capabilities that distributed tracing opens up for managing those
architectures. The future is quite exciting.
At the same time, the distributed tracing field has many other, less advanced,
challenges it still needs to overcome. I have a wish list of things that I would really like to see happen in the industry sooner.
In closing, I want to issue a call to action: join us! Jaeger is an open source project
and we welcome contributions. If you have an idea, open a ticket in the main Jaeger
repository [6]. If you have already implemented it and achieved interesting results, write a blog post and tweet about it @jaegertracing, or mention it on our online chat [7]; we are always looking for interesting case studies (and can help you promote them). In turn, I am committed to continuing to release the advanced features
we are building for distributed tracing at Uber as open source tools.
Happy tracing!
References
1. Niall Richard Murphy, Betsy Beyer, Chris Jones, Jennifer Petoff. Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media, 2016.
2. Eclipse Trace Compass. An open source application to solve performance and reliability issues by reading and analyzing traces and logs of a system: https://www.eclipse.org/tracecompass/.
3. Common Trace Format. A flexible, high-performance binary trace format: https://diamon.org/ctf/.
4. Trace-Viewer. The JavaScript frontend for Chrome about:tracing and Android systrace: https://github.com/catapult-project/catapult/tree/master/tracing.
5. plexus. A React component for rendering directed graphs: https://github.com/jaegertracing/jaeger-ui/tree/master/packages/plexus.
6. Jaeger backend GitHub repository: https://github.com/jaegertracing/jaeger.
7. Jaeger project online chat: https://gitter.im/jaegertracing/Lobby.