Machine Learning Logistics
Model Management in the Real World
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Machine Learning
Logistics, the cover image, and related trade dress are trademarks of O’Reilly Media,
Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-99759-8
Table of Contents

Preface
Adding Metrics
Rule-Based Models
Using Pre-Lined Containers
6. Models in Production
Life with a Rendezvous System
Beware of Hidden Dependencies
Monitoring
7. Meta Analytics
Basic Tools
Data Monitoring: Distribution of the Inputs
8. Lessons Learned
New Frontier
Where We Go from Here
A. Additional Resources
Preface
Chapter 8 draws final lessons. In Appendix A, we offer a list of additional resources.
Finally, we hope that you come away with a better appreciation of
the challenges of real-world machine learning and discover options
that help you deal with managing data and models.
Acknowledgments
We offer a special thank you to data engineer Ian Downard and data
scientist Joe Blue, both from MapR, for their valuable input and
feedback, and our thanks to our editor, Shannon Cutt (O’Reilly) for
all of her help.
CHAPTER 1
Why Model Management?
The Best Tool for Machine Learning
One of the first questions newcomers ask is, “What’s the best tool for machine learning?” It makes sense to ask, but we recently found that the answer is somewhat surprising. Organizations that successfully put machine learning to work generally don’t limit themselves to just one “best” tool. Among a sample group of large customers that we asked, 5 was the smallest number of machine learning packages in their toolbox, and some had as many as 12.
Why use so many machine learning tools? Many organizations have
more than one machine learning project in play at any given time.
Different projects have different goals, settings, and types of data, and they may be expected to work at different scales or under a wide range of Service-Level Agreements (SLAs). The tool that is optimal in one situation
might not be the best in another, even similar, project. You can’t
always predict which technology will give you the best results in a
new situation. Plus, the world changes over time: even if a model is
successful in production today, you must continue to evaluate it
against new options.
A strong approach is to try out more than one tool as you build and
evaluate models for any particular goal. Not all tools are of equal
quality; you will find some to be generally much more effective than
others, but among those you find to be good choices, likely you’ll
keep several around.
Lesson
It’s important to recognize what data is available to be collected and how decisions can be structured, and to define a goal narrow enough to be practical to carry out. Note that domain knowledge—such as knowing that the predator is a blue jay—is critical to the effectiveness of this project.
Figure 1-2. Data flow for a prototype blue jay detection project using
tensors in the henhouse. Details are available on the Big Endian Data
blog (image courtesy of Ian Downard).
Lesson
The design provides a reasonable way to collect data for training, simplifies model development by using Inception-v3 (which is sufficient for the goals of this project), and allows the model to be deployed to the IoT edge.
SLAs
One issue with the design, however, is that the 30 seconds required for the classification step on the Pi is probably too slow to detect the blue jay in time to take an action to stop it from destroying eggs. That’s an aspect of the design that Ian is already planning to address.
Lesson
Retraining or updating models, as well as testing and rolling out entirely new models, is an important aspect of successful machine learning. This is another reason that you will need to manage multiple models, even for a single project. Also note the importance of domain knowledge: After model deployment, Ian realized that some of his chickens were not of the type he thought. The model had been trained to erroneously identify some chickens as Buff Orpingtons. As it turns out, they are Plymouth Rocks. Ian retrained the model, and this shift in results is used as an example in Chapter 7.
Lesson
The power of machine learning often leads to mission creep. After
you see what you can do, you may begin to notice new ways that
machine learning can produce useful results.
Real-World Considerations
This small tensor-in-the-henhouse project was useful as a way to get
started with deep learning image detection and the requirements of
building a machine learning project, but what would happen if you
tried to scale this to a business-level chicken farm or a commercial
enterprise that supplies eggs from a large group of farms to retail
outlets? As Ian points out in his blog:
Imagine a high-tech chicken farm where potentially hundreds of chickens are continuously monitored by smart cameras looking for predators, animal sickness, and other environmental threats. In scenarios like this, you’ll quickly run into challenges...
Data scale, SLAs, a variety of IoT data sources and locations, and the need to store and share both raw data and outputs with multiple applications or teams, likely in different locations, all complicate the matter. The same issues arise in other industries. Machine learning in the real world requires capable management of logistics, a challenge for any DataOps team. (If you’re not familiar with the concept of DataOps, don’t worry; we describe it in Chapter 2.)
People new to machine learning may think of model management, for instance, as just a need to assign versions to models, but it turns out to be much more than that. Model management in the real world is a powerful process that deals with large-scale changing data and changing goals, and with ways to isolate models so that they can be evaluated in specifically customized, controlled environments. This is a fluid process.
In the rendezvous approach, new models can be kept warmed up so that they can replace production models without significant lag time. The design strongly supports ongoing model evaluation and multi-model comparison. It’s a new approach to managing models that reduces the burden of logistics while providing exceptional levels of monitoring so that you know what’s happening.
Many of the ingredients of the rendezvous approach—use of streams, containers, a DataOps style of design—are also fundamental to the broader requirements of building a global data fabric, a key aspect of digital transformation in big data settings. Others, such as use of decoy and canary models, are specific elements for machine learning.

With that in mind, in this chapter we explore the fundamental aspects of this approach that you will need in order to take advantage of the detailed architecture presented in Chapter 3.
Stream-Based Microservices
Microservices is a flexible style of building large systems whose value is broadly recognized across industries. Leading companies, including Google, Netflix, LinkedIn, and Amazon, demonstrate the advantages of adopting a microservices architecture. Microservices enables teams to move faster and to respond in a more agile and appropriate way to changing business needs, even at the detailed level of applications and services.
What is required at the level of technical design to support a microservices approach? Independence between microservices is key. Services need to interact via lightweight connections. In the past, it has often been assumed that these connections would use RPC mechanisms such as REST that involve a call and almost immediate response. That works, but a more modern, and in many ways more advantageous, method to connect microservices is via a message stream.
A stream transport technology that decouples producers from consumers offers a key capability needed to take advantage of a flexible microservices-style design.

Streams are also a useful way to provide raw data to multiple consumers, including multiple machine learning models. Recording raw data is important for machine learning—don’t discard data that might later prove useful.
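To make the decoupling concrete, here is a minimal sketch using the open source kafka-python client; the broker address, topic name, and consumer group are assumptions for illustration, and any stream transport with similar semantics (such as MapR Streams) plays the same role.

# Minimal sketch: a producer writes raw requests to a stream, and any
# number of consumers (several models plus an archiver, for example)
# read the same topic independently, at their own pace.
# Assumes the open source kafka-python client; broker address and
# topic name are illustrative only.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("raw-requests", {"requestId": "abc123", "features": [1.0, 2.5]})
producer.flush()

# Each consumer group keeps its own cursor into the stream, so adding a
# new model (or a decoy that just archives inputs) never disturbs the
# producer or the other consumers.
consumer = KafkaConsumer(
    "raw-requests",
    bootstrap_servers="localhost:9092",
    group_id="model-a",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    request = message.value
    # ... evaluate a model on request["features"] ...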
We’ve written about the advantages of a stream-based approach in
the book Streaming Architecture: New Designs Using Apache Kafka
and MapR Streams (O’Reilly, 2016). One advantage is the role of
streaming and stream replication in building a global data fabric.
With a global data fabric, applications also need to run where you
want them. The ability to deploy applications easily in predictable
and repeatable environments is greatly assisted by the use of
containers.
Figure 2-4. Containers can remain stateless even when running stateful applications if there is data flow to and from a platform. (Based on “Data Where You Want It: Geo-Distribution of Big Data and Analytics.”)
Decisioning
Machine learning applications that fall under the description of “decisioning” basically seek a “correct answer.” Out of a short list of possible answers, the goal is to pick the right one.
Search-Like
Another category of applications involves search or recommendations. These projects use bounded input and return a ranked list of results. Multiple answers in the list may be valid—in fact the goal of search is often to provide multiple desired results. Use cases involving search-like or recommendation-based applications include automated website organization, ad targeting or upsell, and product recommendation systems for retail applications. Recommendation is also used to customize web experience in order to encourage a user to spend more time on a website.
Interactive
This last broad category contains systems that tend to be more complex and often require an even higher level of sophistication than those we’ve already described. Answers are not absolute; the validity of the output generally depends on context, often in real-world and rapidly changing situations. These applications use continuous input and
Conclusion
All of these categories of machine learning applications could benefit from some aspects of the solutions we describe next, but for search-like projects or sophisticated interactive machine learning, the rendezvous architecture will likely need to be modified to work well. Solutions for model management for all these categories are beyond the scope of this short book. From here on, we focus on applications of the decisioning type.
CHAPTER 3
The Rendezvous Architecture for
Machine Learning
We start with the shortcomings of previous designs and follow a design path to a more flexible approach.
This is very much like the first version for the henhouse monitoring
system described in Chapter 1. The biggest virtue of such a system is
its stunning simplicity, which is obviously desirable.
That is its biggest vice, as well.
Problems crop up when we begin to impose some of the other requirements that are inherent in deploying machine learning models to production. For instance, it is common in such a system to require that we can run multiple models at the same time on the exact same data in order to compare their speed and accuracy. Another common requirement is that we separate the concerns of decision accuracy from system reliability guarantees. We obviously can’t completely separate these, but it would be nice if our data scientists who develop the model could focus on science-y things like accuracy, with only broad-brush requirements around topics like redundancy, running multiple models, speed, and absolute stability.
Similarly, it would be nice if the ops part of our DataOps team could
focus more on guaranteeing that the system behaves like a solid,
Figure 3-2. A load balancer, in which each request is sent to one of the
active models at a time, is an improvement but lacks key capabilities of
the rendezvous style.
But we immediately run into a question: if we put the requests into a stream, how will the results come back? With the original discrete decision architecture in Figure 3-1, there is a response for every request, and that response can naturally include the results from the model. On the other hand, if we send the requests into a stream and evaluate those requests with lots of models, the insertion into the input stream will complete before any model has even looked at the request. Even worse, with multiple models all producing results at different times, there isn’t a natural way to pick which result we should return, nor is there any obvious way to return it. These additional challenges motivate the rendezvous design.
Message Contents

The messages between the components in a rendezvous architecture are mostly what you would expect, with conventional elements like timestamp, request id, and request or response contents, but there are some message elements that might surprise you on first examination.

The messages in the system need to satisfy multiple kinds of goals focused on operations, good software engineering, and data science. If you look at the messages from just one of these points of view, some elements of the messages may strike you as unnecessary.
All of the messages include a timestamp, message identifier, provenance, and diagnostics components. This makes the messages look roughly like the following if they are rendered in JSON form:

{
  timestamp: 1501020498314,
  messageId: "2a5f2b61fdd848d7954a51b49c2a9e2c",
  provenance: { ... },
  diagnostics: { ... },
  ... application specific data here ...
}
The first two common message fields are relatively self-explanatory.
The timestamp should be in milliseconds, and the message identifier
should be long enough to be confident that it is unique. The one
shown here is 128 bits long.
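As a small illustration, a sketch of building this common envelope in Python might look like the following; the helper function is hypothetical, and only the fields shown in the example above are taken from the text.

# Minimal sketch of building the common message envelope described above.
# The helper name is hypothetical; only the fields from the JSON example
# are assumed, and the application-specific payload is left to the caller.
import time
import uuid

def new_message(payload):
    return {
        "timestamp": int(time.time() * 1000),  # milliseconds since the epoch
        "messageId": uuid.uuid4().hex,         # 128 bits, like the example above
        "provenance": {},                      # filled in as components touch the message
        "diagnostics": {},
        **payload,                             # application-specific data
    }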
The provenance section provides a history of the processing elements, including release version, that have touched this message. It also can contain information about the source characteristics of the request in case we want to drill down on aggregate metrics. This is particularly important when analyzing the performance and impact of different versions of components or different sources of requests. Including the provenance information also allows limited trace diagnostics to be returned to the originator of the request without having to look up any information in log files or tables.
The rendezvous server uses the original request identifier to collect together results for a request in anticipation of returning a response. The return address doesn’t need to be in the score messages, because the rendezvous server will get that from the original request.

The result message has whatever result is selected by the rendezvous server and very little else other than diagnostic and provenance data. The model outputs can have many forms depending on the details of how the model actually works.
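To make that concrete, a score message from one model and the corresponding result message might look roughly like this; the field names beyond the common envelope (requestId, score, result, and the example model name) are illustrative assumptions rather than a prescribed schema.

# Hypothetical shapes for a score message (from one model) and the final
# result message (from the rendezvous server). Field names beyond the
# common envelope are illustrative assumptions.
score_message = {
    "timestamp": 1501020499120,
    "messageId": "9c0f0a51c2e44b3b8d8f0c0b1a2d3e4f",
    "provenance": {"model": "churn-model", "version": "2.3.1"},
    "diagnostics": {"latencyMs": 12},
    "requestId": "2a5f2b61fdd848d7954a51b49c2a9e2c",  # ties the score to the request
    "score": 0.87,
}

result_message = {
    "timestamp": 1501020499204,
    "messageId": "c1d2e3f4a5b6478899aabbccddeeff00",
    "provenance": {"selectedFrom": "churn-model 2.3.1"},
    "diagnostics": {},
    "requestId": "2a5f2b61fdd848d7954a51b49c2a9e2c",
    "result": 0.87,   # whatever result the rendezvous server selected
}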
Data Format

The data format you use for the messages between components in a rendezvous architecture doesn’t actually matter as much as you might think, especially given the heat that is generated whenever you bring data format conventions up in a discussion. The cost of moving messages in inefficient formats, including serializing and deserializing data, is typically massively overshadowed by the computations involved in evaluating a model. We have shown the messages as if they were JSON, but that is just because JSON is easy to read in a book. Other formats such as Arrow, Avro, Protobuf, or OJAI are more common in production. There is a substantial advantage to self-describing formats that don’t require a schema repository, but even that isn’t a showstopper.
That being said, there is huge value in consensus about messaging formats. It is far better to use a single suboptimal format everywhere than to split your data teams into factions based on format. Pick a format that everybody likes, or go with a format that somebody else already picked. Either way, building consensus is the major consideration and dominates anything but massive technical considerations.
Stateful Models

The basic rendezvous architecture allows for major improvements in the management of models that are pure functions, that is, functions that always give the same output if given the same input.

Some models are like that. For instance, the TensorChicken model described in Chapter 1 will recognize the same image exactly the same way no matter what else it has seen lately. Machine translation and speech recognition systems are similar. Only deploying a new model changes the results.
Figure 3-5. With stateful models, all dependence on external state
should be positioned in the main rendezvous flow so that all models
get exactly the same state.
The point here is that all external state computation should be external to all models. Forms of internal state that have stable and commonly used definitions can be computed and shared in the same way as external state or not, according to preference. As we have mentioned, the key rationale for dealing with these two kinds of state in this way is reproducibility. Dealing with state as described here means that we can reproduce the behavior of any model by using only the data that the decoy model has recorded and nothing else. The idea of having such a decoy model that does nothing but archive common inputs is described more fully in the next section.
For detecting input shifts, the distribution of outputs for the canary can be recorded and recent distributions can be compared to older distributions. For simple scores, distribution of score can be summarized over short periods of time using a sketch like the t-digest. These can be aggregated to form sketches for any desired period of time, and differences can be measured (for more information on this, see Chapter 7). We then can monitor this difference over time, and if it jumps up in a surprising way, we can declare that the canary has detected a difference in the inputs.
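A minimal sketch of that comparison is shown below; it uses plain quantiles over raw scores for clarity, whereas a production system would summarize with t-digest sketches as described here and in Chapter 7, and the alert threshold is an assumption to be tuned.

# Minimal sketch: compare a recent window of canary scores against an
# older reference window by measuring the largest gap between their
# quantiles. The 0.1 threshold is an assumption to be tuned per model.
import numpy as np

def distribution_shift(reference_scores, recent_scores):
    qs = np.linspace(0.01, 0.99, 99)
    ref_q = np.quantile(reference_scores, qs)
    new_q = np.quantile(recent_scores, qs)
    return float(np.max(np.abs(ref_q - new_q)))

# Example usage with synthetic data:
rng = np.random.default_rng(0)
reference = rng.beta(2, 5, size=10_000)    # older score distribution
recent = rng.beta(2, 3, size=1_000)        # recent scores have drifted
if distribution_shift(reference, recent) > 0.1:
    print("canary has detected a change in the inputs")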
We also can compare the canary directly to other models. As a bonus, we not only can compare aggregated distributions, we can use the request identifier to match up all the model results and compare each result against all others.
Adding Metrics
As with any production system, reporting metrics on who is doing
what to whom, and how often, is critical to figuring out what is
really going on in the system. Metrics should not be (but often are)
an afterthought to be added after building an entire system. Good
metrics are, however, key to diagnosing all kinds of real-world issues
that crop up, be they model stability issues, deployment problems,
or problems with the data logistics in a system, and they should be
built in from the beginning.
With ordinary microservices, the primary goal of collecting metrics is to verify that a system is operating properly and, if not, to diagnose the problem. Problems generally have to do with whether or not a system meets service-level agreements.
With machine learning models, we don’t just need to worry about operational metrics (did the model answer, was it quick enough?); we also need to worry about the accuracy of the model (when it did answer, did it give the right answer?). Moreover, we usually expect a model to have some error rate, and it is normal for accuracy to degrade over time, especially when the model has real-world adversaries, as in fraud detection or intrusion detection. In addition, we need to worry about whether the input data has changed in some way, possibly by data going missing or by a change in the distribution of incoming data. We go into more detail on how to look for data changes in Chapter 7.
To properly manage machine learning models, we must collect metrics that help us to understand our input data and how our models are performing on both operational and accuracy goals. Typically, this means that we need to record operational metrics to answer operational questions and record scores for multiple models to answer questions about accuracy. Overall, there are three kinds of questions that need to be answered.
The first kind of metrics helps us with the overall operation of the system. We can find out whether we are meeting our guarantees, how traffic volumes are changing, and how to size the system going forward, and we can diagnose system-level issues like bad hardware or noisy neighbors. We also can watch for unexpected model performance or input data changes. It might be important to be able to inject tags into this kind of metrics so that we can drill into these aggregates to measure performance for special customers, queries that came from particular sources, or other classes of requests that we have some hint deserve special attention. We talk more about analyzing aggregated metrics in Chapter 7.
The second kind of metrics helps us drill into the specific timing
details of the system. This can help us debug issues in rendezvous
policies and find hot-spots in certain kinds of queries. These trace-
based measurements are particularly powerful if we can trigger the
monitoring on a request-by-request basis. That allows us to run low
Anomaly Detection

On seriously important production systems, you also should be running some form of automated analytics on your logs. It can help dramatically to do some anomaly detection, particularly on latency for each model step. We described how to automate much of the anomaly detection process in our book Practical Machine Learning: A New Look at Anomaly Detection (O’Reilly, 2014). Those methods are very well suited to the components of a rendezvous architecture.
The basic idea is that there are patterns in the metrics that you collect that can be automatically detected and give you an alert when something gets seriously out of whack. The model latencies, for instance, should be nearly constant. The number of requests handled per second can be predicted based on request rates over the last few weeks at a similar time of day.
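As a rough illustration (not the technique from that book), two checks in this spirit might look like the following; the tolerance values are illustrative assumptions.

# Minimal sketch of two simple checks in the spirit described above:
# model latency should be nearly constant, and the current request count
# should be close to what the same hour looked like in recent weeks.
# Thresholds are illustrative assumptions, not recommendations.
import numpy as np

def latency_alert(recent_latencies_ms, baseline_median_ms, tolerance=3.0):
    # Alert if the recent median latency drifts far from the baseline.
    return np.median(recent_latencies_ms) > tolerance * baseline_median_ms

def request_rate_alert(current_hour_count, same_hour_counts_past_weeks, band=0.5):
    # Alert if this hour's traffic is far outside the range seen at the
    # same hour of day over the last few weeks.
    expected = np.median(same_hour_counts_past_weeks)
    return abs(current_hour_count - expected) > band * expected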
Rule-Based Models
Nothing says that we have to use machine learning to create models, even if the title of this book seems to imply that we do.

In fact, it can be quite useful to build some models by hand using a rule-based system. Rule-based models can be useful whenever there are hard-and-fast requirements (often of a regulatory nature) that require exact compliance. For example, if you have a requirement that a customer can’t be called more often than once every 90 days, rules can be a good option. On the other hand, rules are typically a very bad way to detect subtle relationships in the data; thus, most fraud detection models are built using machine learning. It’s fairly common to combine these types of systems in order to get some of the best out of both types. For instance, you can use rules to generate features for the machine learning system to use. This can make learning much easier. You could also use rules to post-process the output of a machine learning system into specific actions to be taken.
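A minimal sketch of both patterns, with hypothetical field names and an assumed score threshold, might look like this:

# Minimal sketch of combining rules with a learned model: a hard
# regulatory rule becomes a feature, and another rule post-processes the
# model's score into an action. The 90-day limit follows the example in
# the text; field names and the score threshold are assumptions.
from datetime import datetime, timedelta

def rule_features(customer, now=None):
    now = now or datetime.utcnow()
    recently_called = (now - customer["last_call"]) < timedelta(days=90)
    return {"recently_called": int(recently_called)}

def decide_action(score, features):
    # The hard rule wins regardless of what the model says.
    if features["recently_called"]:
        return "do_not_call"
    # Otherwise post-process the learned score into an action.
    return "call" if score > 0.7 else "no_action"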
That said, all of the operational aspects of the rendezvous model apply just as well if you are using rule-based models or machine learning models, or even if you are using composites. The core ideas of injecting external state early, recording inputs using a decoy, comparing to a canary model and using streams to connect components to a rendezvous server still apply.
Investing in Improvements
Over time, systems that use machine learning heavily can build up large quantities of hidden technical debt. This debt takes many forms, including data coupling between models, dead features, redundant inputs, hidden dependencies, and more. Most important, this debt is different from the sort of technical debt you find in normal software, so the software and ops specialists in a DataOps team won’t necessarily see it, and data scientists, who are typically used to working in a cloistered and sterilized environment, won’t recognize it either, because it is an emergent feature of real-world deployments.
A variety of straightforward things can help with this debt. For instance, you should schedule regular efforts to do leave-one-out analysis of all input variables in your model. Features that don’t add performance are candidates for deletion. If you have highly collinear features, it is important to make time to decide whether both variables really need to be referenced by the model. Many times, one of the variables has a plausible causal link with the desired output, while the other is only a spurious and temporary correlation.
Other Considerations
During development, the raw and input streams can be replicated
into a development environment, as shown in Figure 4-1.
If the external data available in the production environment is suitable for the new model, it is easiest to replicate just the input stream to avoid replicating the external data injector. On the other hand, if the new model requires a change to the external data injector, you should replicate the raw stream, instead. In either case, a new model, possibly with new internal state management, can be developed entirely in a development environment but with real data presented in real time. This ability to faithfully replicate the production environment can dramatically decrease the likelihood of models being pulled from production due to simple configuration or runtime errors.
In fact, with a rendezvous architecture it is probably easier to deploy a model into a production setting than it is to gather training data and do offline evaluation.
With a rendezvous architecture, it is also possible to replicate the input stream to a development machine without visibly affecting the production system. That lets you deploy models against real data with even lower risk.
The value in this is that you can take your new model’s output and compare that output, request by request, against whatever benchmark system you want to look at. Because of the way that the rendezvous architecture is built, you know that both models will have access to exactly the same input variables and exactly the same external state. If you replicate the input stream to a development environment, you also can run your new model on the live data and pass data down to replicas of all downstream models, thus allowing you to quantify differences in results all the way down the dependency chain.
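A minimal sketch of that request-by-request comparison might look like this, assuming the two systems' scores have already been collected into dictionaries keyed by request identifier:

# Minimal sketch: join the new model's scores to the benchmark's scores
# by request identifier and summarize the differences. The input format
# (dicts keyed by requestId) is an assumption for illustration.
import numpy as np

def compare_by_request(benchmark_scores, candidate_scores):
    shared = set(benchmark_scores) & set(candidate_scores)
    diffs = np.array([candidate_scores[r] - benchmark_scores[r] for r in shared])
    has_data = len(diffs) > 0
    return {
        "requests_compared": len(shared),
        "fraction_identical": float(np.mean(np.abs(diffs) < 1e-6)) if has_data else 0.0,
        "mean_abs_difference": float(np.mean(np.abs(diffs))) if has_data else 0.0,
        "max_abs_difference": float(np.max(np.abs(diffs))) if has_data else 0.0,
    }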
The first and biggest question to be answered by these comparisons is, “What is the business risk of deploying this model?” Very commonly, the new model is just a refinement of an older model still in production, so it is very likely that almost all of the requests will produce almost identical results. To the extent that the results are identical, you know that the performance is trivially the same, as well. This means that you already have a really solid bound on the worst-case risk of rolling out this new model. If you do a soft rollout and give the new model only 10 percent of the bandwidth, the worst-case risk is smaller by a factor of 10.
You can refine that estimate of expected risk/benefit of the new model by examining a sample of the differing records that is stratified by the score difference. Sometimes, you can get ground truth data for these records by simply waiting a short time, but often, finding the true outcome can take longer than you have. In such cases, you might need to use a surrogate indicator. For example, if you are estimating likelihood of response to emails, response rate at two hours is likely a very accurate surrogate for the true value of response rate after 30 days.
With the raw scores, the drastic differences in score calibration prevent any understanding of the correlation of scores below about 2 on the x-axis and obscure the quality of correlation even where scales are comparable. The q-q diagram on the right, however, shows that the same two scores are an extremely good match for the top 10 percent of all scores and a very good match for the top 50 percent.
You can combine multiple t-digest sketches easily. This means that we can store sketches for score distributions for each, say, 10-minute period for each of the unique combinations of request qualifiers such as time of day, or source geo-location, or client type. Then, we can retrieve and combine sketches for any time period and any combination of qualifiers to get a sketch of just the data we want. This capability lets us compare distributions for different models, for different times in terms of raw values, or in terms of quantiles. Furthermore, we can retrieve the t-digest for any period and condition we want and then convert a set of scores into quantiles by using that sketch.
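A minimal sketch using the open source tdigest package for Python (one of several t-digest implementations; the storage layer is left out) might look like this:

# Minimal sketch using the open source 'tdigest' package (one of several
# t-digest implementations). Sketches are built per 10-minute window and
# merged on demand; how you persist them is up to your database.
from tdigest import TDigest

def sketch_window(scores):
    digest = TDigest()
    digest.batch_update(scores)
    return digest

def combined_quantiles(window_sketches, percentiles=(10, 50, 90, 99)):
    combined = TDigest()
    for sketch in window_sketches:
        combined = combined + sketch   # t-digests merge cheaply
    return {p: combined.percentile(p) for p in percentiles}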
Many of the comparisons described here would be very expensive to
do with large numbers of measurements, but they can be completed
in fractions of a second by storing t-digests in a database.
…appear to be well-designed microservices with high degrees of isolation.
There are no silver bullet answers to these problems. Available solutions are pragmatic and have to do with making basic operations easier so that you have more time to think about what is really happening. The rendezvous architecture is intended to make the day-to-day operations of deploying and retiring machine learning models easier and more consistent. Getting rid of mundane logistical problems by controlling those processes is a big deal because it can give you the time you need to think about and understand your systems. That is a prime goal of the rendezvous architecture.
The architecture is also designed to provide an enormous amount of
information about the inputs to those models and what exactly
those models are doing. Multiple models can be run at the same
time for comparison and cross-checking as well as helping to meet
latency and throughput guarantees.
1. Start any new components. This could be the external state systems or the rendezvous server. All new components should start in a quiescent state and should ignore all stream inputs.
2. Inject a state snapshot token into the input of the system. There should be one token per partition. As this snapshot token passes through the system, all external state components will snapshot their state to a location specified in the token and pass the snapshot token into their output stream.
3. The new versions of the external state maintenance systems will look for the snapshots to complete and will begin updating their internal state from the offsets specified in the snapshots. After the new external state systems are up to date, they will emit snapshot-complete tokens.
4. When all of the new external state systems have emitted their snapshot-complete tokens, transition tokens are injected into the input of the system, again one token per partition.

When this sequence of steps is complete, the system will have upgraded itself to new versions with no latency bumps or loss of state. You can keep old versions of the processes running until you are sure that you don’t need to roll back to the previous version. At that time, you can stop the old versions.
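A minimal sketch of step 2, injecting one token into each partition of the input stream, might look like the following; the token schema, topic name, and snapshot location are entirely hypothetical, and only "one token per partition" comes from the text.

# Minimal sketch of injecting a snapshot token into every partition of
# the input stream (step 2 above), using the kafka-python client from
# the earlier examples.
import json
import time
import uuid
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

token = {
    "timestamp": int(time.time() * 1000),
    "messageId": uuid.uuid4().hex,
    "tokenType": "state-snapshot",
    "snapshotLocation": "/snapshots/" + uuid.uuid4().hex,  # where state should be written
}

partitions = producer.partitions_for("input")  # set of partition ids for the input stream
for p in partitions:
    producer.send("input", value=token, partition=p)
producer.flush()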
Monitoring
Monitoring is a key part of running models in production. You need to be continually looking at your models, both from a data science perspective and from an operational perspective. In terms of data science, you need to monitor inputs and outputs, looking for anything that steps outside of established norms for the model. From an operational perspective, you need to be looking at latencies, memory size, and CPU usage for all containers running the models. Chapter 7 presents more details on how to do this monitoring.
CHAPTER 7
Meta Analytics
I know who I WAS when I got up this morning, but I think I must have
been changed several times since then.
—Alice in Wonderland, by Lewis Carroll
Generally, meta analytics can be divided into data monitoring and
operational monitoring. Data monitoring has to do with looking at
the input data and output results of models to see how things are
going. Operational monitoring typically ignores the content and
looks only at things like latencies and request frequency. Typically,
data monitoring appeals and makes sense to data scientists and data
engineers, whereas operational monitoring makes more sense to
operations specialists and software engineers. Regardless, it is
important that the entire team takes meta analytics seriously and
looks at all the kinds of metrics to get a holistic view of what is
working well and what is not.
Basic Tools
There are a few basic techniques that are very useful for both data monitoring and operational monitoring in meta analytics, many of which are surprisingly little known. All of these methods have as a goal the comparison of an estimate of what should be (we call that “normal”) with what is happening now. In model meta analytics, two key tools are the ability to look for changes in event rates and the ability to estimate the distribution of a value. We can combine these tools in many ways with surprising results.
Figure 7-1. Detecting shifts in rate is best done using n-th order difference in event time. The t-digest can help pick a threshold.
You may have noticed the pattern that you can’t see changes that are (too) small (too) quickly without paying a price in errors. By setting your thresholds, you can trade off detecting ghost changes against missing real changes, but you can’t fundamentally have everything you want. This is a kind of Heisenberg principle that you can’t get around with discrete events.
Similarly, all of the event time methods talked about here require an estimate of the current rate λ. In some cases, this estimate can be trivial. For instance, if each model evaluation requires a few database lookups, the request rate multiplied by an empirically determined constant is a great estimator of the rate for database lookups. In addition, the rate of website purchases should predict the rate of frauds detected. These trivial cross-checks between inputs and outputs sound silly but are actually very useful as monitoring signals.
Aside from such trivial rate predictions, more interesting rate predictions based on seasonality patterns that extend over days and weeks can be made by computing hourly counts and building a model for the current hour’s count based on counts for previous hours over the last week or so. Typically, it is easier to use the log of these counts than the counts themselves, but the principle is basically the same. Using a rate predictor of this sort, we can often predict the number of requests that should be received by a particular model within 10 to 20 percent relative error.
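A minimal sketch of such a seasonal rate predictor might look like this; working in log space and the 10 to 20 percent band come from the text, while the rest (same hour over the previous seven days, simple averaging) is an illustrative assumption.

# Minimal sketch of a seasonal rate predictor: predict this hour's request
# count for a model from the log of counts seen at the same hour of day
# over the last week, and flag counts outside a relative-error band.
import numpy as np

def predict_hourly_count(same_hour_counts_last_week):
    # Average in log space, as suggested in the text, then convert back.
    counts = np.asarray(same_hour_counts_last_week, dtype=float)
    return float(np.exp(np.mean(np.log(counts + 1)))) - 1

def rate_alert(observed_count, same_hour_counts_last_week, relative_error=0.2):
    expected = predict_hourly_count(same_hour_counts_last_week)
    return abs(observed_count - expected) > relative_error * expected

# Example: counts for 10:00-11:00 over the previous seven days.
history = [1180, 1240, 1210, 1305, 1190, 1260, 1225]
print(rate_alert(2400, history))   # True: traffic roughly doubled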
From just the data presented here, it is absolutely clear that a change happened, but it isn’t clear whether the world changed (i.e., Buff Orpington chickens disappeared) or whether the model was changed.

This is exactly the sort of event that one-dimensional distribution testing on output scores can detect and highlight. The distinction between the two possibilities is something that having a canary model can help us make.
A good way to highlight changes like this is to use a histogramming
algorithm to bin the scores by score range. The appearance of a
score in a particular bin is an event in time whose rate of occurrence
can be analyzed using the rate detection methods described earlier
in this chapter. If we were to use bins for each step of 0.1 from 0 to 1
in score, we would see nonzero event counts for all of the bins up to
sample 120. From then on, all bins except for the 0.0–0.1 bin would
get zero events.
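A minimal sketch of that fixed-bin version might look like this; in practice, automatically chosen bins (as discussed next) work better.

# Minimal sketch of fixed-width score binning: count scores per 0.1-wide
# bin for each window and report bins that were active in the reference
# window but have gone silent, which is the signature described above.
import numpy as np

BINS = np.linspace(0.0, 1.0, 11)   # edges for bins 0.0-0.1, ..., 0.9-1.0

def bin_counts(scores):
    counts, _ = np.histogram(scores, bins=BINS)
    return counts

def silent_bins(reference_scores, recent_scores):
    ref, new = bin_counts(reference_scores), bin_counts(recent_scores)
    return [(float(BINS[i]), float(BINS[i + 1]))
            for i in range(len(ref))
            if ref[i] > 0 and new[i] == 0]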
The bins you choose can sometimes be picked based on your domain knowledge, but it is often much handier to use bins that are picked automatically. The t-digest algorithm does exactly this and does it in such a way that the structure of the distribution near the extremes is very well preserved.
K-Means Binning
Taking the issue of the model change in TensorChicken again, we can see that not only did the distribution of one of the scores change, but the relationship between output scores changed as well. Figure 7-3 shows this.
Figure 7-3. The scores before the model change (black) were highly correlated, but after the model change (red), the correlation changed dramatically. K-means clustering can help detect changes in distribution like this.
Figure 7-4. The float histogram uses variable width bins. Here, we
have synthetic data in which one percent of the data has high latency
(horizontal axis). A linear scale for frequency (vertical axis) makes it
hard to see the high latency samples.
These problems can be highlighted by switching to nonlinear axes, and the nonuniform bins in the FloatHistogram also help. Figure 7-5 shows the same data with a logarithmic vertical axis.
With the logarithmic axis, the small cluster of slow results becomes
very obvious and the prevalence of the problem is easy to estimate.
Event rate detectors on the bins of the FloatHistogram could also be
used to detect this change automatically, as opposed to visualizing
the difference with a graph.
Latency Traces
The latency distributions shown in the previous figures don’t provide the specific timing information that we might need to debug certain issues. For instance, is the result selection policy in the rendezvous actually working the way that we think it is?

To answer this kind of question, we need to use a trace-based metrics system in the style of Google’s Dapper (Zipkin and HTrace are open source systems in the same style). The idea here is that the overall life cycle of a single request is broken down to show exactly what is happening in different phases to get something roughly like what is shown in Figure 7-6.
1 This is a good example of a “feature” that can cause serious latency surprises: “Docker
operations slowing down on AWS (this time it’s not DNS)”.
And this is an example of contention that is incredibly hard to see, except through
latency: “Container isolation gone wrong”.
Combining Alerts
When you have a large number of monitors running against complex systems, the danger changes from the risk of not noticing problems because there are no measurements available to a risk of not noticing problems because there are too many measurements and too many false alarms competing for your attention.
In highly distributed systems, you also have substantial freedom to ignore certain classes of problems, since systems like the rendezvous architecture or production-grade compute platforms can do a substantial amount of self-repair.
As such, it is useful to approach alerting with a troubleshooting time
budget in mind. This is the amount of time that you plan to spend
New Frontier
Machine learning, at scale, in practical business settings, is a new frontier, and it requires some rethinking of previously accepted methods of development, social structures, and frameworks. The emergence of the concept of DataOps—adding data science and data engineering skills to a DevOps approach—shows how team structure and communication are changing to meet life on this new frontier. The rendezvous architecture is an example of the technical frameworks that are emerging to make it easier to manage machine learning logistics.
The old lessons and methods are still good, but they need to be
updated to deal with the differences between effective machine
learning applications and previous kinds of applications.
We have described a new approach that makes it easier to develop and deploy models, offers better model evaluation, and improves the ability to respond.
These resources will help you plan, develop, and manage machine
learning systems as well as explore the broader topic of stream-first
design to build a global data fabric:
Selected O’Reilly Publications by Ted Dunning
and Ellen Friedman
• Data Where You Want It: Geo-Distribution of Big Data and Analytics (March 2017)
• Streaming Architecture: New Designs Using Apache Kafka and
MapR Streams (March 2016)
• Sharing Big Data Safely: Managing Data Security (September
2015)
• Real World Hadoop (January 2015)
• Time Series Databases: New Ways to Store and Access Data
(October 2014)
• Practical Machine Learning: A New Look at Anomaly Detection
(June 2014)
• Practical Machine Learning: Innovations in Recommendation
(January 2014)