Statistical Distances For Machine Learning
Introduction
Statistical distances are used to quantify the difference between two distributions and are extremely useful in ML observability. This blog post goes into statistical distance measures and how they are used to detect common machine learning model failure modes.
Why Statistical Distance Checks?
Data problems in machine learning come in a wide variety, ranging from sudden data pipeline failures to long-term drift in feature inputs. Statistical distance measures give teams an indication of changes in the data affecting a model and insights for troubleshooting. In the real world, post-deployment, these data distribution changes can arise in a myriad of ways and cause model performance issues.
Here are some real-world data issues that we’ve seen in practice, all of which can be caught using statistical distance checks.
In the image below, you can see that examples of a fixed reference distribution include a snapshot of the training distribution, the distribution at initial model deployment (or a time when the distribution was thought to be stable), and the validation/test set distribution. An example of a statistical distance using a fixed reference distribution is to compare the model’s prediction distribution from the training environment (distribution A) to the model’s prediction distribution from the production environment (distribution B).
What you set as the reference distribution depends on what you are trying to catch, and below we dive into common statistical distance checks to set up for models.
Identifying a distribution change in a feature can give an early indication of a model performance regression, or signal that the feature can be dropped if it’s not impacting model performance. It can lead to model retraining if the impact on model performance is significant. While a feature distribution change should be investigated, it does not always mean that there will be a correlated performance issue. If the feature was less important to the model and didn’t have much impact on the model predictions, then the feature distribution change might be more of an indication that it can be dropped.
The goal of output drift checks is to detect large changes in the way the model is working relative to training. While these checks are extremely important for ensuring that models are acting within the boundaries previously tested and approved, they do not guarantee that there is a performance issue. Just as a feature distribution change does not necessarily mean there is a performance issue, a prediction distribution change doesn’t guarantee one either. A common example: if a model is deployed to a new market, there can be distribution changes in some model inputs and also in the model output.
MODEL ACTUALS
Actuals Distribution at Training vs Actuals Distribution in Production
Actuals data might not always be available within a short time horizon after the model inferences have been made. However, statistical distance checks on actuals distributions help identify if the structure learned from the training data is no longer valid. A prime example of this is the Covid-19 pandemic causing everything from traffic, shopping, and demand patterns to be vastly different today from what the models in production had learned before the pandemic began. Apart from just large-scale shifts, knowing whether the actuals distribution in production differs from the training data for certain cohorts can identify if there are pockets of the data where the relationships the model learned no longer hold.
PSI
The PSI metric has many real-world applications in the finance industry.
It is a great metric for both numeric and categorical features where the
distributions are fairly stable.
Equation:
PSI = ∑(Pa - Pb) • ln(Pa/Pb)
PSI is an ideal distribution check to detect changes in the distributions that might make a feature less valid as an input to the model. It is used often in finance to monitor input variables into models. It has some well-known thresholds and useful properties.
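As a concrete illustration, here is a minimal sketch of the PSI equation above in Python; the small `eps` floor for empty bins is our own convention, not part of the standard definition:

```python
import numpy as np

def psi(p_ref, p_prod, eps=1e-4):
    """Population Stability Index between two binned distributions.

    p_ref, p_prod: per-bin proportions (each should sum to ~1).
    eps floors empty bins so ln(0) and divide-by-zero are avoided.
    """
    p_ref = np.clip(np.asarray(p_ref, dtype=float), eps, None)
    p_prod = np.clip(np.asarray(p_prod, dtype=float), eps, None)
    return float(np.sum((p_ref - p_prod) * np.log(p_ref / p_prod)))
```

The symmetry discussed below follows directly from the formula: swapping the two distributions flips the sign of both factors, leaving each term unchanged.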
The PSI is symmetric -- that is, if you reverse the two distributions, the PSI value is the same. In the example above we have switched the purple graph and the yellow graph from the previous example, and the value of 0.98 is the same as before the reversal.
The example below uses the Population Stability Index (PSI) on an important feature. The check is run periodically, trading off how quickly you want to be alerted on change versus the type of change you are trying to detect. When the check exceeds a well-defined threshold, the change needs to be investigated and could indicate a model performance issue.
The following shows a live feature where the stability index is far below the 0.15 limit that was set (0.1-0.25 is the finance industry's standard range). On setup, we recommend looking at a multi-day window of statistics when setting the detection range.
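A sketch of such a periodic check, using hypothetical data and the 0.15 alert limit mentioned above; the bin count and window sizes here are illustrative assumptions:

```python
import numpy as np

def psi(p_ref, p_cur, eps=1e-4):
    # PSI over per-bin proportions; eps floors empty bins.
    p_ref = np.clip(p_ref, eps, None)
    p_cur = np.clip(p_cur, eps, None)
    return float(np.sum((p_ref - p_cur) * np.log(p_ref / p_cur)))

rng = np.random.default_rng(0)

# Bin edges are fixed once, from a reference (e.g. training) window.
reference = rng.normal(0.0, 1.0, 10_000)
edges = np.histogram_bin_edges(reference, bins=10)
ref_props = np.histogram(reference, bins=edges)[0] / reference.size

ALERT_LIMIT = 0.15  # within the 0.1-0.25 industry range discussed above

def check_window(values):
    # Values outside the reference edges fall out of the histogram;
    # proportions are normalized over what lands in the bins.
    counts = np.histogram(values, bins=edges)[0]
    props = counts / max(counts.sum(), 1)
    score = psi(ref_props, props)
    return score, score > ALERT_LIMIT
```

Running `check_window` on each day's (or week's) production sample and alerting when the boolean flips is the essence of the check described above.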
Equation:
KL Divergence = Ea[ln(Pa/Pb)] = ∑ (Pa) • ln(Pa/Pb)
JS Divergence has some useful properties. Firstly, it’s always finite, so there are no divide-by-zero issues; divide-by-zero issues come about when one distribution has values in regions the other does not. Secondly, unlike KL Divergence, it is symmetric.
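The JS computation can be sketched as the average KL divergence of each distribution to their 50/50 mixture; a minimal version, assuming the inputs are already binned proportions:

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence; terms with p = 0 contribute nothing."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p, q):
    """Average KL of each distribution to their 50/50 mixture.

    The mixture is nonzero wherever p or q is, which is why JS
    stays finite even for non-overlapping distributions.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

For fully disjoint distributions, JS reaches its maximum of ln 2 ≈ 0.693 rather than blowing up the way KL would.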
The moving window changes each period for every distribution check; it represents a sample of the current period’s distribution. JS Divergence has a unique issue with a moving window: the mixture distribution changes with each window of time you are comparing. This causes the meaning of the value returned by JS Divergence to shift each period, so comparisons across different time frames are made on a different basis, which is not what you want.
PSI and JS are both symmetric and both have potential to be used for metric monitoring. For moving windows used for alerts, we recommend PSI, with some adjustments, over JS as a distance measure.
Equation:
EMD_(i+1) = (A_i + EMD_i) − B_i
The Earth Mover’s Distance measures the distance between two probability distributions over a given region. It is useful for statistics on non-overlapping numerical distribution moves and for higher-dimensional spaces (images, for example).
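For the one-dimensional binned case, the running-sum recursion above can be sketched directly; the total cost is the sum of the absolute “dirt” carried between adjacent bins, with bin width treated as 1:

```python
def emd_1d(p_a, p_b):
    """1-D Earth Mover's Distance via EMD_(i+1) = (A_i + EMD_i) - B_i.

    p_a, p_b: per-bin proportions of the two distributions.
    Returns total mass moved, in units of bins.
    """
    carried = 0.0  # EMD_i: dirt carried into the current bin
    total = 0.0
    for a_i, b_i in zip(p_a, p_b):
        carried = a_i + carried - b_i
        total += abs(carried)  # moving the surplus/deficit one bin costs |carried|
    return total
```

For example, `emd_1d([1, 0, 0], [0, 0, 1])` returns 2.0: all of the mass has to travel two bins.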
Image by Author. Distribution Changes May Not Always Mean a Performance Issue
The changes in a distribution may or may not cause large downstream issues.
The point is that no change should be looked at in a vacuum, or investigated
just because something changed. The changes should be filtered against other
system performance metrics to investigate the ones that matter.
Contact Us
If this blog post caught your attention and you’re eager to learn more, follow
us on Twitter and Medium! If you’d like to hear more about what we’re doing
at Arize AI, reach out to us at [email protected]. If you’re interested in
joining a fun, rockstar engineering crew to help make models successful in
production, reach out to us at [email protected]!
Types of Bins:
CATEGORICAL:
Categorical variables are binned on the value itself, based on inputs before one-hot encoding; the text string represents the bin. Depending on how the feature pipeline handles capitalization, a capitalized word might or might not be binned separately from its lowercase counterpart.
NUMERIC:
The binning of a numeric feature is not absolutely required to get a metric, but it’s very helpful for visualization and debugging.
As a numerical input to the model changes, it will move between bins, for example moving from bin 1.0-4.0 (which decreases) to bin 4.0-8.0 (which increases). As you evaluate the change, you can slice performance metrics (accuracy, RMSE, etc.) by those bins to see if the model itself has any issues with the new distribution.
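A sketch of slicing a performance metric by bin; the feature, label, and prediction data here are entirely hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
feature = rng.uniform(0.0, 12.0, 1_000)
labels = (feature > 6.0).astype(int)   # hypothetical ground truth
preds = (feature > 5.5).astype(int)    # an imperfect model

edges = [0.0, 4.0, 8.0, 12.0]                 # fixed bin edges
bin_ids = np.digitize(feature, edges[1:-1])   # bin index 0, 1, or 2

accuracy_by_bin = {}
for b in range(len(edges) - 1):
    mask = bin_ids == b
    acc = float((labels[mask] == preds[mask]).mean())
    accuracy_by_bin[f"{edges[b]}-{edges[b + 1]}"] = acc
```

In this toy setup all of the disagreement sits in the 4.0-8.0 bin, which is exactly the kind of localized issue per-bin slicing surfaces.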
FIXED DISTANCE:
Fixed-distance bins are easy to set up and easy to analyze. They work best for data that doesn’t have a lot of variation concentrated in a small area relative to the entire distribution, i.e. data that is spread fairly evenly over its range.
QUINTILES:
Quintiles can be used for data that is not evenly distributed. The quintiles are taken from a single distribution, say the reference, and then used to define the knots that all distributions use. This helps ensure each bin region has a similar amount of data. The differing points, or knots, between distributions can make visual comparisons harder for regions with fewer samples in secondary distributions.
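A sketch of deriving quintile knots from a reference distribution and reusing them for all later windows; the skewed sample data here is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
reference = rng.lognormal(0.0, 1.0, 10_000)  # skewed, unevenly distributed

# Knots come from the reference only; every later distribution
# is binned against these same four edges (five bins total).
knots = np.quantile(reference, [0.2, 0.4, 0.6, 0.8])

def bin_proportions(values):
    bin_ids = np.digitize(values, knots)  # bin index 0..4
    return np.bincount(bin_ids, minlength=5) / len(values)

ref_props = bin_proportions(reference)  # ~0.2 per bin by construction
```

Calling `bin_proportions` on a production window with these frozen knots yields proportions that can feed directly into a PSI or JS check.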
CUSTOM:
If you know your data well and have common breakpoints for it, or you want to capture movements between well-defined regions/bins, you can break up your data with custom breakpoints.
Another unique challenge with moving windows is that you want to define bins that don’t change, so you truly have a reference, but you need to do so based on the distributions you have initially. New distributions that show up in the future need to be covered by the bins you previously created.
The example below might represent the purpose of a loan for a fictitious business. One can imagine that post-Covid, loans to cover weddings might jump a large percentage versus previous periods. This example shows how that change would be caught by PSI with a threshold setting of 0.25.
The above changes generate a PSI over 0.25, which in the finance industry would require a model rebuild:
• The delta change of one bin in this case is 13% versus an initial bin value of 4%
• There are three large movements (greater than 5%) between bins of this distribution
The example above shows a smaller change, where the maximum delta is 8% and the PSI value is above 0.1 but below 0.25. This example would fall into the finance industry’s range requiring an investigation of the model, but not above the 0.25 threshold requiring a new model or retrain.