BigML WhizzML Tutorials
Version 1.18
BigML and the BigML logo are trademarks or registered trademarks of BigML, Inc. in the United States
of America, the European Union, and other countries.
BigML Products are protected by US Patent No. 11,586,953 B2; 11,328,220 B2; 9,576,246 B2; 9,558,036
B1; 9,501,540 B2; 9,269,054 B1; 9,098,326 B1, NZ Patent No. 625855, and other patent-pending appli-
cations.
I Beginner 2
1 Model or Ensemble? 3
1.1 Creating the Training and Testing Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Creating Predictors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Evaluating Predictors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 The Whole Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Extension to Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.6 Summing Up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Dataset Transform 6
2.1 Filtered-Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Excluded-Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Present-Percent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Missing-Count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
II Intermediate 11
3 Covariate Shift 12
3.1 Phi-Coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1.1 Comb-Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.2 Ids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.3 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.4 Eval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.5 Avg-Phi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 Comb-Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3 Split-Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.4 Sample-Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.5 Model-Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.6 Avg-Phi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.6.1 All Together: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.7 Multi-Phis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4 Best-K 19
4.1 Code Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
III Advanced 26
5 Gradient Boosting 27
5.1 Algorithm Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
As you have read in the WhizzML Primer document, 1 WhizzML is a powerful tool for automating
Machine Learning (ML) workflows and implementing higher-level ML algorithms.
This document contains tutorials for writing these WhizzML workflows and algorithms. The tutorials
are organized into three levels of difficulty: Part I (Beginner), Part II (Intermediate), and Part III (Advanced).
In the beginner tutorials, we go over each piece of the code in detail, explaining WhizzML language
concepts and its standard library. As the tutorials get more advanced, we leave these details behind to
focus on higher level concepts.
1 https://bigml.com/whizzml
Part I
Beginner
We’ll start with a few tutorials that demonstrate the most basic features of
WhizzML. It might be handy to have the WhizzML Reference nearby as you go
through these examples.
CHAPTER 1
Model or Ensemble?
In this tutorial, we’ll do a simple test to determine whether a single tree or an ensemble is a better fit for
a given dataset. We’ll go over every line of the code here, but you can also grab it from our WhizzML
examples repository 1 on Github.
The idea here is simple: We can use WhizzML to build both single trees and ensembles of trees as
predictors. BigML can also conduct evaluations of predictors on datasets. So our script will:
• Split our dataset into two parts: training and testing
• Train a single tree on the training data
• Train an ensemble of trees on the training data
• Evaluate both on the testing data
• Compare the evaluations to decide which is better

1.1 Creating the Training and Testing Sets
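Both splitting steps rely on a small helper, sample-dataset. Here's a sketch of its shape (the fixed seed string is an assumption; any constant value works):

(define (sample-dataset ds-id rate oob)
  (create-dataset {"origin_dataset" ds-id
                   "sample_rate" rate
                   "out_of_bag" oob
                   "seed" "model-or-ensemble"}))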
As you can see, the sample-dataset function takes three parameters, ds-id, which is the identifier for
the input dataset, rate which is the amount of the data that we want to keep, and oob, a Boolean value
that says whether to use the sample specified, or its complement. That is, if rate is 0.75 and oob is false,
we’ll get 75% of the data in our sample. If oob is true, we’ll get the other 25%.
This is important to our split as we want our training and testing data to be mutually exclusive (so
we can’t “cheat” by training on testing points before the test). In fact, you can see this at work in
split-dataset. Here we return a list of two datasets created by sample-dataset, where the only
1 https://github.com/whizzml/examples/tree/master/model-or-ensemble
parameter we’ve varied is oob. This will give us two mutually exclusive datasets, one of which we’ll use
to train, and the other to test.
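And here's a sketch of split-dataset itself; the shared constant seed inside sample-dataset is what makes the two samples complementary:

(define (split-dataset ds-id rate)
  [(sample-dataset ds-id rate false)
   (sample-dataset ds-id rate true)])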
1.2 Creating Predictors
With the training and testing datasets in place, creating the single tree is a one-liner:

(create-model ds-id)
For the ensemble, we have the corresponding create-ensemble primitive, although in this case we will
want to also provide a configuration parameter: the number of models in the forest, size:
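A sketch of that call (the exact argument shape is an assumption):

(create-ensemble {"dataset" ds-id "number_of_models" size})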
In both cases, the create procedure will first check that the dataset that is used has reached its finished
state (i.e., has status code 5), waiting if necessary for it, and then request the creation of the corresponding
model or ensemble. The result of the call will be the identifier of the new resource.

1.3 Evaluating Predictors
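The evaluations are created with calls of this shape (a sketch; model-id, ensemble-id, and test-id stand for the identifiers created above):

(create-evaluation {"model" model-id "dataset" test-id})
(create-evaluation {"ensemble" ensemble-id "dataset" test-id})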
These calls will return the corresponding evaluation identifiers. We will want to, first, wait until they are
finished, and, once they are completed, fetch them and extract from the full resource map the quantity
that we are going to use to assess the quality of our model or ensemble, namely, the average F-measure.
These two steps are encapsulated in the helper function quality-measure, that we define as follows:
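;; (a sketch reconstructed from the description below)
(define (quality-measure ev-id)
  (let (ev (fetch (wait ev-id)))
    (ev ["result" "model" "average_f_measure"])))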
As you can see, we first call wait with the evaluation’s identifier to make sure the evaluation is finished
before fetching it. wait will eventually return that identifier again if everything is fine, and then we use
that identifier again as the argument of fetch, that recovers the full map of the evaluation. All that is
left is to access the nested field “average_f_measure” within that map:
{"result" {"model" {"average_f_measure" 0.5
"average_phi" 0.6
...}
...}
...}
As you can see, we use the evaluation map itself, ev, as a function to perform the lookup, passing to it
the path to the desired key as a list, ["result" "model" "average_f_measure"].
1.4 The Whole Workflow
The final function looks basically like our itemized list at the top of this tutorial. We first create a dataset
using the input source identifier. We then split this dataset using the split-dataset function above,
which gives us a list of two sub-datasets. We can use the returned list as a procedure to pull out its first
and second elements into train-id and test-id, respectively. We then make our single tree
and our ensemble and evaluate them both, pulling the value for the F-measure out of each evaluation.
Depending on which value for the F-measure is better, we return the identifier of the better one, and
we’re done!
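Putting it all together, a sketch of the whole workflow (the ensemble size and the raise branch are illustrative):

(define (model-or-ensemble src-id)
  (let (ds-id (create-dataset {"source" src-id})
        ids (split-dataset ds-id 0.8)
        train-id (ids 0)
        test-id (ids 1)
        model-id (create-model train-id)
        ensemble-id (create-ensemble {"dataset" train-id
                                      "number_of_models" 20})
        model-f (quality-measure (create-evaluation {"model" model-id
                                                     "dataset" test-id}))
        ensemble-f (quality-measure (create-evaluation {"ensemble" ensemble-id
                                                        "dataset" test-id})))
    (cond (> model-f ensemble-f) model-id
          (> ensemble-f model-f) ensemble-id
          (raise "Model and ensemble are equally good"))))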
The implementation of this function also shows a simple example of error signaling using raise.
1.6 Summing Up
We have seen an example of how we can use WhizzML to do simple model selection with some basic code
and no external libraries. We have written our first simple functions to abstract away implementation
details, such as quality-measure; seen how to create and fetch BigML resources and access their properties;
and tried out simple error handling.
CHAPTER 2
Dataset Transform
We’ll start with a script that will remove a field from a dataset if it has “too much” missing data. As
those of you who have dealt with production data know, sometimes there are fields which are missing a
lot of data. So much so, that we want to ignore the field altogether.
BigML will automatically detect bad fields like this and ignore them if we create a predictive
model. But what if we want to specify the required “completeness” of the data field? How can we eliminate
fields that have less than, say, 95% non-missing data?
We can use WhizzML!
Let’s Do it! Use the WhizzML reference guide 1 if you need it along the way.
To reiterate our goal, we want to write a function that:
1. Given a dataset and a specified threshold (e.g. 0.95)
2. Returns a new dataset with only the fields that are more than 95% populated.
We can define the base function here.
2.1 Filtered-Dataset
What do we want this function to do? We want it to return a new dataset, hence:
(create-and-wait-dataset ...)
But we don’t just want any old dataset, we want one based on our old dataset:
1 https://bigml.com/whizzml
"origin_dataset" dataset-id
And we also want to exclude some fields from our old dataset!
"excluded_fields" (...)
Ah, but which fields do we want to exclude? We can let a new function called excluded-fields figure
that out for us.
But for now, all we need to know is that this new function (excluded-fields) takes two arguments: our
old dataset and our specified threshold.
The line above becomes: (indentation removed for clarity)
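"excluded_fields" (excluded-fields dataset-id threshold)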
As we progress, keep in mind that we want this new function (excluded-fields) to return a list of field
names (e.g. ["field_1" "field_2" "field_3"])
Great! We defined our base function. Now we have to tell our new function, excluded-fields how to
give us the list that we want.
2.2 Excluded-Fields
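Here it is in full, a sketch reconstructed from the walkthrough that follows (the repository version may differ in details):

(define (excluded-fields dataset-id threshold)
  (let (data (fetch dataset-id)
        all-field-names (get data "input_fields")
        total-rows (get data "rows")
        fields (get data "fields"))
    (filter (lambda (field-name)
              (> threshold (present-percent field-name fields total-rows)))
            all-field-names)))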
Whoa, what? You can use that code for reference, but don’t be intimidated. We’ll go over each piece.
First we define the function, declaring its two arguments: our original dataset, and the threshold we want
to use.
Before we write any more code, let’s talk about the meat of this function. We want to look at all the
columns (fields) of this dataset, and find the ones that are missing too much data. We’ll keep the names
of these “bad” fields so that we can exclude them from our new dataset.
To do this, we can use the function filter. It takes two arguments: a predicate and a list. filter will
return a new list composed of the elements in the original list that satisfy the predicate. In our case,
the predicate is that the field has to have at least 95% of the data.
The first argument given to filter is our predicate. The predicate should be a function that either
evaluates to true or false based on each element of the list we pass to it. If the predicate returns true,
then that element of the list is kept. Otherwise, it is thrown out.
We can define this predicate function using lambda.
lambda is like any other function definition. We have to tell it the name of the thing we are passing into it:
(field-name)
In our case, we are checking to see if the threshold is greater than the amount of data present. We will
keep the field-name(s) that do not have enough data. (Because remember, these are the fields that will
be excluded from our new dataset!)
Cool! But two things are still missing from our filter.
1. all-field-names
2. <percent of data that the field has>
How do we get these?
The first isn’t too difficult because BigML datasets have this information readily available. We just have
to fetch it from BigML first.
(fetch dataset-id)
Nice.
To figure out what percent of the rows are populated for a specific field, we get to... define a new
function!
But before we do that, let’s talk about some things we skipped over in our excluded-fields function.
Here it is again, for convenience.
What is let?
let is the preferred method for declaring local variables in WhizzML.
• We set the value of data to the result of (fetch dataset-id).
• We set the value of all-field-names to the result of (get data "input_fields").
• We set the value of total-rows to the result of (get data "rows"). (We didn’t talk about this
yet. It’s one of the values we need to pass to the present-percent function)
let is useful for a couple of reasons in this function. First, we use data twice. So we can avoid the
repetition of writing (fetch dataset-id) twice. Second, naming these variables at the top of the
function makes the rest much easier to read!
So to wrap up this excluded-fields function, let’s talk through what it does again. First, it declares
local variables that we’ll need. Then, it takes the list of all-field-names and filters it based on a
function that checks its “present percent” of data points. We keep the names of the fields that do not
meet our criteria. Cool!
Now, we’ll go over that present-percent function.
2.3 Present-Percent
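A sketch matching the description below:

(define (present-percent field-name fields total-rows)
  (- 1 (/ (missing-count field-name fields) total-rows)))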
We call missing-count to get the number of missing rows for the field. Then we divide that count by
total-rows, which gives us a “missing percent”. We subtract the “missing percent” from one, and that
gives us the “present percent”!
2.4 Missing-Count
missing-count takes two arguments. First, the name of the field we are inspecting (field-name) and
second, the fields object we mentioned earlier. It holds a bunch of information about each of the dataset
fields.
To get the count of missing rows of data from the field, we do this:
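;; (a sketch; the path-style lookup is described just below)
(define (missing-count field-name fields)
  (fields [field-name "summary" "missing_count"]))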
This lets us access an inner value (e.g. 10) from a data object structured like so:
fields = {field-name:
{"summary":
{"missing_count": 10
"tags": ["cool" "fun"]}}
something-else: ...}
And... that’s it! We have now written all the pieces to make our filtered-dataset function work!
All together, the code should look like this:
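;; (assembled from the sketches above; the repository version may differ)
(define (missing-count field-name fields)
  (fields [field-name "summary" "missing_count"]))

(define (present-percent field-name fields total-rows)
  (- 1 (/ (missing-count field-name fields) total-rows)))

(define (excluded-fields dataset-id threshold)
  (let (data (fetch dataset-id)
        all-field-names (get data "input_fields")
        total-rows (get data "rows")
        fields (get data "fields"))
    (filter (lambda (field-name)
              (> threshold (present-percent field-name fields total-rows)))
            all-field-names)))

(define (filtered-dataset dataset-id threshold)
  (create-and-wait-dataset
   {"origin_dataset" dataset-id
    "excluded_fields" (excluded-fields dataset-id threshold)}))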
Part II
Intermediate
Now that we’ve mastered the basics, let’s take a look at some more involved
workflows. Here, we’ll be dealing with multiple resource types and using many
of the standard library functions to accomplish our task.
CHAPTER 3
Covariate Shift
If this is your first time writing WhizzML, we suggest you start with one of the examples in Part I, since
we won’t explain all the details in this walkthrough.
In this tutorial, we’ll write a WhizzML script that automates a process to investigate covariate shift. To
get an understanding of what we’re trying to do, read this article first. 1
Again, use the WhizzML reference guide 2 if you need help along the way.
Our goal is to write a function that:
• Given two datasets (one that represents the data used to train a predictive model, one that repre-
sents production data)
• Returns an indication of whether the distribution of data has changed.
As we read in the article, the indicator of change in our data distribution is called the phi coefficient.
Our WhizzML script will return us this number, so let’s name our base function phi-coefficient.
3.1 Phi-Coefficient
1 http://blog.BigML.com/2014/01/03/simple-machine-learning-to-detect-covariate-shift/
2 https://bigml.com/whizzml
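Here's a sketch of the base function, reconstructed from the walkthrough of its pieces below:

(define (phi-coefficient training-dataset production-dataset seed)
  (let (comb-data (combined-data training-dataset production-dataset)
        ids (split-dataset comb-data 0.8 seed)
        model (create-model {"dataset" (head ids)
                             "objective_field" "Origin"})
        eval (model-evaluation model (ids 1)))
    (avg-phi eval)))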
3.1.1 Comb-Data
comb-data is the result of (combined-data training-dataset production-dataset)
Here we’re combining the two datasets into one big dataset. But before they are combined, we have to
do a transformation on each dataset (add the “Origin” field). We’ll talk about that transformation when
we define combined-data.
The result of our comb-data dataset looks something like this:
field_1 | field_2 | ... | "Origin"
--------|---------|-----|-------------
123     | 124     | ... | "Training"
123     | 124     | ... | "Production"
123     | 124     | ... | "Production"
123     | 124     | ... | "Training"
...     | ...     | ... | ...
3.1.2 Ids
Next, we have a variable called ids. This is a list of dataset ids, the result of calling our split-dataset
function on comb-data.
What split-dataset does is take comb-data (one big dataset) and randomly split it into two datasets.
We split it so that we can train a predictive model with the larger portion of the split, and then evaluate
its performance on the smaller part.
The split-dataset function returns something like this:
["dataset/83bf92b0b38gbgb" "dataset/83hf93gf012bg84b20"]
3.1.3 Model
model is a BigML predictive model resource. We are creating this model from the first element of our
ids list: "dataset" (head ids). The model is built to predict whether the value for the “Origin” field
is “Training” or “Production”. Thus, the creation request sets "objective_field" to "Origin".
3.1.4 Eval
eval is a BigML evaluation resource. To create an evaluation, we need two arguments: a predictive
model and a dataset we want to test the model against. Our model is stored in model and our dataset
is the second element in the ids list, that is, the element in position 1: (ids 1)
3.1.5 Avg-Phi
We’re done with the local variables, but what does the whole phi-coefficient function return - what’s
our end product?
(avg-phi eval)
That line gives us the average phi score for the evaluation we just created. A bunch of information is
stored inside the eval data object that will be retrieved from BigML. But of course, we have to tell the
function avg-phi how to get what we want! We’ll save that for later.
...
So we have built our base function and understand its components. Now we have to go back and build
the functions we haven’t defined yet, specifically comb-data, split-dataset, model-evaluation and
avg-phi. We’ll start with comb-data.
3.2 Comb-Data
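A sketch of the combiner (the helper names follow the description below):

(define (combined-data train-ds prod-ds)
  (create-and-wait-dataset
   {"origin_datasets" [(train-data train-ds)
                       (prod-data prod-ds)]}))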
Again, this function combines two datasets. We tell BigML what datasets we want to combine using the
origin_datasets parameter and passing it a list of dataset ids.
But what are train-data and prod-data?
Those are helper functions that add the “Origin” field we talked about.
• train-data adds the “Origin” field with the value “Training” in each row
• prod-data adds the “Origin” field with the value “Production” in each row
They are defined here:
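;; (sketches; both lean on the shared helper defined next)
(define (train-data ds-id)
  (add-origin-field ds-id "Training"))

(define (prod-data ds-id)
  (add-origin-field ds-id "Production"))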
Since we are doing pretty similar things in both functions, (adding an “Origin” field) we can separate
that logic into its own function. Here it is:
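;; A sketch: extend the dataset with a constant "Origin" column.
;; The new_fields request shape and the helper's name are assumptions.
(define (add-origin-field ds-id value)
  (create-and-wait-dataset
   {"origin_dataset" ds-id
    "all_fields" true
    "new_fields" [{"name" "Origin"
                   "field" (flatline "{{value}}")}]}))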
3.3 Split-Dataset
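A sketch of the splitter (the shared seed is what makes the two samples complementary):

(define (split-dataset ds-id rate seed)
  [(sample-dataset ds-id rate false seed)
   (sample-dataset ds-id rate true seed)])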
• How are we splitting it - 80%/20%? 90%/10%? We can do whatever we want. This is determined
by rate.
• How are we going to shuffle our data before we split it? The seed determines this.
As you can see, we are sampling the same dataset twice. One sample will be used to build a predictive
model, the other will be used to evaluate the predictive model.
sample-dataset is another function. Here it is below:
3.4 Sample-Dataset
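A sketch matching the description below:

(define (sample-dataset ds-id rate oob seed)
  (create-and-wait-dataset {"origin_dataset" ds-id
                            "sample_rate" rate
                            "out_of_bag" oob
                            "seed" seed}))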
This function is what actually interacts with BigML. We create a new dataset, passing in the rate, the
original dataset (dataset-id), whether it is out_of_bag or not (we’ll go over this) and the seed used to
determine how the original dataset was shuffled.
Here’s a little diagram that will help explain how the seed and out_of_bag (oob) work.
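The original figure isn't reproduced here; an illustrative stand-in (rate 0.75 over eight rows):

row:    1   2   3    4   5   6   7    8
label:  x   x   oob  x   x   x   oob  x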
So if out_of_bag is set to true, we grab the rows labeled “oob”. Otherwise, we grab the ones marked “x”.
The seed just changes which rows we label “oob” and “x”. The seed also makes this whole process
deterministic: if you run the phi-coefficient function with the same seed (and the same datasets),
you’ll get the same results!
Cool. That wraps up our sample-dataset and split-dataset functions. Next up, model-evaluation.
3.5 Model-Evaluation
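;; (a sketch; it simply delegates to the standard library)
(define (model-evaluation model-id ds-id)
  (create-and-wait-evaluation {"model" model-id
                               "dataset" ds-id}))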
We apologize if you were hoping for something more exciting. This function is just a wrapper for the
WhizzML built-in create-and-wait-evaluation. As you can see, we are simply creating
an evaluation with a model and a dataset.
Our last function is...
3.6 Avg-Phi
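A sketch, mirroring the quality-measure helper from the first tutorial but pulling the average phi score instead:

(define (avg-phi ev-id)
  ((fetch ev-id) ["result" "model" "average_phi"]))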
...
And there we have it. A WhizzML script that helps predict covariate shift.
But...
As we read in the article, it is best to do this process several times and look at the average of the results.
How could we add some more code to do this programmatically?
Here’s one implementation.
3.7 Multi-Phis
Again, we are giving this function our training-dataset and production-dataset. But we are also
passing in n, which is the number of phi-coefficients we want to calculate.
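Here's a sketch of multi-phis, reconstructed from the walkthrough that follows:

(define (multi-phis training-dataset production-dataset n)
  (loop (seeds (range 0 n)
         out [])
    (if (empty? seeds)
        {"list" out
         "average" (/ (reduce + 0 out) (count out))}
        (recur (tail seeds)
               (append out (phi-coefficient training-dataset
                                            production-dataset
                                            (str "test-" (head seeds))))))))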
As you can see, we are defining a loop.
Within this loop, we set some variables.
• seeds, we give the default (starting) value of (range 0 n)
– If we pass in 4 for the value of n then the initial value of seeds = [0 1 2 3]
• out is our output. We will add the result of a phi-coefficient run each time through the loop.
– Initially, out = []
We also define the end-scenario.
• If seeds is empty, then we return a map with the values list and average. (we’ll explain these in
a bit)
• If seeds is not empty, we go back to the loop, but define values for seeds and out.
• seeds = (tail seeds). This grabs everything but the first element of seeds
– So the first time through, it might be [0 1 2 3], then it will be [1 2 3], then [2 3]. . .
• out = (append out (phi-coefficient ...)) We take the result of our phi-coefficient func-
tion and add it to the out list.
– First time through, it’s [], then [-0.0838], then [-0.0838, 0.1240] . . .
The seed we will use for each of these phi-coefficient runs will be "test-0", "test-1", "test-2" etc.
That’s what (str "test-" (head seeds)) is doing - joining the string "test-" with the first element
of the seeds list.
The last thing we should discuss is the end-case return value:
{"list" out
"average" (/ (reduce + 0 out) (count out))}
The value of “list” (out) is just the list of phi-coefficient values from each run. The “average” is... yep,
the average of all the runs. reduce adds up the elements, count counts the number of elements, and /
divides the first by the second.
You got it! The average of many phi-coefficients between two datasets, to help predict covariate shift.
Example run: 3
Cool
Take a second and think about what you can accomplish now in a few clicks with this WhizzML Script.
1. Make a bunch of predictive models
2. Evaluate their performances
3. Get the knowledge of whether your data characteristics have changed
Sweet!
And even more powerful... the knowledge to automate your own processes!
3 Since we used the same dataset for the Training and Production data, it guesses the wrong value for the “Origin” nearly
every time! That’s why the phi-coefficient values are all close to -1.
CHAPTER 4
Best-K
BigML has an implementation of G-means clustering built into the cluster API. 1 G-means clustering is
an enhancement to K-means clustering that seeks to find the optimal value of K under the assumption
that the neighborhood of points around the centroid of a cluster should have a Gaussian distribution in
a certain sense. 2 In our API, G-means clustering is invoked by omitting the number K of centroids for
K-means clustering, e.g.:
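;; A sketch (the argument map is illustrative; critical_value is optional)
(create-cluster {"dataset" ds-id
                 "critical_value" 5})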
Assuming a priori that clusters in a dataset conform to the assumptions of the G-means algorithm may
be inappropriate. In addition, the G-means method for determining K incrementally fissions clusters and
adjusts the resulting clusters appropriately, so that in many cases not every value of K is considered.
This proclivity of the algorithm to split a cluster into two clusters is determined by the critical_value
parameter, but frequently there aren’t principled ways to select its value.
Fortunately, using WhizzML we can implement an alternative to G-means for determining K based on the
Pham-Dimov-Nguyen algorithm. 3 Pham, Dimov, and Nguyen define a measure of concentration f(K) on
a K-means clustering and use it as an evaluation function to determine the best K. This tutorial presents
an example implementation of the PDN-based algorithm that finds the best number K of centroids for
K-means clustering in an arbitrary range Kmin to Kmax.
The full code for this tutorial is available in our whizzml examples repository 4 on Github.
1 https://bigml.com/developers/clusters
2 https://blog.bigml.com/2015/02/24/divining-the-k-in-k-means-clustering/
3 http://www.ee.columbia.edu/~dpwe/papers/PhamDN05-kmeans.pdf
4 https://github.com/whizzml/examples/tree/master/best-k
4.1 Code Overview
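A sketch of generate-clusters; we use map with create-cluster plus wait* in place of the create*/wait* idiom the text mentions, and the name argument is illustrative:

(define (generate-clusters dataset args k-min k-max)
  (let (fargs (lambda (k)
                (assoc args "dataset" dataset
                            "k" k
                            "name" (str "cluster k=" k)))
        ids (map create-cluster (map fargs (range k-min (+ 1 k-max))))
        ids (wait* ids))
    (map fetch ids)))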
Inputs:
• dataset: (string) Dataset identifier of the dataset to be clustered
• args: (map) Arguments for the cluster operation
• k-min: (number) Minimum value of K
• k-max: (number) Maximum value of K
Output: (list) Cluster metadata for created clusters
We begin with a function generate-clusters to create a collection of K-means clusterings for an arbi-
trary range k-min ≤ K ≤ k-max of centroids. The input map args for the WhizzML cluster function is
expanded to fargs with the input dataset, the number K of centroids, and the name for each cluster
instance. After setting up the list of arguments for generating each cluster in the collection, the routine
uses the common idiom create*, wait* to generate the returned set of candidate BigML Cluster objects
in parallel.5
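A sketch of extract-eval-data (the exact metadata paths are assumptions):

(define (extract-eval-data cluster)
  (let (res (if (string? cluster) (fetch cluster) cluster))
    {"id" (res "resource")
     "k" (res "k")
     "n" (count (res "input_fields"))
     "within_ss" (res ["clusters" "within_ss"])
     "total_ss" (res ["clusters" "total_ss"])}))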
Inputs:
• cluster: (string) Cluster identifier of the cluster
Output: (map) Cluster metadata used to compute the evaluation function f (K)
To simplify the implementation of the PDN algorithm, we next define a helper function to extract certain
metadata items from the full metadata for a cluster. The metadata returned by extract-eval-data are
just the items needed to compute the PDN evaluation function f (K). These include the number K of
centroids in the K-means clustering, the number of covariates n considered in the K-means computation,
the total sum-squared distance between the items in the cluster and the cluster centroid for all clusters
Sk (within_ss), and finally S1 available in the metadata for every cluster (total_ss).
We could have included this helper function in the generate-clusters function. We chose here to define
it separately for illustrative purposes. Users may find other application-specific alternatives.
(define (alpha-func n)
(let (alpha_2 (- 1 (/ 3 (* 4 n)))
w (/ 5 6))
(lambda (k)
(if (<= k 2)
alpha_2
(+ (* (pow w (- k 2)) alpha_2) (- 1 (pow w (- k 2))))))))
Inputs:
• n: (number) Number of covariates
Output: (function) Weighting function α(K)
5 The reader might have observed that this routine illustrates how WhizzML is not a pure functional language. The primary
intent of WhizzML is to orchestrate BigML operations through side effects such as, in this case, creating BigML clusters
from a BigML dataset.
As discussed in more detail below, the Pham-Dimov-Nguyen algorithm is based on an evaluation function
f (K) that includes a weighting α(K) function parameterized on the number of covariates n. The weighting
function in the Pham-Dimov-Nguyen paper is in recursive form. This factory function returns the closed-
form equivalent:
α(K) =
    1 − 3/(4n)                                  if K = 2
    (5/6)^(K−2) α(2) + 1 − (5/6)^(K−2)          if K > 2
In the same manner as we do next with evaluation-func, we can implement α(K) as the partial
application of a function α(K, n) for n. We realize this partial function application as the factory function
alpha-func. This factory function defines a closure that captures n and returns an anonymous
function lambda(k) as α(K).
(define (evaluation-func n)
(let (fa (alpha-func n))
(lambda (k sk skm)
(if (or (<= k 1) (not skm) (zero? skm))
1
(/ sk (* (fa k) skm))))))
Inputs:
• n: (number) Number of covariates
Output: (function) Evaluation function f(K, SK, SK−1)
The Pham-Dimov-Nguyen approach to finding the best K has at its core an evaluation function f (K). The
version in the Pham-Dimov-Nguyen paper is a function of a single argument K that internally includes
SK and SK−1 (the within_ss field of the cluster metadata map returned by extract-eval-data). The
form returned by this factory function has the SK and SK−1 values as arguments.
f(K, S_K, S_(K−1)) =
    1                        if K = 1, or S_(K−1) = 0, or S_(K−1) is undefined
    S_K / [α(K) S_(K−1)]     otherwise
Following the conventions of functional programming languages we can implement f (K, SK , SK−1 ) as the
partial application of a function f (K, SK , SK−1 , n) for n.
This factory function evaluation-func defines a closure that in turn includes n and returns an anony-
mous function lambda(k sk skm) as f (K, SK , SK−1 ). Note also that the weighting function α(K) in
the PDN evaluation f (K) could have been a subfunction inside this evaluation-func. As discussed
previously, for illustrative purposes we instead have implemented it as the returned result of a separate
factory function alpha-func.
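A sketch of evaluate-clusters matching the description below:

(define (evaluate-clusters clusters)
  (let (cmdata (map extract-eval-data clusters)
        fe (evaluation-func ((head cmdata) "n")))
    (loop (in cmdata
           out []
           ckz false)
      (if (empty? in)
          out
          (let (ck (head in)
                skm (if ckz (ckz "within_ss") false)
                fk (fe (ck "k") (ck "within_ss") skm))
            (recur (tail in)
                   (append out (assoc ck "fk" fk))
                   ck))))))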
Inputs:
• clusters: (list) Cluster metadata maps ordered by K
Output: (list) Sequence of maps that have the field fk with the value f (K, SK , SK−1 ) added
Having defined a number of component functions, we pull them together in evaluate-clusters. This
function applies the Pham-Dimov-Nguyen evaluation function f (K) to a list that ranges over K of the
K-means clusters for a dataset. The result is a list over K of items that include the value of f (K).
In more detail, evaluate-clusters applies the evaluation function f (K, SK , SK−1 ) returned by evaluation-func
as fe to the cluster metadata returned by extract-eval-data as cmdata. The body of the function is
a loop that iterates over the in list of metadata maps returned by extract-eval-data and sequentially
builds the out list of metadata maps. Each map in the out list is the source map from the in list augmented
with the value f(K, SK, SK−1) as fk of the evaluation function fe applied to the data values K
and SK (within_ss) from the input map. In addition, the input cluster metadata map ck to an iteration
of the loop is passed back into the next iteration as the source ckz for the value of SK−1 .
Inputs:
• evaluations: (list) Sequence of maps of evaluation results for the clusters
• cluster-id: (string) Cluster to save (not delete)
• logf: (boolean) Flag to enable logging
Output: (string) Returns the cluster-id supplied as an input.
Before we define the final, top-level functions we define two more helper functions. The first is this
straightforward helper function clean-clusters that deletes the BigML cluster objects created as inter-
mediate computation results, except for the cluster specified by cluster-id.
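A sketch under those assumptions:

(define (clean-clusters evaluations cluster-id logf)
  (let (ids (map (lambda (ev) (ev "id")) evaluations)
        del-ids (filter (lambda (id) (not (= id cluster-id))) ids)
        _ (if logf (log-info "Deleting clusters " del-ids) false)
        _ (map delete del-ids))
    cluster-id))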
Inputs:
• dataset: (string) Identifier of dataset to be processed with the cluster operation.
• cluster-arg: (map) Arguments for cluster function
• k: (number) Number of centroids for cluster operation
Output: (string) Returns the cluster-id of the created cluster.
The second helper function best-cluster is also required by some of our top-level functions. It performs
a single K-means clustering with the specified number K of centroids. The input map args for the
WhizzML cluster function is expanded to ckargs with the identifier of the input dataset, the number
K of centroids k, and the name for the cluster instance.
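A sketch matching that description (the name argument is illustrative):

(define (best-cluster dataset cluster-args k)
  (let (ckargs (assoc cluster-args "dataset" dataset
                                   "k" k
                                   "name" (str "best cluster k=" k)))
    (create-and-wait-cluster ckargs)))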
Inputs:
• dataset: (string) identifier of dataset to be processed with the cluster operation.
• args: (map) Arguments for cluster function
• k-min: (number) Minimum value of K
• k-max: (number) Maximum value of K
• clean: (boolean) Flag to delete all but the optimal cluster
• logf: (boolean) Flag to enable logging
Output: (list) Pham-Dimov-Nguyen evaluations of the clusters.
This function combines the generate-clusters and evaluate-clusters functions to create a list of
K-means clusters with a range of centroids and a list of metadata maps that include the Pham-Dimov-
Nguyen evaluation function f (K) values for those clusters.
The args argument is a map that one can use to optionally specify all of the parameters for the K-means
cluster function except the dataset, the number of centroids k, and the cluster name parameters. (See
the BigML developer documentation 6 for details.) The dataset, args, k-min, and k-max are passed
directly to the generate-clusters function. The list of cluster metadata for the clusters it creates is
passed directly to the evaluate-clusters function.
This function can be used as a top-level function to just return the list of Pham-Dimov-Nguyen evaluation
function f(K) results over the specified range of centroids k-min to k-max (inclusive). When called
as a top-level function, the clean parameter can be specified as true to automatically delete the BigML
cluster objects created after the PDN evaluations are computed. Setting the boolean parameter logf to
true causes generation of log data.
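A sketch of this function; the name evaluate-k-means is an assumption:

(define (evaluate-k-means dataset args k-min k-max clean logf)
  (let (evaluations (evaluate-clusters (generate-clusters dataset args
                                                          k-min k-max))
        _ (if logf (log-info "PDN evaluations " evaluations) false)
        _ (if clean
              (map delete (map (lambda (ev) (ev "id")) evaluations))
              false))
    evaluations))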
Inputs:
• dataset: (string) identifier of dataset to be processed with the cluster operation.
• cluster-arg: (map) Arguments for cluster function
• k-min: (number) Minimum value of K
• k-max: (number) Maximum value of K
• clean: (boolean) Flag to delete all but the optimal cluster
• logf: (boolean) Flag to enable logging
Output: (string) Identifier of the cluster with the best K
6 https://bigml.com/developers/clusters#cl_cluster_arguments
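The description of best-k-means itself didn't survive extraction; a sketch under the pieces above, where it picks the K minimizing f(K) and builds the final cluster with best-cluster (the repository version also accepts a separate best-args map for the final clustering, which we fold into cluster-arg here for brevity):

(define (best-k-means dataset cluster-arg k-min k-max clean logf)
  (let (evs (evaluate-k-means dataset cluster-arg k-min k-max false logf)
        best (reduce (lambda (a b) (if (<= (a "fk") (b "fk")) a b))
                     (head evs)
                     (tail evs))
        best-id (best-cluster dataset cluster-arg (best "k"))
        _ (if clean (clean-clusters evs best-id logf) false))
    best-id))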
Inputs:
• dataset: (string) Identifier of dataset to be processed with the cluster operation.
• cluster-arg: (map) Arguments for cluster function
• k-min: (number) Minimum value of K
• k-max: (number) Maximum value of K
• clean: (boolean) Flag to delete all but the optimal cluster
• logf: (boolean) Flag to enable logging
Output: (string) Identifier of created batchcentroid
This final top-level routine first uses the best-k-means function to generate the best K-means clustering
as determined by the Pham-Dimov-Nguyen evaluation function f(K). It then creates a BigML batchcentroid
object and a BigML dataset annotating the supplied dataset with the cluster numbers of the best K-means
clustering.
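A sketch of the routine (argument shapes assumed):

(define (best-batchcentroid dataset cluster-arg k-min k-max clean logf)
  (let (cluster-id (best-k-means dataset cluster-arg k-min k-max clean logf))
    (create-and-wait-batchcentroid {"cluster" cluster-id
                                    "dataset" dataset
                                    "all_fields" true
                                    "output_dataset" true})))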
4.2 Examples
We conclude with a few brief examples of how to use the top-level functions and some additional appli-
cation suggestions.
7 https://bigml.com/developers/clusters#cl_cluster_arguments
Using the Top-Level Functions
The main top-level function can be called to generate the best BigML batchcentroid and an annotated
BigML dataset for an input dataset with the identifier ds-id as:
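;; (illustrative arguments, following the sketched signatures)
(best-batchcentroid ds-id {} 2 10 true false)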
If only the best K-means BigML cluster object is required, perhaps for evaluating different ranges of
k-min to k-max, one can use the top-level function:
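;; (illustrative arguments)
(best-k-means ds-id {} 2 10 true false)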
Finally, if one only seeks the list of Pham-Dimov-Nguyen evaluation function f (K) results for the K-means
clusterings for k-min to k-max, one can use the top-level function:
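;; (illustrative arguments)
(evaluate-k-means ds-id {} 2 10 false false)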
Specifying clean as true causes these top-level functions to delete the intermediate BigML datasets
created during the computation. Other information generated during execution of the functions can be
logged by specifying logf as true.
Using args and best-args
The input args can be used to specify a custom map of arguments for the BigML cluster function that
generates the collection of clusters for k-min to k-max in the evaluation phase. Similarly, the input
best-args in the best-k-means and best-batchcentroid functions can be used to specify a custom map
of arguments for the K-means cluster operation with the best number K of centroids.
As an example, suppose we have a dataset that has two fields depvar1 and depvar2 that we consider
to be dependent variables predicted from the remaining fields of the dataset. We’d like to discover the
best number of clusters when only the remaining fields are considered by the K-means algorithm. In
addition, we believe we can get a reasonable estimate for the best K if we only use 20% of the dataset in
the evaluation phase. We can limit the K-means cluster computations according to these constraints by
specifying
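;; a sketch of such an args map (field names from the example; the
;; sampling parameters are standard cluster-creation arguments)
{"excluded_fields" ["depvar1" "depvar2"]
 "sample_rate" 0.2}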
Segmented Searches for the Best K
In some cases we might have a first guess for the best K. When we run the evaluations over an initial
range k-min to k-max, we may find that the values of f(K) over that range do not exhibit a clear and
unambiguous minimum. Rather than re-run the evaluations over the whole initial range, we can simply
run them over additional subranges.
For instance, we can run:
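;; a sketch: probe two subranges around an initial guess for K
(evaluate-k-means ds-id {} 4 7 false false)
(evaluate-k-means ds-id {} 9 12 false false)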
Part III
Advanced
The purpose of the Advanced tutorials is to demonstrate porting high-level
Machine Learning algorithms to the WhizzML language. Writing the scripts in
this section requires a thorough understanding of Machine Learning concepts.
To this end, lower level WhizzML concepts will not be explained in detail here.
For more basic instruction regarding the WhizzML language, refer to the Part I
or Part II tutorials, or the WhizzML Language Reference guide.
CHAPTER 5
Gradient Boosting
Gradient tree boosting has become one of the more popular algorithms in the machine learning world,
thanks in part to a number of open-source software packages as well as some high profile public successes.
The algorithm, like bagged classifiers and random forests, is an ensemble of weak learners combined to
make a strong one. The primary difference with boosting is that each successive weak model is, in a
sense, trained on the mistakes of the previously learned models. With gradient boosting, a gradient is
calculated with respect to the current model collection, and each model represents a gradient “step” in
that direction.
Because the gradient steps are expressed as trees, one way to code this algorithm is using WhizzML to
specify and string together the various steps in the algorithm, and allow BigML’s computing infrastructure
to handle the heavy lifting of calculation over all points in the dataset and the individual modeling steps.
The full code for this tutorial is available in our whizzml examples repository 1 on Github.
Advanced Tutorial Alert! This tutorial is pretty advanced. It assumes a significant level of comfort
with WhizzML and with a bit of high-ish level mathematics. If you feel like you’re in over your head,
you can always head back to the Part I or the Part II sections to get your feet on firmer ground before
you continue here.
It might also be helpful to have some of the other WhizzML reference material 2 on hand to help you as
you follow along.
1 https://github.com/whizzml/examples/tree/master/gradient-boosting
2 https://bigml.com/whizzml
We then train a model (rather, k models, one for each class) to approximate this difference at each of the
training points, and then use the models to predict the value on the training data. Why is this necessary,
given that we already know the value of the objective at these points (that’s how we trained the models!)?
Because it’s important that we don’t overfit the training data, we’ll learn a “shallow” tree, so the predictions
will be different from the true values, and should be a good proxy for the predictions we’d see on new data.
With the predictions for this gradient “step” in hand, we can sum them onto our current running sum of
gradient steps, then use the softmax transform to turn these sums into probabilities. We use the softmax
function here for more than just convenience; the objective function that we glossed over two paragraphs
ago has this softmax operation as an important part of it, so not using it here would be mathematically
invalid.
With these probabilities in hand, we can recompute the gradient at the training points and iterate the
algorithm. Generally, the algorithm stops when you reach some predetermined number of iterations or
cease to make improvement for a while (we’ll do the latter here).
You’re left finally with an m × k matrix of models, where m is the number of gradient steps you took
and k is the number of classes. To predict probabilities for a new point, predict a score for that point
with each model, sum these scores “down the columns” so you’re left with k sums, one for each class,
then apply the softmax transform to get your per-class probabilities.
Yes, this is a lot of information. And we’re ignoring quite a bit; we’re not going to mess with learning
rates, shrinkage, or regularization. But you could if you wanted! Check out these slides 3 for more
information if you want to dig deeper.
All right, enough exposition. Let’s write some code.
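The code below leans on a couple of naming helpers that we sketch first; the boost-id prefix and the suffix strings are assumptions, not the exact repository values:

;; Prefix marking every column this script generates
(define boost-id "__bmlboost")

;; One unique field name per class for a given iteration and field type
(define (field-names nclasses iteration type)
  (map (lambda (c) (str boost-id "_" type "_" iteration "_" c))
       (range nclasses)))

;; The names for the softmax (probability) fields at iteration `iteration`
(define (softmax-names nclasses iteration)
  (field-names nclasses iteration "softmax"))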
;; The names for the fields containing the total scores (the running
;; sum of all gradient steps) at iteration `iteration`
(define (sum-names nclasses iteration)
(field-names nclasses iteration "sum"))
;; The names for the fields containing the scores at iteration `iteration`
;; (completing the truncated extraction; the "prediction" suffix is assumed)
(define (pred-names nclasses iteration)
  (field-names nclasses iteration "prediction"))
3 https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
;; The field name for the gradients (the objective for each class) at
;; each iteration
(define (grad-names nclasses iteration)
(field-names nclasses iteration "gradient"))
As we make our way through the algorithm, we’re going to create a lot of new columns in our dataset.
We’ll create new columns for the gradients at each step, the running sums at each step, and the predictions
of the gradient trees at each step. These functions will return unique names for each type of column; one
for each class at any given iteration of the algorithm.
Here’s how we’ll actually create all of those new columns. The top function make-fields takes a list
of column names and a list of flatline expressions. It then creates a list of maps with the names and
expressions in each one. The bottom function takes this list, puts it into a BigML resource request, and
creates a copy of the argument dataset with the new columns appended. If input-ids is specified, it
retains only those fields from the original dataset.
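Sketches of those two helpers (the request shapes are assumptions):

(define (make-fields names exprs)
  (map (lambda (i) {"name" (names i) "field" (exprs i)})
       (range (count names))))

(define (add-fields dataset new-fields input-ids)
  (let (req {"origin_dataset" dataset "new_fields" new-fields}
        ;; a non-empty input-ids keeps only those original fields
        req (if (empty? input-ids)
                req
                (assoc req "input_fields" input-ids)))
    (create-and-wait-dataset req)))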
Wait, you haven’t heard of flatline? That’s BigML’s DSL for dataset transformation. As you’ll see, using
WhizzML to compose flatline expressions that transform your data allows you to do a ton of interesting
things. You can get more familiar with flatline by perusing the user’s guide here. 4
;; Get the original input fields from the dataset, to make sure we use
;; the same fields to learn at each iteration.
(define (get-inputs fields)
(let (not-generated? (lambda (astr) (not (contains-string? boost-id astr)))
is-input? (lambda (fid) (not-generated? ((fields fid) "name"))))
(filter is-input? (keys fields))))
;; Get the total number of classes for the problem from the field
;; descriptor
(define (get-num-classes dataset obj-id)
(let (obj ((get-fields dataset) obj-id))
    ;; (completing the truncated extraction; the categories path is assumed)
    (count (obj ["summary" "categories"]))))
4 https://github.com/bigmlcom/flatline/blob/master/user-manual.md
Finally, a few miscellaneous helpers: get-num-classes does what it says on the box and gets the number of
classes in the objective field. get-inputs gets the original input fields for the dataset, a handy thing to
know when you’re adding a whole bunch of columns at each iteration. get-objectives gets the field ids
of the objective fields (one for each class), that is, the gradient columns at a given iteration.
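Two more helpers are referenced here and in the code below but didn't survive extraction; sketches (names and paths are assumptions):

;; The names of the one-hot ground-truth columns, one per class
(define (truth-names nclasses)
  (field-names nclasses 0 "truth"))

;; Field ids whose names are the gradient columns for this iteration
(define (get-objectives fields nclasses iteration)
  (let (names (grad-names nclasses iteration)
        id-of (lambda (n)
                (head (filter (lambda (fid) (= ((fields fid) "name") n))
                              (keys fields)))))
    (map id-of names)))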
;; Compute the gradient given the ground truth fields and the current
;; probabilities
(define (compute-gradient dataset nclasses iteration)
(let (next-names (grad-names nclasses iteration)
preds (if (> iteration 0)
(map (lambda (n) (flatline "(f {{n}})"))
(softmax-names nclasses iteration))
(repeat nclasses (str (/ 1 nclasses))))
tns (truth-names nclasses)
fexp (lambda (idx)
(let (actual (tns idx)
predicted (preds idx))
(flatline "(- (f {{actual}}) {predicted})")))
new-fields (make-fields next-names (map fexp (range nclasses))))
(add-fields dataset new-fields [])))
We first get the names of the columns where we’re going to put the gradient values in next-names.
Then we get the currently predicted probabilities for each class. If it’s an iteration other than the
first, we compose a flatline expression that pulls out the value from the appropriate column, given by
softmax-names.
We can then write a little internal function, fexp, which subtracts the two column values. Of course, we
have to (map fexp (range nclasses)) so that we get an expression for each class, then use new-fields
and add-fields to add them to the dataset.
;; Learn a set of trees over the objective fields, one for each class
(define (learn-trees dataset nclasses iteration)
(let (fs (get-fields dataset)
iids (get-inputs fs)
oids (get-objectives fs nclasses iteration)
req {"dataset" dataset "input_fields" iids}
create (lambda (oid) (create-model (assoc req "objective_field" oid)))
ids (map create oids)
_ (wait-forever* ids))
ids))
Here we get the input and objectives for each model (remember, get-inputs returns the original input
fields for the data, minus the fields we’ve generated ourselves). The input fields will always be the same,
but we will learn one model per objective field (that is, one per class).
We do this with the inline function create, which takes the template req for the model request and adds
an objective field. We then (map create oids) to create one model for each objective in oids and wait
for them to complete.
;; Predict the value of the gradient for all points in the dataset
;; Need to predict one at a time so we can preserve all fields
(define (batch-predict dataset iteration mod-ids)
(let (pnames (pred-names (count mod-ids) iteration))
(loop (last-ds dataset mids mod-ids names pnames)
(if (empty? mids)
last-ds
(let (req {"all_fields" true
"output_dataset" true
"model" (head mids)
"dataset" last-ds
"prediction_name" (head names)}
bp (create-and-wait-batchprediction req)
new-ds ((fetch bp) "output_dataset_resource")
_ (wait-forever new-ds))
(recur new-ds (tail mids) (tail names)))))))
This code works by first getting names for the columns where we’ll put the predictions (pnames). We
then loop over these names and the corresponding model identifiers, creating a batchprediction for each
one. We’ll do these serially, as we want to preserve predictions from previous models when we make a
new one, so the dataset created by the first batchprediction will provide the input fields for the second,
and so on for each model.
Note that we have to create-and-wait-batchprediction and later wait separately for the created
dataset; the readiness of the batch prediction does not imply readiness for the output dataset resource.
;; Sum the last set of predictions with the current set of sums to get
;; new scores
(define (create-sums dataset nclasses iteration)
(let (this-preds (pred-names nclasses iteration)
this-sums (sum-names nclasses iteration)
last-sums (if (> iteration 1) (sum-names nclasses (- iteration 1)) [])
fexp (lambda (idx)
(let (this-pred (this-preds idx))
(if (empty? last-sums)
(flatline "(f {{this-pred}})")
(let (last-sum (last-sums idx))
(flatline "(+ (f {{this-pred}}) (f {{last-sum}}))")))))
new-fields (make-fields this-sums (map fexp (range nclasses))))
(add-fields dataset new-fields [])))
This looks a lot like the function we used to compute the gradient and follows the same general pattern.
First we pull out the names of the “source” and “destination” columns for the flatline operation (in this
case, last-sums and this-preds, and this-sums, respectively).
We then create a function fexp that will take a class index i, pick out the ith name from each of the
aforementioned lists, and compose the flatline expression that adds the two columns. Of course, if we’re
on the first iteration of the algorithm ((empty? last-sums)) then the current sum is just the current
prediction.
We can then (map fexp (range nclasses)) to do this for each class, resulting in a new running sum for
each class. Finally, we use make-fields and add-fields to add the list of new columns to the dataset.
p_i = e^(s_i) / Σ_j e^(s_j)
so that the probability vector is the vector of scores exponentiated, then normalized.
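A sketch of the softmax step (the helper names fl-exp, exp-sum, and fexp follow the description below; the flatline details are assumptions):

(define (create-softmax dataset nclasses iteration)
  (let (this-sums (sum-names nclasses iteration)
        this-softmaxs (softmax-names nclasses iteration)
        fl-exp (lambda (n) (flatline "(exp (f {{n}}))"))
        exps (map fl-exp this-sums)
        exp-sum (reduce (lambda (a b) (str a " " b)) (head exps) (tail exps))
        fexp (lambda (idx)
               (let (num (exps idx))
                 (flatline "(/ {num} (+ {exp-sum}))")))
        new-fields (make-fields this-softmaxs (map fexp (range nclasses))))
    (add-fields dataset new-fields [])))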
Again, we rely on flatline to do the heavy lifting. We get the source and destination columns as before in
this-sums and this-softmaxs. We then compose a flatline expression equivalent to the above equation.
Note that we do it by parts here for convenience, first defining an “exponentiator” in fl-exp, then the
denominator in exp-sum, and finally the closure that will make the final expression for each column in
fexp. One more invocation of new-fields and add-fields and we have our new probabilities.
Here we assume that we start with trained models, and execute an entire iteration using the functions
defined above, so that we come all the way back to the point where we can train models again.
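A sketch of that single-iteration function (the iteration bookkeeping is illustrative):

(define (boost-iteration dataset nclasses iteration mod-ids)
  (let (pred-ds (batch-predict dataset iteration mod-ids)
        sum-ds (create-sums pred-ds nclasses iteration))
    (compute-gradient (create-softmax sum-ds nclasses iteration)
                      nclasses
                      iteration)))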
All that’s left is to iterate that function until a stopping condition is met, along with a few setup and
teardown steps.
It seems like a lot, but the main bits are the stopping condition (we keep track of the last three total
gradient magnitudes and stop if they represent less than 1% of the total improvement) and the fact
that we learn over a bootstrap sample of our original dataset. That is, at each iteration we learn over
some of the dataset and check improvement over the other part. As such, we have to call create-fields
on both the training and the test sets.
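A compact sketch of the driver; a fixed iteration budget stands in for the improvement-based stopping rule and bootstrap bookkeeping described above:

(define (gradient-boost dataset nclasses iterations)
  (loop (ds (compute-gradient dataset nclasses 0)
         iteration 0
         models [])
    (let (mod-ids (learn-trees ds nclasses iteration)
          next-iter (+ iteration 1)
          next-ds (boost-iteration ds nclasses next-iter mod-ids)
          ;; one row of k models per gradient step
          models (append models mod-ids))
      (if (>= next-iter iterations)
          models
          (recur next-ds next-iter models)))))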
5.9 Conclusion
We’ve implemented a “vanilla” version of gradient tree boosting in WhizzML. Hopefully, we’ve proven
along the way that it’s possible to implement many complex machine learning algorithms in WhizzML,
and thereby gain the power of BigML’s infrastructure behind your implementation.
CHAPTER 6
Anomaly-Based Covariate Shift
Detecting Covariate Shift and Dataset Shift in new production data relative to previously trained predic-
tive models is always a challenge. BigML has discussed a method for doing this 1 based on the Matthews
Correlation Coefficient (Phi Coefficient) computed from the confusion matrix for a predictive model.
A tutorial also describes a WhizzML package for implementing this method (see the "Covariate Shift"
section). This tutorial describes an alternative method, easily implemented in WhizzML, that can be
useful for some types of data.
In the alternative method we describe here, we use the WhizzML anomaly detection functions. The
method computes an average anomaly score of the production dataset relative to the model training
dataset as a measure of the covariate shift between the training dataset and the production dataset. An
anomaly detector is trained from the same dataset used to train the model. This anomaly detector is
then used to derive a batch anomaly score for the production dataset. Finally, the average value of that
batch anomaly score is computed as an indicator of covariate shift.
The full code for this tutorial is available in our whizzml examples repository 2 on Github.
Inputs:
• dst-id: (string) ID of the dataset to be sampled.
• rate: (float) A value between 0 and 1 that specifies the size of the bagged sample. For
example, 0.8 means that 80% of original dataset is in the bagged sample.
• oob: (boolean) Selects whether we want the bagged (false) chunk of data or the out of bag
(true) chunk. For example, if the rate is 0.75, and oob is false, we get 75% of the data. If
oob is true, we get the other 25%.
• seed: (string) A string used to make the sampling deterministic (repeatable)
Output: (string) ID of the new dataset object.
1 https://blog.bigml.com/2014/01/03/simple-machine-learning-to-detect-covariate-shift/
2 https://github.com/whizzml/examples/tree/master/anomaly-shift
To build a practical anomaly-based covariate shift detector, we need to derive samples from the anomaly
detector training dataset and the production dataset. This helper routine wraps the basic dataset creation
function to isolate the configuration of that function, so that it can be easily modified to support other
sampling options in other applications. This basic version simply creates a deterministic sample, determined
by the seed parameter, of a size determined jointly by rate and oob, from the source dataset specified by dst-id.
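A sketch matching that description:

(define (sample-dataset dst-id rate oob seed)
  (create-and-wait-dataset {"origin_dataset" dst-id
                            "sample_rate" rate
                            "out_of_bag" oob
                            "seed" seed}))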
Inputs:
• anomaly-id: (string) ID of the anomaly detector object.
• dst-id: (string) ID of the production dataset object.
Output: (string) ID of the created batchanomalyscore object.
We also will need to apply an anomaly detector derived from a training dataset to a production dataset.
As with sample-dataset, this helper routine wraps the WhizzML batch anomaly score computation
function to isolate the configuration information so that it can be easily modified to use other evaluation
options. This basic version simply applies the anomaly detector specified by anomaly-id to the dataset
specified by dst-id and returns a BigML batchanomalyscore object. Because we specify all_fields
and output_dataset as true in the WhizzML function to create the batchanomalyscore object, the
returned metadata includes a reference to a dataset that includes the original production dataset with each
item annotated with an additional member that contains the anomaly score for the item and additional
metadata that includes the anomaly score results for the whole dataset.
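A sketch of the wrapper:

(define (anomaly-evaluation anomaly-id dst-id)
  (create-and-wait-batchanomalyscore {"anomaly" anomaly-id
                                      "dataset" dst-id
                                      "all_fields" true
                                      "output_dataset" true}))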
Inputs:
• evdst-id: (string) ID of the batch anomaly score dataset object.
Output: (float) Average batch anomaly score from the results in the batch anomaly score dataset
object.
Finally, we encapsulate the computation of the average anomaly score for the production dataset in a
helper function that computes the average anomaly score from the annotated production dataset object
returned by anomaly-evaluation. The metadata for this augmented production dataset object includes
an ["objective_field" "id"] member that contains the name of the score field, that is, the field holding
the anomaly score results for the production dataset. The quotient of the members
["fields" score-field "summary" "sum"] and ["fields" score-field "summary" "population"]
for that score field is returned as the average batch anomaly score for the
production dataset object evdst-id.
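A sketch following those paths exactly:

(define (avg-anomaly evdst-id)
  (let (dst (fetch (wait evdst-id))
        score-field (dst ["objective_field" "id"])
        sum (dst ["fields" score-field "summary" "sum"])
        population (dst ["fields" score-field "summary" "population"]))
    (/ sum population)))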
Inputs:
• train-dst: (string) ID of the training dataset.
• train-exc: (list) Fields to exclude from the training dataset.
• prod-dst: (string) ID of the production dataset.
• prod-exc: (list) Fields to exclude from the production dataset.
• seed: (string) A string used to make the sampling deterministic (see the "sample-dataset"
function).
• clean: (boolean) Delete intermediate datasets before exiting the function.
Output: (float) The average anomaly score between 0 and 1.
This top-level function combines the previous helper functions to compute the average anomaly score for
a single production dataset relative to an anomaly detector built from the training dataset.
The function accepts a training dataset train-dst and a production dataset prod-dst, along with a
seed string to force deterministic sampling of both. An 80% sample of the training dataset, traino-dst,
is used to train an anomaly detector anomaly, ignoring the fields in the list train-exc. A 20% sample of
the production dataset, prodo-dst, is then evaluated with the anomaly detector to produce a BigML
batchanomalyscore object ev-id. The batchanomalyscore ev-id metadata has the ID evdst-id of the
annotated version of the production dataset. The metadata for this dataset includes the anomaly score
information used by avg-anomaly to compute an average anomaly score for the sample of the production
dataset.
Finally, if clean is specified as true the intermediate objects created by the function are deleted before
the function returns the score.
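A sketch of this function (cleanup details are simplified, and prod-exc is accepted for parity but unused here):

(define (anomaly-measure train-dst train-exc prod-dst prod-exc seed clean)
  (let (traino-dst (sample-dataset train-dst 0.8 false seed)
        anomaly (create-and-wait-anomaly {"dataset" traino-dst
                                          "excluded_fields" train-exc})
        prodo-dst (sample-dataset prod-dst 0.8 true seed)
        ev-id (anomaly-evaluation anomaly prodo-dst)
        evdst-id ((fetch ev-id) "output_dataset_resource")
        score (avg-anomaly evdst-id)
        _ (if clean
              (map delete [traino-dst anomaly prodo-dst ev-id evdst-id])
              false))
    score))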
;; A sketch completing the anomaly-measures loop; only its closing lines
;; survived extraction, so the rest is reconstructed from the description
;; below.
(define (anomaly-measures train-dst train-exc prod-dst prod-exc
                          seed niter clean logf)
  (loop (iter 0
         scores-list [])
    (let (iter-seed (str seed "-" iter)
          score (anomaly-measure train-dst train-exc prod-dst prod-exc
                                 iter-seed clean)
          _ (if logf (log-info "Iteration " iter " score " score) false))
      (if (< (+ iter 1) niter)
          (recur (+ iter 1) (append scores-list score))
          (append scores-list score)))))
Inputs:
• train-dst: (string) ID of the training dataset.
• train-exc: (list) Fields to exclude from the training dataset.
• prod-dst: (string) ID of the production dataset.
• prod-exc: (list) Fields to exclude from the production dataset.
• seed: (string) A string used to make the sampling deterministic (see the "sample-dataset"
function).
• niter: (number) Number of iterations.
• clean: (boolean) Delete intermediate datasets before exiting the function.
• logf: (boolean) Enables logging.
Output: (list) A list of average anomaly scores between 0 and 1 for specified datasets over niter trials.
To facilitate logging and to illustrate the implementation of multiple train-evaluate iterations in a way
that can be easily augmented in specific applications, we implement iteration over anomaly-measure as
an explicit loop function.
The loop function inputs are an iteration count iter and a list of scores computed thus far scores-list.
Each iteration of the loop generates a unique seed value for sampling the training and production datasets
and computes the anomaly score with anomaly-measure. That score is appended to the scores-list.
If niter iterations have not been completed, the iteration count is updated and the loop repeated.
Inputs:
• train-dst: (string) ID of the training dataset.
• train-exc: (list) Fields to exclude from the training dataset.
• prod-dst: (string) ID of the production dataset.
• prod-exc: (list) Fields to exclude from the production dataset.
• seed: (string) A string used to make the sampling deterministic (see the "sample-dataset"
function).
• niter: (number) Number of iterations.
• clean: (boolean) Delete intermediate datasets before exiting the function.
• logf: (boolean) Enables logging.
Output: (float) The average anomaly scores between 0 and 1 for the specified datasets for the niter
trials.
Finally, in this top-level function we use anomaly-measures to compute a list of niter average anomaly
scores for pairs of samples of the training dataset train-dst and the production dataset prod-dst. We
then compute the average of the scores in that list and return that as a single measure of the covariate
shift between the training dataset train-dst and the production dataset prod-dst.
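A sketch (the function name is an assumption):

(define (anomaly-shift train-dst train-exc prod-dst prod-exc
                       seed niter clean logf)
  (let (scores (anomaly-measures train-dst train-exc prod-dst prod-exc
                                 seed niter clean logf))
    (/ (reduce + 0 scores) (count scores))))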
6.2 Examples
We end this tutorial description with a few examples of how to use the top-level functions.
If we only want an estimate of the data shift computed from a single pair of samples from the training
dataset and the production dataset, we can use the top-level function:
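;; (illustrative arguments, following the sketched signature)
(anomaly-measure train-dst [] prod-dst [] "example-seed" false)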
In this example we don’t exclude any of the fields from the training dataset and specify that we would
like to preserve all of the intermediate objects created in the computation by specifying clean as false.
Suppose next that we are concerned that our training or production datasets aren’t uniform in some sense.
Suppose also that the datasets include a dependent variable idvar that differs consistently between the
two datasets, potentially affecting the anomaly score. In this case, we can use the top-level function that
generates a series of anomaly scores:
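;; (illustrative arguments)
(anomaly-measures train-dst ["idvar"] prod-dst ["idvar"] "example-seed"
                  20 true true)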
Here we specify that we want a series of 20 train-evaluate iterations and that logging information should
be generated by specifying logf as true. To manage storage on our BigML account we also define clean
as true to delete all intermediate objects on each iteration.
Finally, we can run a series of train-evaluate trials and then compute the average of the anomaly scores
using the final top-level function as:
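;; (illustrative arguments)
(anomaly-shift train-dst ["idvar"] prod-dst ["idvar"] "example-seed"
               20 true true)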
In this case we specify the same sequence of anomaly score computations and then take the average
returned by the function as our estimate of covariate-shift between the training dataset train-dst and
the production dataset prod-dst.