Testing Machine Learning Systems - Code, Data and Models - Made With ML
Goku Mohandas
Intuition
Tests are a way for us to ensure that something works as intended. We're incentivized to
implement tests and discover sources of error as early in the development cycle as
possible so that we can reduce increasing downstream costs and wasted time. Once
we've designed our tests, we can automatically execute them every time we change or add to our codebase.
Types of tests
There are four major types of tests which are utilized at different points in the
development cycle:
1. Unit tests: tests on individual components that each have a single responsibility (ex. function that filters a list).
2. System tests: tests on the design of a system for expected outputs given inputs.
3. Acceptance tests: tests to verify that requirements have been met, usually referred to as User Acceptance Testing (UAT).
4. Regression tests: testing errors we've seen before to ensure new changes don't reintroduce them.
Note
There are many other types of functional and non-functional tests as well, such as
smoke tests (quick health checks), performance tests (load, stress), security tests,
etc., but we can generalize these under the system tests above.
The framework to use when composing tests is the Arrange Act Assert methodology.
Tip
Cleaning is an unofficial fourth step to this methodology because it's important to
not leave remnants of a previous state which may affect subsequent tests. We can
clean up this state after each test so that every test starts from a known state.
In Python, there are many tools, such as unittest, pytest, etc., that allow us to easily
implement our tests while adhering to the Arrange Act Assert framework above. These
tools come with powerful built-in functionality such as parametrization, filters, and more, which we'll explore below.
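For example, here's a minimal sketch of a pytest-style test organized with Arrange Act Assert (the dedupe function is hypothetical):

def dedupe(items):
    """Return unique items while preserving order."""
    return list(dict.fromkeys(items))


def test_dedupe():
    # Arrange: set up the inputs and the expected output
    items = ["nlp", "cv", "nlp"]
    expected = ["nlp", "cv"]
    # Act: run the component under test
    result = dedupe(items)
    # Assert: verify the result matches what we expect
    assert result == expected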
Note
When arranging our inputs and asserting our expected outputs, it's important to test
across the entire gamut of inputs and outputs:

inputs : data types, format, length, edge cases (min/max, small/large, etc.)
Best practices
Regardless of the framework we use, it's important to strongly tie testing into the
development process.
atomic : when creating unit components, we need to ensure that they have a single
responsibility so that we can easily test them. If not, we'll need to split them into more granular components.
regression : we want to account for new errors we come across with a regression
test so we can ensure we don't reintroduce the same errors in the future (see the sketch after this list).
coverage : we want to ensure that 100% of our codebase has been accounted for. This
doesn't mean writing a test for every single line of code but rather accounting for every line (more on this in the coverage section below).
automation : we want to automatically run tests for every commit. We'll learn how to do this locally using
pre-commit and remotely (ie. on the main branch) via GitHub Actions in subsequent lessons.
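For example, a minimal sketch of such a regression test (the clean_text helper and the bug it pins are hypothetical):

def clean_text(text):
    """Lowercase and strip text; this helper previously crashed on None inputs."""
    return (text or "").lower().strip()


def test_clean_text_handles_none():
    # Regression test: clean_text used to raise an AttributeError on None.
    assert clean_text(None) == ""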
Test-driven development
Test-driven development (TDD) is the process where you write a test before completely
writing the functionality to ensure that tests are always written. This is in contrast to
writing functionality first and then composing tests afterwards. Here are my thoughts on
this:
it's good to write tests as we progress, but it's not a guarantee of correctness.
initial time should be spent on design before ever getting into the code or tests.
using a test as a guide doesn't mean that our functionality is error free.
Perfect coverage doesn't mean that our application is error free if those tests aren't
meaningful and don't encompass the field of possible inputs, intermediates and
outputs. Therefore, we should work towards better design and agility when facing errors,
quickly resolving them and writing test cases around them to avoid them next time.
Warning

This topic is still highly debated and I'm only reflecting on my experience and what's
worked well for me at a large company (Apple), a very early stage startup and running a
company of my own. What's most important is that the team is producing reliable systems that are well tested.
Application
In our application, we'll be testing the code, data and models. Be sure to look inside
each of the different testing scripts after reading through the components below.
Note

Alternatively, we could've organized our tests by types of tests as well (unit,
integration, etc.) but I find it more intuitive for navigation to organize them by how our
application is set up. We'll learn about markers below, which will allow us to run any subset of tests we want.
🧪 Pytest
We're going to be using pytest as our testing framework for its powerful built-in features such as parametrization, fixtures, markers and more.
Configuration
We organize our tests under a tests directory and can use our pyproject.toml file to configure this (and any other) test path directories as well.
Once in the directory, pytest looks for Python scripts starting with test_*.py, but we can configure it to read any other file patterns as well.
# Pytest
[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = "test_*.py"
Assertions
Let's see what a sample test and its results look like. Assume we have a simple function that determines whether or not a fruit is crisp:
# food/fruits.py
def is_crisp(fruit):
    if fruit:
        fruit = fruit.lower()
    if fruit in ["apple", "watermelon", "cherries"]:
        return True
    elif fruit in ["orange", "mango", "strawberry"]:
        return False
    else:
        raise ValueError(f"{fruit} not in known list of fruits.")
    return False
To test this function, we can use assert statements to map inputs with expected
outputs:
1 # tests/food/test_fruits.py
2 def test_is_crisp():
3     assert is_crisp(fruit="apple")  # or == True
4     assert is_crisp(fruit="Apple")
5     assert not is_crisp(fruit="orange")
6     with pytest.raises(ValueError):
7         is_crisp(fruit=None)
8         is_crisp(fruit="pear")
Note
We can also have assertions about exceptions like we do in lines 6-8 where all the
operations under the with statement are expected to raise the specified exception.
Execution
We can execute our tests above using several different levels of granularity, ex. all tests (python -m pytest), tests in a directory (python -m pytest tests/food), tests in a script (python -m pytest tests/food/test_fruits.py) or a specific test function (python -m pytest tests/food/test_fruits.py::test_is_crisp).
Running our specific test above would produce output showing the test that ran and its status.
Had any of our assertions in this test failed, we would see the failed assertions as well
as the expected output and the output we received from our function.
Note
It's important to test for the variety of inputs and expected outputs that we outlined
above and to never assume that a test is trivial. In our example above, it's important
that we test for both "apple" and "Apple" in the event that our function didn't account
for casing!
Classes
We can also test classes and their respective functions by creating test classes. Within
our test class, we can optionally define setup and teardown functions which will automatically be executed before and after the class's tests and each test method:
class Fruit(object):
    def __init__(self, name):
        self.name = name


class TestFruit(object):
    @classmethod
    def setup_class(cls):
        """Set up the state for any class instance."""
        pass

    @classmethod
    def teardown_class(cls):
        """Teardown the state created in setup_class."""
        pass

    def setup_method(self):
        """Called before every method to setup any state."""
        self.fruit = Fruit(name="apple")

    def teardown_method(self):
        """Called after every method to teardown any state."""
        del self.fruit

    def test_init(self):
        assert self.fruit.name == "apple"
We can execute all the tests for our class by specifying the class name (ex. python -m pytest tests/food/test_fruits.py::TestFruit):
tests/food/test_fruits.py::TestFruit . [100%]
We use test classes to test all of our class modules such as LabelEncoder ,
Tokenizer , CNN , etc.
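For instance, a sketch of a class-based test for an encoder; the LabelEncoder here is a minimal stand-in, not the project's actual implementation:

class LabelEncoder:
    """Minimal stand-in encoder used only to illustrate a test class."""

    def fit(self, labels):
        self.class_to_index = {label: i for i, label in enumerate(sorted(set(labels)))}
        return self

    def encode(self, labels):
        return [self.class_to_index[label] for label in labels]


class TestLabelEncoder:
    def setup_method(self):
        """Create a fresh, fitted encoder before every test."""
        self.encoder = LabelEncoder().fit(["cv", "nlp", "cv"])

    def teardown_method(self):
        """Clean up state after every test."""
        del self.encoder

    def test_fit(self):
        assert self.encoder.class_to_index == {"cv": 0, "nlp": 1}

    def test_encode(self):
        assert self.encoder.encode(["nlp", "cv"]) == [1, 0]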
Parametrize
So far, in our tests, we've had to create individual assert statements to validate different
combinations of inputs and expected outputs. There's quite a bit of redundancy
here because the inputs always feed into our functions as arguments and the outputs
are compared with our expected outputs. To remove this redundancy, pytest has the
parametrize decorator, which allows us to represent our inputs and expected
outputs as parameters.
1 @pytest.mark.parametrize(
2     "fruit, crisp",
3     [
4         ("apple", True),
5         ("Apple", True),
6         ("orange", False),
7     ],
8 )
9 def test_is_crisp_parametrize(fruit, crisp):
10     assert is_crisp(fruit=fruit) == crisp
1. [Line 2]: define the names of the parameters under the decorator, ex. "fruit, crisp".
2. [Lines 3-7]: provide a list of combinations of values for the parameters from Step 1.
3. [Line 10]: include the necessary assert statements, which will be executed for each of
the combinations in the list from Step 2.
In our application, we use parametrization to test components that require varied sets of inputs and expected outputs, such as expected exceptions:
@pytest.mark.parametrize(
    "fruit, exception",
    [
        ("pear", ValueError),
    ],
)
def test_is_crisp_exceptions(fruit, exception):
    with pytest.raises(exception):
        is_crisp(fruit=fruit)
Fixtures
Parametrization allows us to efficiently reduce redundancy inside test functions, but
what about the inputs themselves? Here, we can use pytest's built-in fixture, which is a function that is
executed before the test function. This significantly reduces redundancy when multiple
test functions require the same inputs:
@pytest.fixture
def my_fruit():
    fruit = Fruit(name="apple")
    return fruit


def test_fruit(my_fruit):
    assert my_fruit.name == "apple"
We can apply fixtures to classes as well, where the fixture function will be invoked when any test method in the class is run:
@pytest.mark.usefixtures("my_fruit")
class TestFruit:
    ...
We use fixtures to efficiently pass a set of inputs (ex. a Pandas DataFrame) to different testing functions that require them:
@pytest.fixture
def df():
    projects_fp = Path(config.DATA_DIR, "projects.json")
    projects_dict = utils.load_dict(filepath=projects_fp)
    df = pd.DataFrame(projects_dict)
    return df


def test_split(df):
    splits = split_data(df=df)
    ...
Note

Typically, when we have too many fixtures in a particular test file, we can organize
them in a central location (ex. pytest's conftest.py file) so they can be shared across test files, as sketched below.
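For example, a minimal sketch of a shared conftest.py; the config and utils imports are assumptions about our project layout:

# tests/conftest.py
from pathlib import Path

import pandas as pd
import pytest

from config import config  # assumed project config exposing DATA_DIR
from tagifai import utils  # assumed helper exposing load_dict()


@pytest.fixture
def df():
    # Available to every test in this directory without an explicit import.
    projects_fp = Path(config.DATA_DIR, "projects.json")
    projects_dict = utils.load_dict(filepath=projects_fp)
    return pd.DataFrame(projects_dict)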
Markers
We've been able to execute our tests at various levels of granularity (all tests, script,
function, etc.) but we can create custom granularity by using markers. We've already
used one type of marker (parametrize) but there are several other built-in markers as
well. For example, the skipif marker allows us to skip execution of a test if a condition is
met.
@pytest.mark.skipif(
    not torch.cuda.is_available(),
    reason="Full training tests require a GPU."
)
def test_training():
    pass
We can also create our own custom markers with the exception of a few reserved
marker names.
@pytest.mark.fruits
def test_fruit(my_fruit):
    assert my_fruit.name == "apple"
The proper way to use markers is to explicitly list the ones we've created in our
pyproject.toml file. Here we can specify that all markers must be defined in this file with
the --strict-markers flag and then declare our markers (with some info about them) under the same configuration.
Once we do this, we can view our existing list of markers by executing pytest --markers,
and we'll also receive an error when we try to use a new marker that's not
defined here.
We use custom markers to label which of our test functions involve training so we can run or skip them selectively (ex. pytest -m training):
@pytest.mark.training
def test_train_model():
    experiment_name = "test_experiment"
    run_name = "test_run"
    result = runner.invoke()
    ...
Note

Another way to run custom groups of tests is to use the -k flag when running pytest. The -k
expression is much less strict than the marker expression since it simply matches against substrings of test names.
Coverage
As we're developing tests for our application's components, it's important to know how
well we're covering our code base and to know if we've missed anything. We can use the
Coverage library to track and visualize how much of our codebase our tests account for.
With pytest, it's even easier to use this package thanks to the pytest-cov plugin. We can run our tests with coverage enabled (ex. python -m pytest --cov tagifai --cov-report html) and
generate the report in HTML format. When we run this, we'll see the tests from our tests
directory executing while the coverage plugin keeps track of which lines in our
application are being executed. Once our tests are complete, we can view the
generated report to see which parts of our code were covered by tests and which parts were not covered by any tests. This is especially useful when we forget to
test newly added functionality.
Though we have 100% coverage, this does not mean that our application is perfect.
Coverage only indicates that a piece of code executed in a test, not necessarily that
every part of it was tested, let alone thoroughly tested. Therefore, coverage should
never be treated as a guarantee of correctness. However, it is still very useful to
maintain coverage at 100% so we can know when new functionality has yet to be
tested. In our CI/CD lesson, we'll see how to use GitHub Actions to make 100% coverage a requirement.
Exclusions
Sometimes it doesn't make sense to write tests to cover every single line in our
application, yet we still want to account for these lines so we can maintain 100% coverage. For example, we can omit entire files from the coverage calculation in our pyproject.toml configuration:
# Pytest coverage
[tool.coverage.run]
omit = ["app/main.py"]  # sample API calls
The key here is that we were able to add justification for these exclusions through configuration and comments so our team can follow our reasoning.
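Besides omitting entire files, we can also exclude specific lines with a pragma comment; coverage.py only needs the "pragma: no cover" marker, and text after it can serve as our justification. A sketch (the start_app function is hypothetical):

def start_app():  # pragma: no cover, exercised via end-to-end API tests
    # Hypothetical entry point we intentionally exclude from coverage.
    ...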
Machine learning
Now that we have a foundation for testing traditional software, let's dive into testing our data and models.
🔢 Data
We've already tested the functions that act on our data through unit and integration
tests but we haven't tested the validity of the data itself. Once we define what our data
should look like, we can use (and add to) these expectations as our dataset grows.
Expectations
There are many dimensions to what our data is expected to look like. We'll briefly talk
about a few of them, including ones that may not directly be applicable to our task but are still worth being aware of:

rows / cols : the most basic expectation is validating the presence of samples
(rows) and features (columns). These can help identify mismatches between upstream data sources and our current dataset.
individual values : we can also have expectations about the individual values of
specific features, such as:
- missing values
- feature value relationships with other feature values (ex. column 1 values must always be greater than column 2 values)
aggregate values : we can also have expectations about all the values of specific
features, such as:
- aggregate statistics and distribution shifts compared with previous versions of the data (useful for detecting drift)
To implement these expectations, we'll leverage the open-source library Great Expectations. It's a fantastic library that
already comes with many built-in expectations (on tables, columns, distributions, etc.) and allows us to create custom expectations as well. It also provides
modules to seamlessly connect with backend data sources such as local file systems,
S3, databases and even DAG runners. Let's explore the library by implementing the expectations we'll need for our application.
First we'll load the data we'd like to apply our expectations on. We can load our data from
a variety of sources (filesystem, S3, DB, etc.) which we can then wrap around a Dataset module, as sketched below.
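A minimal sketch of loading and wrapping the data with a Pandas-backed Dataset; the config and utils imports are assumptions about the project layout used elsewhere in this lesson:

# Load the projects data and wrap it in a Great Expectations Dataset.
from pathlib import Path

import great_expectations as ge

from config import config  # assumed project config exposing DATA_DIR
from tagifai import utils  # assumed helper exposing load_dict()

projects_fp = Path(config.DATA_DIR, "projects.json")
projects_dict = utils.load_dict(filepath=projects_fp)
df = ge.dataset.PandasDataset(projects_dict)  # a DataFrame with expectation methods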
[Sample rows from the projects DataFrame: ex. id 2437 "Rasoee" ("A powerful web and mobile application that ide...") with tags [api, article, code, dataset, paper, research, ...]; id 2435 "Top “Applied Data Science” Papers from ECML-PK..." ("Explore the innovative world of Machine Learni...") with tags [article, deep-learning, machine-learning, adv...]]
Built-in
Once we have our data source wrapped in a Dataset module, we can compose and apply expectations on it. There are many built-in expectations to choose from:
# Presence of features
expected_columns = ["id", "title", "description", "tags"]
df.expect_table_columns_to_match_ordered_list(column_list=expected_columns)

# Unique
df.expect_column_values_to_be_unique(column="id")

# No null values
df.expect_column_values_to_not_be_null(column="title")
df.expect_column_values_to_not_be_null(column="description")
df.expect_column_values_to_not_be_null(column="tags")

# Type
df.expect_column_values_to_be_of_type(column="title", type_="str")
df.expect_column_values_to_be_of_type(column="description", type_="str")
df.expect_column_values_to_be_of_type(column="tags", type_="list")

# Data leaks
df.expect_compound_columns_to_be_unique(column_list=["title", "description"])
Each of these expectations will create an output with details about success or failure,
expected and observed values, expectations raised, etc. For example, the type expectation on the title column would produce the following output if it succeeded:
{
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "success": true,
  "meta": {},
  "expectation_config": {
    "kwargs": {
      "column": "title",
      "type_": "str",
      "result_format": "BASIC"
    },
    "meta": {},
    "expectation_type": "_expect_column_values_to_be_of_type__map"
  },
  "result": {
    "element_count": 2032,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  }
}
and this output if it failed (notice the counts and examples for what caused the failure):
{
  "success": false,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "expectation_config": {
    "meta": {},
    "kwargs": {
      "column": "title",
      "type_": "int",
      "result_format": "BASIC"
    },
    "expectation_type": "_expect_column_values_to_be_of_type__map"
  },
  "result": {
    "element_count": 2032,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 2032,
    "unexpected_percent": 100.0,
    "unexpected_percent_nonmissing": 100.0,
    "partial_unexpected_list": [
      "How to Deal with Files in Google Colab: What You Need to Know",
      "Machine Learning Methods Explained (+ Examples)",
      "OpenMMLab Computer Vision",
      "..."
    ]
  },
  "meta": {}
}
We can group all the expectations together to create an Expectation Suite object, which we can use to validate any Dataset module:
# Expectation suite
expectation_suite = df.get_expectation_suite()
print(df.validate(expectation_suite=expectation_suite, only_return_failures=True))
{
  "success": true,
  "results": [],
  "statistics": {
    "evaluated_expectations": 9,
    "successful_expectations": 9,
    "unsuccessful_expectations": 0,
    "success_percent": 100.0
  },
  "evaluation_parameters": {}
}
Custom
Our tags feature column is a list of tags for each input. The Great Expectations library
doesn't come equipped to process a list feature, but we can easily do so by creating a
custom expectation.
1. Define expectation functions that map to each individual row of the feature column and decorate them with the
appropriate decorator:
class CustomPandasDataset(ge.dataset.PandasDataset):
    _data_asset_type = "CustomPandasDataset"

    @ge.dataset.MetaPandasDataset.column_map_expectation
    def expect_column_list_values_to_be_not_null(self, column):
        return column.map(lambda x: None not in x)

    @ge.dataset.MetaPandasDataset.column_map_expectation
    def expect_column_list_values_to_be_unique(self, column):
        return column.map(lambda x: len(x) == len(set(x)))
2. Wrap the data with the custom Dataset module and use the custom expectations:
df = CustomPandasDataset(projects_dict)
df.expect_column_values_to_not_be_null(column="tags")
df.expect_column_list_values_to_be_unique(column="tags")
Note

There are various levels of abstraction (following a template vs. completely from
scratch) available for creating custom expectations with Great Expectations.
Projects
So far we've worked with the Great Expectations library at the Python script level, but we can organize our expectations further by creating a Project:
1. Initialize the Project using great_expectations init. This will interactively walk us
through setting up data sources, naming, etc. and set up a great_expectations directory with the configuration.
2. Define our custom module under the plugins directory and use it to define our data source in great_expectations.yml:
datasources:
  data:
    class_name: PandasDatasource
    data_asset_type:
      module_name: custom_module.custom_dataset
      class_name: CustomPandasDataset
    module_name: great_expectations.datasource
    batch_kwargs_generators:
      subdir_reader:
        class_name: SubdirReaderBatchKwargsGenerator
        base_directory: ../assets/data
3. Create expectations using the profiler, which creates automatic expectations
based on the data, or we can also create our own expectations. All of this is done
through the CLI, and the resulting suites are saved under the great_expectations/expectations directory.

When using the automatic profiler, you can choose which feature columns to
apply profiling to. Since our tags feature is a list feature, we'll leave it commented out
and create our own expectations using the suite edit command.
4. Create Checkpoints where a suite of expectations is applied to a specific data asset. We can then run these Checkpoints programmatically, via the CLI,
via a Makefile, or with a workflow orchestrator like Airflow, etc. We can also use the Great
Expectations GitHub Action to automate validating our data pipeline code when we
push a change. More on using these Checkpoints with pipelines in our workflows
lesson.
Data docs
When we create expectations, Great Expectations automatically generates documentation for our tests. It also stores information about
validation runs and their results. We can build and launch the generated data documentation with
the great_expectations docs build CLI command.
Best practices
We've applied expectations on our source dataset, but there are many other key areas
to test the data as well. Throughout the ML development pipeline, we should also test the
intermediate outputs of processes such as preprocessing, tokenization, etc. We'll use these expectations to monitor new batches
of data and to validate them before combining them with our existing data assets.
Note

Currently, these data processing steps are tied to our application code but, in
future lessons, we'll separate these into individual pipelines and use Great
Expectations to validate the data that moves between them in an orchestrated fashion.
🤖 Models
The other half of testing ML systems involves testing our models during training, evaluation, inference and deployment.
Training
We want to write tests iteratively while we're developing our training pipelines so we can
catch errors quickly. This is especially important because, unlike traditional software, ML
systems can run to completion without throwing any exceptions / errors but can
produce incorrect systems. We also want to catch errors quickly to save on time and
compute.
Overfit on a batch : ensure our model can fit a single batch of data nearly perfectly (a minimal sketch of this check appears after this list). We can also assert on the state of a full training run, for example that the learning rate stayed within bounds and that the expected artifacts were produced:

train(model)
assert learning_rate >= min_learning_rate
assert artifacts

On different devices : ensure our model can train on the different devices (ex. CPU and GPU) it may encounter.
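A minimal sketch of the overfit check, where the model, data and threshold are stand-ins rather than our application's actual components:

# Sanity check: the model should be able to (over)fit a single small batch.
import torch
import torch.nn as nn


def test_overfit_one_batch():
    torch.manual_seed(1234)
    model = nn.Linear(10, 2)  # stand-in for our actual model (ex. CNN)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-1)
    X = torch.randn(8, 10)          # one small batch of inputs
    y = torch.randint(0, 2, (8,))   # one small batch of labels
    for _ in range(500):            # repeatedly train on the same batch
        optimizer.zero_grad()
        loss = criterion(model(X), y)
        loss.backward()
        optimizer.step()
    assert loss.item() < 0.05  # heuristic threshold; the loss should collapse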
Note

We can mark compute-intensive tests with a pytest marker and only execute
them when there is a change being made to a part of the system that affects the model.
@pytest.mark.training
def test_train_model():
    ...
Evaluation
When it comes to testing how well our model performs, we need to first have our
priorities in line.
Overall
We want to ensure that our key metrics on the overall dataset improve with each
iteration of our model. Overall metrics include accuracy, precision, recall, f1, etc., and the ML
developers and domain experts will establish what the key metric(s) are while also keeping other important metrics in check.
Slicing
Just inspecting the overall metrics isn't enough to deploy our new version to production.
There may be key slices of our dataset that we expect to do really well on (ie. minority
groups, large customers, etc.) and we need to ensure that their metrics are also improving (or at least not regressing).
# tagifai/eval.py
from snorkel.slicing import slicing_function


@slicing_function()
def cv_transformers(x):
    """Projects with the `computer-vision` and `transformers` tags."""
    return all(tag in x.tags for tag in ["computer-vision", "transformers"])
Here we're using Snorkel's slicing_function to create our different slices. We can
visualize our slices by applying this slicing function to a relevant DataFrame using
slice_dataframe.
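For example, a minimal sketch of inspecting one slice (assuming the projects DataFrame df from earlier):

from snorkel.slicing import slice_dataframe

# View just the data points that belong to the cv_transformers slice.
cv_transformers_df = slice_dataframe(df, cv_transformers)
print(cv_transformers_df[["text", "tags"]].head())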
[Sample rows from the sliced DataFrame (columns id, text, tags): ex. id 15, "hugging captions generate realistic instagram ..." with tags [computer-vision, huggingface, language-modeli...]]
We can define even more slicing functions and create a slices record array using the
PandasSFApplier. The slices array has N (# of data points) items and each item has S (#
of slicing functions) items, indicating whether that data point is part of that slice. Think
of this record array as a masking layer for each slicing function on our data.
# tagifai/eval.py | get_performance()
from snorkel.slicing import PandasSFApplier

slicing_functions = [cv_transformers, short_text]
applier = PandasSFApplier(slicing_functions)
slices = applier.apply(df)
print(slices)
[(0, 0) (0, 1) (0, 0) (1, 0) (0, 0) (0, 0) (0, 0) (0, 0) (0, 0) (0, 0)
(1, 0) (0, 0) (0, 1) (0, 0) (0, 0) (1, 0) (0, 0) (0, 0) (0, 1) (0, 0)
...
(0, 0) (1, 0) (0, 0) (0, 0) (0, 0) (0, 0) (0, 0) (0, 0) (0, 0) (0, 0)
(0, 0) (0, 0) (1, 0) (0, 0) (0, 0) (0, 0) (1, 0)]
Once we have our slices record array, we can compute the performance metrics for each
slice.
# tagifai/eval.py | get_performance()
for slice_name in slices.dtype.names:
    mask = slices[slice_name].astype(bool)
    metrics = precision_recall_fscore_support(y_true[mask], y_pred[mask], average="micro")
Note

Snorkel comes with a built-in slice scorer, but we implemented a naive version tailored to our application.
We can add these slice performance metrics to our larger performance report so we can use them when comparing and choosing models.
Extensions
We've explored user-generated slices, but there is currently quite a bit of research on
automatically generated slices and on overall model robustness, including toolkits that discover and evaluate critical slices automatically.
Instead of passively observing slice performance, we could try and improve them.
Usually, a slice may exhibit poor performance when there are too few samples and so a
natural approach is to oversample. However, these methods change the underlying data
distribution and can cause issues with overall / other slices. It's also not scalable to train
a separate model for each unique slice and combine them via Mixture of Experts (MoE).
To combat all of these technical challenges and more, the Snorkel team introduced the
Slice Residual Attention Modules (SRAMs), which can sit on any backbone architecture
(ie. our CNN feature extractor) and learn slice-aware representations for the class
predictions.
[Figure: Slice Residual Attention Modules (SRAMs)]
Inference
When our model is deployed, most users will be using it for inference (directly or indirectly via downstream applications), so it's very important that we test every aspect of it.
This is the first time we're not loading our components from in-memory, so we want to
ensure that the required artifacts (model weights, encoders, config, etc.) are all able to
be loaded.
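A sketch of such a test; the load_artifacts helper, the artifact keys and the run_id fixture below are assumptions about our application's interface rather than its actual API:

# Sketch: verify that everything needed for inference can be loaded from disk.
from tagifai import main  # hypothetical module exposing load_artifacts()


def test_load_artifacts(run_id):  # run_id assumed to come from a fixture
    artifacts = main.load_artifacts(run_id=run_id)  # hypothetical helper
    assert artifacts["model"] is not None           # model weights
    assert artifacts["label_encoder"] is not None   # encoders
    assert artifacts["params"] is not None          # config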
Prediction
Once we have our artifacts loaded, we're ready to test our prediction pipelines. We
should test samples with just one input, as well as a batch of inputs (ex. padding can sometimes have unintended consequences):
# tests/app/test_api.py | test_best_predict()
data = {
    "run_id": "",
    "texts": [
        {"text": "Transfer learning with transformers for self-supervised learning."},
        {"text": "Generative adversarial networks in both PyTorch and TensorFlow."},
    ],
}
response = client.post("/predict", json=data)
assert response.json()["status-code"] == HTTPStatus.OK
assert response.json()["method"] == "POST"
assert len(response.json()["data"]["predictions"]) == len(data["texts"])
...
Besides just testing if the prediction pipelines work, we also want to ensure that they
work well. Behavioral testing is the process of testing input data and expected outputs
while treating the model as a black box. The tests don't necessarily have to be adversarial in
nature but are more along the lines of the types of perturbations we'll see in the real world once our
model is deployed. A landmark paper on this topic is Beyond Accuracy: Behavioral Testing
of NLP Models with CheckList, which breaks down behavioral testing into three types of
tests:

invariance (INV): changes to the input that should not affect the output.
directional (DIR): changes to the input that should affect the output in an expected way.
minimum functionality (MFT): a simple combination of inputs and expected outputs.
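As a sketch of what these could look like for our tagging task, written against a hypothetical predict_tags(text) helper rather than the checklist library:

# Sketch of INV / DIR / MFT style behavioral tests.
# predict_tags(text) -> list of tags is a hypothetical prediction helper.

def test_invariance():
    # INV: swapping one verb for a similar one shouldn't drop the core tag.
    texts = [
        "Transformers applied to NLP have revolutionized machine learning.",
        "Transformers applied to NLP have disrupted machine learning.",
    ]
    assert all("natural-language-processing" in predict_tags(text) for text in texts)


def test_directional():
    # DIR: changing the subject matter should change the predicted tags.
    assert "computer-vision" in predict_tags("Image classification with CNNs.")
    assert "computer-vision" not in predict_tags("Text classification with RNNs.")


def test_minimum_functionality():
    # MFT: a simple, unambiguous input should map to the expected tag.
    assert "graph-neural-networks" in predict_tags(
        "Node classification with graph neural networks."
    )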
Be sure to explore the NLP Checklist package, which simplifies and augments the
creation of these behavioral tests via functions, templates, pretrained models and more.
In our application, we compose a behavioral report that captures which of these tests are passed by a particular instance of a trained model. This report is then saved along with
the run's artifacts so we can use this information when choosing which model(s) to
deploy to production.
{
  "score": 1.0,
  "results": {
    "passed": [
      {
        "input": {
          "text": "Transformers have revolutionized the ML field.",
          "tags": [
            "transformers"
          ]
        },
        "prediction": {
          "input_text": "Transformers have revolutionized the ML field.",
          "preprocessed_text": "transformers revolutionized ml field",
          "predicted_tags": [
            "natural-language-processing",
            "transformers"
          ]
        },
        "type": "INV"
      },
      ...
      {
        "input": {
          "text": "graph neural networks have revolutionized machine learning.",
          "tags": [
            "graph-neural-networks"
          ]
        },
        "prediction": {
          "input_text": "graph neural networks have revolutionized machine learning.",
          "preprocessed_text": "graph neural networks revolutionized machine learning",
          "predicted_tags": [
            "graph-neural-networks",
            "graphs"
          ]
        },
        "type": "MFT"
      }
    ],
    "failed": []
  }
}
Warning

When you create additional behavioral tests, be sure to reevaluate all the models
you're considering on the new set of tests so their scores can be compared. We can
do this since behavioral tests are not dependent on data or model versions and can be applied across them.
Sorted runs
We can combine our overall / slice metrics and our behavioral tests to create a holistic
evaluation report for each model run. We can then use this information to choose which model(s) to deploy to production.
Deployment
There are also a whole class of model tests that are beyond metrics or behavioral
testing and focus on the system as a whole. Many of them involve testing and
benchmarking the tradeoffs (ex. latency, compute, etc.) we discussed in the
baselines lesson. These tests also need to be performed across the different systems (ex.
devices) that our model may be on. For example, development may happen on a CPU but
the deployed model may be loaded on a GPU, and there may be incompatible
components (ex. reparametrization) that may cause errors. As a rule of thumb, we
should test with the system specifications that our production environment utilizes.
Note

We'll automate tests on different devices in our CI/CD lesson, where we'll use GitHub
Actions to spin up our application with Docker Machine on cloud compute instances.
Once we've tested our model's ability to perform in the production environment (offline
tests), we can run several types of online tests to determine the quality of that
performance.
AB tests : sending production traffic to our different systems and comparing key metrics to decide which version to keep.
Shadow tests : sending the same production traffic to the new system in parallel without serving its results; this is a
safe form of online evaluation since the new system's results are not served.
We'll conclude by talking about the similarities and distinctions between testing and
monitoring. They're both integral parts of the ML development pipeline and depend on
each other for iteration. Testing is assuring that our system (code, data and models)
behaves the way we intend at the current time t0. Whereas, monitoring is ensuring that
the conditions (ie. distributions) during development are maintained and also that the
tests that passed at t0 continue to hold true post deployment through tn. When this is
no longer true, we need to inspect more closely (retraining may not always fix our root
problem).
With monitoring, there are quite a few distinct concerns that we didn't have to consider during testing, for example:

determining model performance (rolling and window metrics on overall and slices of
data) using indirect signals (since labels may not be readily available).
in situations with large data, we need to know which data points to label and upsample for training.
We'll cover all of these concepts in much more depth (and code) in our monitoring
lesson.
Resources
Great Expectations
The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction
Slices
@article{madewithml,
    title = "Testing - Made With ML",
    author = "Goku Mohandas",
    url = "https://madewithml.com/courses/applied-ml/testing/",
    year = "2021",
}