How To Structure An ML Project For Reproducibility
Motivation
Getting started is often the most challenging part of building ML
projects. How should you structure your repository? Which standards
should you follow? Will your teammates be able to reproduce the results of
your experiments?
Instead of trying to find an ideal repository structure, wouldn’t it be nice to
have a template to get started?
This cookiecutter template provides exactly that. To scaffold a new project
from it, run:

cookiecutter https://github.com/khuyentran1401/data-science-template
In the following few sections, we will detail some valuable features of this
template.
Create a Readable Structure
The structure of the project created from the template is standardized and
easy to understand.
.
├── data
│   ├── final                  # data after training the model
│   ├── processed              # data after processing
│   ├── raw                    # raw data
├── docs                       # documentation for your project
├── .flake8                    # configuration for code formatter
├── .gitignore                 # ignore files that cannot commit to Git
├── Makefile                   # store commands to set up the environment
├── models                     # store models
├── notebooks                  # store notebooks
├── .pre-commit-config.yaml    # configurations for pre-commit
├── pyproject.toml             # dependencies for poetry
├── README.md                  # describe your project
├── src                        # store source code
│   ├── __init__.py            # make src a Python module
│   ├── config.py              # store configs
│   ├── process.py             # process data before training model
│   ├── run_notebook.py        # run notebook
│   └── train_model.py         # train model
└── tests                      # store tests
    ├── __init__.py            # make tests a Python module
    ├── test_process.py        # test functions for process.py
    └── test_train_model.py    # test functions for train_model.py
Efficiently Manage Dependencies
Poetry is a Python dependency management tool and an alternative to pip.
Among other benefits, Poetry helps you avoid installing new packages that
conflict with existing ones.
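For example, adding a new package goes through Poetry instead of pip;
Poetry resolves the new version against everything already declared in
pyproject.toml and raises an error if the versions conflict:

poetry add pandas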
The template's Makefile stores short, readable commands for setting up the
environment:

initialize_git:
	@echo "Initializing git..."
	git init

install:
	@echo "Installing..."
	poetry install
	poetry run pre-commit install

activate:
	@echo "Activating virtual environment"
	poetry shell

download_data:
	@echo "Downloading data..."
	wget https://gist.githubusercontent.com/khuyentran1401/a1abde0a7d27d31c7dd08f34a2c29d8f/raw/da2b0f2c9743e102b9dfa6cd75e94708d01640c9/Iris.csv -O data/raw/iris.csv
Now, others can set up the project environment with only two commands:

make setup
make activate

And a series of commands will be run! View the full Makefile here.
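A plausible sketch of how the setup target chains the steps above (the
exact recipe lives in the template's Makefile):

# A sketch — the real target may also run download_data
setup: initialize_git install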
Rerun Only Modified Components of a Pipeline
Make is also useful when you want to rerun a task whenever its dependencies
are modified. As an example, let's capture the dependencies between the
files in this pipeline through a Makefile:
data/processed/xy.pkl: data/raw src/process.py
	@echo "Processing data..."
	python src/process.py
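The trained model can depend on the processed data in the same way. A
sketch consistent with the output below (the exact prerequisites may differ
in the template's Makefile):

models/svc.pkl: data/processed/xy.pkl src/train_model.py
	@echo "Training model..."
	python src/train_model.py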
Running make models/svc.pkl now processes the data and trains the model in
one go:

$ make models/svc.pkl
Processing data...
python src/process.py
Training model...
python src/train_model.py
Rerun the command without modifying any of its dependencies, and Make skips
the work:

$ make models/svc.pkl
make: `models/svc.pkl' is up to date.
Observe and Automate Your Code
This template leverages Prefect to observe and automate your pipeline,
adding capabilities such as logging, retries, and scheduled runs. You can
access these features by simply turning your function into a Prefect flow:
from prefect import flow

from config import Location, ProcessConfig  # both defined in src/config.py

@flow
def process(
    location: Location = Location(),
    config: ProcessConfig = ProcessConfig(),
):
    ...
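Beyond the flow itself, individual steps can be wrapped as Prefect tasks to
gain retries and observability. A minimal runnable sketch, assuming
Prefect 2 — the task and its parameters are illustrative, not taken from
the template:

from prefect import flow, task

@task(retries=3, retry_delay_seconds=5)  # rerun this step up to 3 times on failure
def fetch_raw_data() -> list[int]:
    # Stand-in for reading or downloading the raw data
    return [1, 2, 3]

@flow
def demo_pipeline() -> list[int]:
    # Each task call shows up as a tracked run in the Prefect UI
    return fetch_raw_data()

if __name__ == "__main__":
    demo_pipeline()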
Enforce Type Hints At Runtime
Pydantic is a Python library that leverages type annotations for data
validation.
Pydantic models enforce data types on flow parameters and validate their
values when a flow run is executed.
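For example, a ProcessConfig model could look like this minimal sketch —
test_size matches the error example below, while the other fields and their
defaults are assumptions for illustration:

from pydantic import BaseModel

class ProcessConfig(BaseModel):
    # Illustrative fields; test_size is the one validated in the
    # example that follows
    drop_columns: list[str] = ["Id"]
    label: str = "Species"
    test_size: float = 0.3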
If the value of a field doesn’t match the type annotation, you will get an
error at runtime:
process(config=ProcessConfig(test_size='a'))

pydantic.error_wrappers.ValidationError: 1 validation error for ProcessConfig
test_size
  value is not a valid float (type=type_error.float)
Type annotations catch invalid values at runtime, but other issues, such as
formatting and style violations, are better caught before the code is
committed, and checking for them manually is tedious. pre-commit is a
framework that identifies issues in your code before you commit it.
You can add different plugins to your pre-commit pipeline. When you commit,
your files are validated against these plugins, and unless all checks pass,
no code will be committed.
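You can also run every hook against all files at any time without
committing (standard pre-commit usage):

pre-commit run --all-files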
You can find all plugins used in this template in this
.pre-commit-config.yaml file.
Automatically Document Your Code
Data scientists often collaborate with other team members on a project.
Thus, it is essential to create good documentation for the project.
This template automatically builds the documentation for your project. To
generate and view it locally, run:

make docs_view
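If the documentation is generated from docstrings, as docstring-based tools
such as pdoc do, well-written docstrings pay off directly. An illustrative
example, not taken from the template:

def process_data(test_size: float) -> None:
    """Split the raw data into train and test sets.

    Args:
        test_size: Fraction of the data held out for testing.
    """
    ...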
Automatically Run Tests
When you create a pull request on GitHub, the tests in your tests folder
will automatically run. View the code for this workflow.
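For instance, a test in tests/test_process.py could check the configuration
defaults. A hypothetical test building on the ProcessConfig sketch above:

from src.config import ProcessConfig  # import path assumed from the project layout

def test_process_config_test_size_is_valid_fraction():
    config = ProcessConfig()
    # test_size should be a float strictly between 0 and 1
    assert isinstance(config.test_size, float)
    assert 0 < config.test_size < 1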
Conclusion
Congratulations! You have just learned how to use a template to create a
reusable and maintainable ML project. This template is meant to be flexible.
Feel free to adjust the project to fit your application.