
How to Structure an ML Project for Reproducibility and Maintainability

Motivation
Getting started is often the most challenging part of building ML projects. How should you structure your repository? Which standards should you follow? Will your teammates be able to reproduce the results of your experiments?
Instead of trying to find an ideal repository structure, wouldn’t it be nice to
have a template to get started?

That is why I created data-science-template, consolidating best practices I’ve learned over the years about structuring data science projects.
This template allows you to:

Create a readable structure for your project
Efficiently manage dependencies in your project
Create short and readable commands for repeatable tasks
Rerun only modified components of a pipeline
Observe and automate your code
Enforce type hints at runtime
Check issues in your code before committing
Automatically document your code
Automatically run tests when committing your code

Tools Used in This Template
This template is lightweight and uses only tools that can generalize to
various use cases. Those tools are:

Poetry: manage Python dependencies
Prefect: orchestrate and observe your data pipeline
Pydantic: validate data using Python type annotations
pre-commit plugins: ensure your code is well-formatted, tested, and documented, following best practices
Makefile: automate repeatable tasks using short commands
GitHub Actions: automate your CI/CD pipeline
pdoc: automatically create API documentation for your project

Usage
To download the template, start by installing Cookiecutter:

pip install cookiecutter

Create a project based on the template:

cookiecutter https://github.com/khuyentran1401/data-science-template

Try out the project by following these instructions.

In the following few sections, we will detail some valuable features of this
template.
Create a Readable Structure
The structure of the project created from the template is standardized and
easy to understand.

Here is a summary of the roles of these files:

.
├── data
│   ├── final                  # data after training the model
│   ├── processed              # data after processing
│   └── raw                    # raw data
├── docs                       # documentation for your project
├── .flake8                    # configuration for the code linter
├── .gitignore                 # files that should not be committed to Git
├── Makefile                   # commands to set up the environment
├── models                     # store models
├── notebooks                  # store notebooks
├── .pre-commit-config.yaml    # configuration for pre-commit
├── pyproject.toml             # dependencies for Poetry
├── README.md                  # describe your project
├── src                        # store source code
│   ├── __init__.py            # make src a Python module
│   ├── config.py              # store configs
│   ├── process.py             # process data before training the model
│   ├── run_notebook.py        # run notebook
│   └── train_model.py         # train model
└── tests                      # store tests
    ├── __init__.py            # make tests a Python module
    ├── test_process.py        # test functions in process.py
    └── test_train_model.py    # test functions in train_model.py
Efficiently Manage Dependencies
Poetry is a Python dependency management tool and is an alternative to
pip.

With Poetry, you can:

Separate the main dependencies and the sub-dependencies into two separate files (instead of storing all dependencies in requirements.txt)
Remove all unused sub-dependencies when removing a library
Avoid installing new packages that conflict with the existing packages
Package your project in several lines of code

and more.

Find instructions on how to install Poetry here.
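
As a quick sketch of the day-to-day workflow (assuming Poetry 1.2 or later; the package names are just examples), the standard commands look like this:

# add a main dependency, recorded in pyproject.toml
poetry add pandas

# add a development-only dependency to a separate group
poetry add --group dev pytest

# remove a library together with its unused sub-dependencies
poetry remove pandas

# install the exact versions pinned in poetry.lock
poetry install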


Create Short Commands for Repeatable Tasks
Makefile allows you to create short and readable commands for tasks. You
can use Makefile to automate tasks such as setting up the environment:

initialize_git:
	@echo "Initializing git..."
	git init

install:
	@echo "Installing..."
	poetry install
	poetry run pre-commit install

activate:
	@echo "Activating virtual environment"
	poetry shell

download_data:
	@echo "Downloading data..."
	wget https://gist.githubusercontent.com/khuyentran1401/a1abde0a7d27d31c7dd08f34a2c29d8f/raw/da2b0f2c9743e102b9dfa6cd75e94708d01640c9/Iris.csv -O data/raw/iris.csv

setup: initialize_git install download_data


Now, whenever others want to set up the environment for your project, they just need to run the following:

make setup
make activate

And a series of commands will be run! View the full Makefile here.
Rerun Only Modified Components of a Pipeline
Make is also useful when you want to rerun a task whenever its dependencies are modified. As an example, let’s capture the dependencies between the data files and scripts in a Makefile:
data/processed/xy.pkl: data/raw src/process.py
	@echo "Processing data..."
	python src/process.py

models/svc.pkl: data/processed/xy.pkl src/train_model.py
	@echo "Training model..."
	python src/train_model.py

pipeline: data/processed/xy.pkl models/svc.pkl

To create the file models/svc.pkl, you can run:

make models/svc.pkl

Since data/processed/xy.pkl and src/train_model.py are the prerequisites of the models/svc.pkl target, make runs the recipes to create both data/processed/xy.pkl and models/svc.pkl.

Processing data...
python src/process.py

Training model...
python src/train_model.py

If there are no changes in the prerequisites of models/svc.pkl, make will skip updating models/svc.pkl.

$ make models/svc.pkl
make: `models/svc.pkl' is up to date.
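
This also means editing one script reruns only the steps that depend on it. For instance, using touch to simulate a change to the training script:

# pretend the training script changed; only the training step reruns
touch src/train_model.py
make models/svc.pkl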
Observe and Automate Your Code
This template leverages Prefect to observe all your runs from the Prefect UI.

Among other things, Prefect can help you:

Retry when your code fails
Schedule your code runs
Send notifications when your flow fails

You can access these features by simply turning your function into a Prefect
flow.

from prefect import flow

# Location and ProcessConfig are Pydantic models defined in src/config.py
from src.config import Location, ProcessConfig

@flow
def process(
    location: Location = Location(),
    config: ProcessConfig = ProcessConfig(),
):
    ...
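
The retry and scheduling features mentioned above are exposed through the same decorator. A minimal sketch (the retry values here are illustrative, not taken from the template):

from prefect import flow

# retry this flow up to 3 times, waiting 10 seconds between attempts
@flow(retries=3, retry_delay_seconds=10)
def train():
    ...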
Enforce Type Hints At Runtime
Pydantic is a Python library that validates data using type annotations.

Pydantic models enforce data types on flow parameters and validate their
values when a flow run is executed.
If the value of a field doesn’t match the type annotation, you will get an
error at runtime:

process(config=ProcessConfig(test_size='a'))

pydantic.error_wrappers.ValidationError: 1 validation error for ProcessConfig
test_size
  value is not a valid float (type=type_error.float)

All Pydantic models are in the src/config.py file.
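
As an illustration of what such a model can look like, here is a hedged sketch (the fields and defaults are assumptions, not the template’s exact code):

from typing import List

from pydantic import BaseModel

class ProcessConfig(BaseModel):
    # fraction of the data held out for testing; must parse as a float
    test_size: float = 0.2
    # feature columns to drop before training
    drop_columns: List[str] = []

Because test_size is annotated as a float, passing test_size='a' fails validation, while a coercible value such as '0.3' is converted automatically.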


Detect Issues in Your Code Before Committing
Before committing your Python code to Git, you need to make sure your
code:

passes unit tests
is organized
conforms to best practices and style guides
is documented

However, manually checking these criteria before committing your code can
be tedious. pre-commit is a framework that allows you to identify issues in
your code before committing it.
You can add different plugins to your pre-commit pipeline. When you commit, your files are validated against these plugins, and unless all checks pass, no code will be committed.
You can find all plugins used in this template in this .pre-commit-config.yaml file.
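
You can also run every hook on demand instead of waiting for a commit; pre-commit’s standard CLI supports this:

# run all configured hooks against every file in the repository
poetry run pre-commit run --all-files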
Automatically Document Your Code
Data scientists often collaborate with other team members on a project.
Thus, it is essential to create good documentation for the project.

To create API documentation based on the docstrings of your Python files and objects, run:

make docs_view

Output:

Save the output to docs...

pdoc src --http localhost:8080
Starting pdoc server on localhost:8080
pdoc server ready at http://localhost:8080

Now you can view the documentation at http://localhost:8080.
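
pdoc builds these pages from ordinary docstrings, so documenting a function is all it takes. A minimal sketch (the function below is hypothetical, not from the template):

import pandas as pd

def get_features(data: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return only the specified feature columns.

    Args:
        data: the processed dataset
        columns: names of the columns to keep
    """
    return data[columns]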


Automatically Run Tests
GitHub Actions allows you to automate your CI/CD pipelines, making it
faster to build, test, and deploy your code.

When you create a pull request on GitHub, the tests in your tests folder will automatically run.
View the code for this workflow.
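
A workflow that produces this behavior might look like the sketch below (the file path, action versions, and Python version are assumptions; the template’s actual workflow may differ):

# .github/workflows/test.yaml (illustrative)
name: Run tests
on: pull_request

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.9"
      - name: Install dependencies
        run: |
          pip install poetry
          poetry install
      - name: Run tests
        run: poetry run pytest tests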
Conclusion
Congratulations! You have just learned how to use a template to create a reusable and maintainable ML project. This template is meant to be flexible, so feel free to adjust the project to fit your applications.
