DVC4ML
What is DVC?
● Simple command line Git-like experience.
○ Does not require installing and maintaining any databases.
○ Does not depend on any proprietary online services.
● Makes projects reproducible and shareable; answers questions on how a model was built.
“DVC aims to replace spreadsheet and document sharing tools (such as Excel or Google Docs)
which are being used frequently as both knowledge repositories and team ledgers.
DVC also replaces both ad-hoc scripts to track, move, and deploy different model versions; as
well as ad-hoc data file suffixes and prefixes.”
Bring data to local workspace
[Diagram: a 10 GB dataset enters the local workspace; Git tracks only small metadata in the local repository, while the data itself lives in the local cache.]
Simplify team collaboration
Image source: https://dvc.org/doc/use-cases/data-and-model-files-versioning
Getting Started
Install
● Install as a python package.
● Depending on the remote storage you will use, you may need to install extra dependencies (e.g. dvc[s3]).
● Initializing a DVC project creates a few important files and automatically stages them with git add.
$ git commit -m "Initialize dvc project"
(.dvcignore tells dvc what not to track, empty for now)
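A minimal install-and-init sequence might look like this (the dvc[s3] extra is one example of a storage-specific dependency; pick the extra matching your remote):

```
# Install DVC as a Python package; add an extra for your remote storage,
# e.g. "dvc[s3]" for Amazon S3 or "dvc[gdrive]" for Google Drive.
$ pip install "dvc[s3]"

# Inside an existing git repository:
$ dvc init

# dvc init creates and stages .dvc/config, .dvc/.gitignore, and .dvcignore
$ git commit -m "Initialize dvc project"
```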
Data Versioning
Getting some data (use "tutorials/versioning/data.zip" instead)
● Let’s download some data to train and validate a “cat VS dog” CNN classifier.
We use dvc get, which works like wget, to download data/models from a remote DVC repo.
data
├── train
│ ├── dogs # 500 pictures
│ └── cats # 500 pictures
└── validation
├── dogs # 400 pictures
└── cats # 400 pictures
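The download step itself can be sketched as follows (repo URL and in-repo path as named on the slide; the unzip step is an assumption about how the archive is unpacked):

```
# dvc get downloads a file tracked in a remote DVC repo, like wget
$ dvc get https://github.com/iterative/dataset-registry \
          tutorials/versioning/data.zip
$ unzip -q data.zip && rm -f data.zip
```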
$ dvc add data/          (also tells git not to track the data/ directory)
100% Add|██████████|1/1 [00:30, 30.51s/file]
To track the changes with git, run:
        git add .gitignore data.dvc
$ git add .gitignore data.dvc
$ git commit -m "Add first version of data/"
$ git tag -a "v1.0" -m "data v1.0, 1000 images"
The DVC-generated data.dvc contains the hash used to track data/ :
outs:
- md5: b8f4d5a78e55e88906d5f4aeaf43802e.dir
  path: data
→ human readable, can be versioned with git!
○ DVC updates .gitignore to tell Git not to track the content of data/
○ The physical content of data/ —i.e. the jpg images— has been moved to a cache
(by default the cache is located in .dvc/cache/ but using a remote cache is possible!)
○ The files were linked back to the workspace so that it looks like nothing happened
(the user can configure the link type to use: hard link, soft link, reflink, copy)
Francesco Casalegno – DVC 10
Make changes to tracked data (add) — use "tutorials/versioning/new-labels.zip"
● Let’s download some more data for our “cat VS dog” dataset.
Running dvc diff will confirm that dvc is aware the data has changed!
● To track the changes in our data with dvc, we follow the same procedure as before.
$ dvc checkout
M data/
$ dvc status
Data and pipelines are up to date.
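Putting the second-version cycle together, it is the same add/commit/tag sequence as before (the unpack step, tag name, and image count are illustrative assumptions):

```
# Hypothetical unpack of the new labels into the tracked directory
$ unzip -q new-labels.zip -d data/ && rm -f new-labels.zip

$ dvc diff                       # confirms dvc sees the changed data
$ dvc add data/                  # recompute hashes, update data.dvc
$ git add data.dvc
$ git commit -m "Add second version of data/"
$ git tag -a "v2.0" -m "data v2.0"
```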
Configure remote storage
● A remote storage is for DVC what GitHub is for Git:
○ push and pull files from your workspace to the remote
○ easy sharing between developers
○ safe backup should you ever make a terrible mistake à la rm -rf *
● Many remote storages are supported (Google Drive, Amazon S3, Google Cloud, SSH, HDFS, HTTP, …)
But (as with Git) nothing prevents us from using a “local remote”!
$ mkdir -p ~/tmp/dvc_storage
$ dvc remote add --default loc_remote ~/tmp/dvc_storage
Setting 'loc_remote' as a default remote.
Resulting .dvc/config:
[core]
    remote = loc_remote
['remote "loc_remote"']
    url = /root/tmp/dvc_storage
Storing, sharing, retrieving from storage
● Running dvc push uploads the content of the cache to the remote storage.
This is pretty much like git push.
$ dvc push
1800 files pushed
● Now, even if all the data is deleted from our workspace and cache, we can download it with dvc pull.
This is pretty much like git pull.
● When working outside of a DVC project (e.g. in automated ML model deployment), use dvc get.
● When working inside of another DVC project, use dvc import, which keeps the connection between the projects.
In this way, others can know where the data comes from and whether new versions are available.
● Note. For all these commands we can specify a git revision (sha, branch, or tag) with --rev <commit>.
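A sketch of these access patterns side by side (registry URL and path reuse the public dataset-registry repo from earlier; the v1.0 revision is illustrative):

```
# Share: upload the cache to the default remote
$ dvc push

# Restore: download into the cache and link back into the workspace
$ dvc pull

# Outside any DVC project (e.g. deployment): plain download
$ dvc get --rev v1.0 \
      https://github.com/iterative/dataset-registry get-started/data.xml

# Inside another DVC project: download AND record the source in a .dvc file
$ dvc import --rev v1.0 \
      https://github.com/iterative/dataset-registry get-started/data.xml
```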
Data Registries
● We can build a DVC project dedicated only to tracking and versioning datasets and models.
The repository would have all the metadata and history of changes in the different datasets.
● But versioning large data files is not all DVC can do:
DVC data pipelines capture how data is filtered, transformed, and used to train models!
[Pipeline diagram: data.xml → data_load.py (Load Data) → data_split.py (Split Data) → train.tsv / test.tsv → train.py (Train) → train.pkl / test.pkl → model.pkl → eval.py (Model Evaluation)]
Source: Alex Kim, Optimizing Image Segmentation Projects with DVC, Iterative.ai
Use any executable script as a stage job
[Pipeline diagram: raw_data.csv → data_load → train_data.csv / test_data.csv → feature_extraction → train → Model → evaluate → Evaluation Report]
● DVC tracks both artifacts and pipelines.
Tracking ML Pipelines
● Option A: run pipeline stages, then track output artifacts with dvc add
$ python src/prepare.py data/data.xml
$ dvc add data/prepared/train.tsv data/prepared/test.tsv
● We download some data plus some code to prepare the data and train/evaluate a model.
$ dvc get https://github.com/iterative/dataset-registry get-started/data.xml \
-o data/data.xml
$ dvc add data/data.xml
$ git add data/.gitignore data/data.xml.dvc && git commit -m "Add data, first version"
$ wget https://code.dvc.org/get-started/code.zip
$ unzip code.zip && rm -f code.zip
$ tree
.
├── data
│   ├── data.xml
│   └── data.xml.dvc
├── params.yaml
└── src                      (pipeline steps)
    ├── evaluate.py
    ├── featurization.py
    ├── prepare.py
    ├── requirements.txt
    └── train.py
params.yaml is a YAML file with params for all the stages:
prepare:
  split: 0.20
  seed: 20170428
featurize:
  max_features: 500
  ngrams: 1
train:
  seed: 20170428
  n_estimators: 50
Example (use "dvc stage add --run -v -f" instead!)
dvc.yaml (created and updated by DVC commands like dvc run; describes data pipelines, similar to how Makefiles work for building software):
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
    - data/data.xml
    - src/prepare.py
    params:
    - prepare.seed
    - prepare.split
    outs:
    - data/prepared
dvc.lock (matches the dvc.yaml file; describes the latest pipeline state to 1. track intermediate and final artifacts, like a .dvc file, and 2. allow DVC to detect when stage definitions or dependencies changed, triggering a re-run):
prepare:
  cmd: python src/prepare.py data/data.xml
  deps:
  - path: data/data.xml
    md5: a304afb96060aad90176268345e10355
  - path: src/prepare.py
    md5: 285af85d794bb57e5d09ace7209f3519
  params:
    params.yaml:
      prepare.seed: 20170428
      prepare.split: 0.2
  outs:
  - path: data/prepared
    md5: 20b786b6e6f80e2b3fcf17827ad18597.dir
● Note: dependencies and artifacts are automatically tracked, no need to dvc add them!
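Once dvc.yaml exists, the whole pipeline can be reproduced with dvc repro, which re-runs only the stages whose dependencies, parameters, or definitions changed (a minimal sketch of the reproduce-and-commit cycle):

```
$ dvc repro          # re-run out-of-date stages and update dvc.lock
$ dvc dag            # print the stage dependency graph
$ git add dvc.yaml dvc.lock
$ git commit -m "Run pipeline"
```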
Example (use "dvc stage add --run -v -f" instead)
● Then we run the featurize and train stages in the same way.
$ dvc run -n featurize \
          -p featurize.max_features \
          -p featurize.ngrams \
          -d src/featurization.py \
          -d data/prepared \
          -o data/features \
          python src/featurization.py data/prepared data/features
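The train stage would follow the same pattern; this sketch assumes the stage names, parameters, and file paths shown in the tree and params.yaml earlier:

```
$ dvc run -n train \
          -p train.seed \
          -p train.n_estimators \
          -d src/train.py \
          -d data/features \
          -o model.pkl \
          python src/train.py data/features model.pkl
```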
https://iterative.ai/blog/DVC-VS-Code-extension
Comparing experiments
● dvc params diff rev_1 rev_2 shows how parameters differ in two different git revisions/tags.
Without arguments, it shows how they differ in workspace vs. last commit.
● dvc metrics diff rev_1 rev_2 does the same for metrics.
● dvc plots diff rev_1 rev_2 does the same for plots.
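For example, with and without explicit revisions (the v1.0/v2.0 tags here are illustrative; any git sha, branch, or tag works):

```
$ dvc params diff                # workspace vs. last commit
$ dvc params diff v1.0 v2.0      # two git revisions
$ dvc metrics diff v1.0 v2.0
$ dvc plots diff v1.0 v2.0
```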
https://iterative.ai/blog/DVC-VS-Code-extension
All experiments are versioned
Shared Development Server
● Disk space optimization
Avoid having 1 cache per user!
$ mkdir -p path_shared_cache/
$ mv .dvc/cache/* path_shared_cache/
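After moving the cache, each user's project must be pointed at the shared location; DVC's cache settings make this a one-time configuration (the path is illustrative):

```
$ dvc cache dir path_shared_cache/    # point this project at the shared cache
$ dvc config cache.shared group       # make cache files group-writable
$ dvc config cache.type symlink       # link into workspaces instead of copying
```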
Conclusions
● DVC is a version control system for large ML data files and artifacts.
● DVC integrates with Git through *.dvc and dvc.lock files to version files and pipelines, respectively.
● DVC repos can work as data registries, i.e. a middleware between cloud storage and ML projects.
● To track raw ML data files (e.g. an input dataset), use dvc add.
● To track intermediate or final results of an ML pipeline (e.g. model weights, processed datasets), use dvc run.
● Consider using a shared development server with a unified, shared external cache.