DVC Cheatsheet

DVC (Data Version Control) is a system designed for managing machine learning projects, enabling users to track, share, and reproduce experiments. Key functionalities include initializing a repository, tracking experiments, defining pipelines, logging metrics, and managing remote storage. The cheat sheet provides commands for various operations, including adding files, pushing data, and visualizing pipelines.

DVC (Data Version Control) Cheat Sheet

DVC is a version control system for machine learning projects, allowing you to track, share, and reproduce your experiments.

---

1. Getting Started

Initialize DVC in a Repository:

dvc init

Add Files or Directories to DVC:

dvc add <file_or_directory>

Commit Changes:

Use Git to commit the generated .dvc file and the updated .gitignore:

git add <file>.dvc .gitignore

git commit -m "Track data with DVC"

Configure Remote Storage:

dvc remote add -d myremote <remote_storage_url>

Push Data to Remote Storage:

dvc push

Pull Data from Remote Storage:

dvc pull

---

2. Tracking Experiments

Run an Experiment:

dvc exp run

Track Parameters:

Specify parameters in a params.yaml file and link them to a pipeline stage with the -p option of dvc stage add (for example, -p learning_rate,batch_size).

Example params.yaml:

learning_rate: 0.01

batch_size: 32
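As a rough illustration of how a training script might consume these values, the sketch below parses flat `key: value` lines using only the standard library. The function name `parse_flat_params` is hypothetical, and the parser is a deliberate simplification: it does not handle nested YAML, so a real project would more likely use a YAML library.

```python
def parse_flat_params(text):
    """Parse flat `key: value` lines (a stand-in for a real YAML parser)."""
    params = {}
    for raw_line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        # Try int first, then float; otherwise keep the raw string.
        for cast in (int, float):
            try:
                value = cast(value)
                break
            except ValueError:
                continue
        params[key.strip()] = value
    return params


# Same content as the example params.yaml above.
example = "learning_rate: 0.01\nbatch_size: 32\n"
params = parse_flat_params(example)
print(params["learning_rate"], params["batch_size"])
```

In a script, the text would typically come from reading params.yaml from disk rather than from an inline string.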

---

3. Pipelines

Define a Pipeline Stage:

dvc stage add -n <stage_name> -d <dependency> -o <output> <command>

Example:

dvc stage add -n train -d train.py -d data.csv -o model.pkl python train.py
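The stage above only declares what train.py reads and writes; the script itself has to honor that contract. The following is a minimal sketch of such a script, assuming data.csv has a numeric column named `y` (a hypothetical column) and using a placeholder "model" that is just the column mean.

```python
import csv
import pickle
from pathlib import Path


def train(data_path="data.csv", model_path="model.pkl"):
    """Read the declared dependency and write the declared output."""
    with open(data_path, newline="") as f:
        rows = list(csv.DictReader(f))
    # Placeholder "model": the mean of the hypothetical numeric column `y`.
    values = [float(row["y"]) for row in rows]
    model = {"mean_y": sum(values) / len(values)}
    # Write the file that the stage declares with -o.
    Path(model_path).write_bytes(pickle.dumps(model))
    return model
```

A real train.py would call train() at the end of the file so that the stage command `python train.py` produces model.pkl.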

Visualize the Pipeline:

dvc dag

Run the Entire Pipeline:

dvc repro

---

4. Metrics and Plots

Log Metrics:

Use a metrics.json or similar file to store metrics:

{

"accuracy": 0.95,

"loss": 0.05

}
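A training or evaluation script usually writes this file itself. A minimal sketch, assuming the two metric names from the example and a hypothetical helper called `write_metrics`:

```python
import json
from pathlib import Path


def write_metrics(accuracy, loss, path="metrics.json"):
    """Write scalar metrics as a flat JSON object."""
    metrics = {"accuracy": accuracy, "loss": loss}
    Path(path).write_text(json.dumps(metrics, indent=2) + "\n")
    return metrics
```

Calling write_metrics(0.95, 0.05) at the end of a run would produce a file with the same two keys as the example.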

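Track the metrics file by declaring it as a metrics output of a stage (-M keeps the file out of the DVC cache so Git can track it):

dvc stage add -n <stage_name> -M metrics.json <command>

Show the tracked metrics:

dvc metrics show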

Visualize Plots:

Use DVC to generate plots from tracked data files:

dvc plots show <file>

---

5. Versioning Data

Check File Status:


dvc status

Stop Tracking a File (the data itself stays in the workspace):

dvc remove <file>.dvc

Checkout Specific Versions:

git checkout <commit_hash>

dvc checkout

---

6. Sharing Projects

Push Project to Git and DVC Remote:

git push

dvc push

Clone a Repository and Retrieve Data:

git clone <repo_url>

cd <repo_directory>

dvc pull

---

7. Useful Commands

Show Pipeline Stages:

dvc stage list


Remove Unused Cache (keeps data referenced in the current workspace):

dvc gc -w

Show Differences in Metrics:

dvc metrics diff

---

8. Remote Storage Options

DVC supports various remote storage backends:

- AWS S3: s3://bucket-name/path

- Google Drive: gdrive://<folder-id>

- Azure Blob Storage: azure://container-name/path

- SSH: ssh://user@server/path

- Local Directory: /path/to/storage

Configure remotes using:

dvc remote add -d <name> <url>
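The backend is selected by the URL scheme. As a small illustration (not part of DVC itself), the sketch below maps the schemes listed above to their backend names; `describe_remote` and `BACKENDS` are hypothetical names, and a Windows path such as C:\data would be misread as a scheme by this simple check.

```python
from urllib.parse import urlparse

# Schemes taken from the list of remote storage options above.
BACKENDS = {
    "s3": "AWS S3",
    "gdrive": "Google Drive",
    "azure": "Azure Blob Storage",
    "ssh": "SSH",
}


def describe_remote(url):
    """Return the backend name implied by a remote URL's scheme."""
    scheme = urlparse(url).scheme
    if not scheme:
        # No scheme at all: treat the value as a local directory path.
        return "Local Directory"
    return BACKENDS.get(scheme, "unknown (" + scheme + ")")


for url in ("s3://bucket-name/path", "ssh://user@server/path", "/path/to/storage"):
    print(url, "->", describe_remote(url))
```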

---

9. Useful Links

- Official Documentation: https://dvc.org/doc

- DVC GitHub: https://github.com/iterative/dvc
