DVC is a tool for data version control and reproducible machine learning workflows. Users initialize a DVC environment, add files under DVC control, run commands to generate outputs, and reproduce the pipeline when its dependencies change. Data files move between a local cache and remote storage with push and pull. Common commands include dvc init, dvc add, dvc run, dvc repro, dvc push, dvc pull and dvc status.
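The basic track-and-share flow can be sketched as a short shell script. This is a minimal dry-run sketch, not a definitive recipe: the `run` and `track_and_share` helpers, the file name `data.csv`, and the remote path `/tmp/dvc-storage` are all hypothetical. By default the script only prints the commands; setting `DRY_RUN=0` would execute them, which assumes DVC is installed and the current directory is a Git repository.

```shell
#!/bin/sh
# Dry-run sketch of the basic DVC workflow.
# Commands are printed, not executed, unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

# Print a command and, outside dry-run mode, execute it.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Initialize DVC, register a default remote, track one file, push it.
track_and_share() {
    data_file="$1"     # hypothetical data file
    remote_path="$2"   # hypothetical directory used as a local remote
    run dvc init
    run dvc remote add -d myremote "$remote_path"
    run dvc add "$data_file"
    run dvc push
}

track_and_share data.csv /tmp/dvc-storage
```

After `dvc add`, the generated .dvc file is typically committed to Git so that collaborators can retrieve the data with `dvc pull`.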
Cheat Sheet
https://github.com/iterative/dvc
https://dvc.org/chat

Basics

Initializing
Initialize a DVC environment
$ dvc init

Remote
Set up a remote to keep and share data files
$ dvc remote add -d myremote /path
*Possible remotes include local, s3, gs, azure, ssh, hdfs and http.
Show all available remotes
$ dvc remote list
Modify remote settings
$ dvc remote modify myremote
*Use if remote requires extra configuration

Adding Files
Add files under DVC control
$ dvc add filename
*Use --no-commit to stop adding the file to the cache.

Share Data
Push all data files to the remote storage
$ dvc push
Push outputs of a specific .dvc file
$ dvc push filename.dvc
Download files from the remote storage
$ dvc pull
Download files from a specific .dvc file
$ dvc pull filename.dvc
Checkout files from cache into working space
$ dvc checkout

The Pipeline
Add transformations and generate a stage file from a given command
$ dvc run -d dependencyfile \
          -o outputfile python command.py
*Use --file to specify the name of the generated .dvc file.
*Use --metrics to output a file containing the metric.

Metrics
Collect and display project metrics
$ dvc metrics show
*Use --all-branches to show the metrics in all branches.

Visualizing
Show stages in a pipeline
$ dvc pipeline show --ascii file.dvc
*Add --commands or --outs to show more detail.
Show connected pipelines of DVC stages
$ dvc pipeline list

Reproducing
Reproduce outputs defined in .dvc file
$ dvc repro filename.dvc
*Name a .dvc file “Dvcfile” to be used by dvc repro by default
Show changed stages in the pipeline
$ dvc status

Other Commands
Set/unset cache directory location
$ dvc cache dir /path
Commit outputs to cache
$ dvc commit
*Use if you specified --no-commit in dvc add/run/repro
Config repository or global options
$ dvc config
*Config the default remote using core.remote myremote
*Config core (loglevel, remote), cache and state settings
Fetch files from the remote to the local cache
$ dvc fetch file.dvc
Remove unused objects from cache
$ dvc gc
Import file from URL to local directory
$ dvc import url /path
*Supported schemes include local, s3, gs, azure, ssh, hdfs and http.
Remove data files tracked by dvc
$ dvc remove filename.dvc
Made by Carl Handlin based on the documentation for DVC at https://dvc.org/doc
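The pipeline commands from the cheat sheet (dvc run, dvc repro, dvc status) can be chained as in the sketch below. It is a dry-run illustration under stated assumptions: the `run` and `build_stage` helpers and the names `train.dvc`, `data.csv`, `model.pkl` and `train.py` are hypothetical, and the commands are only printed unless `DRY_RUN=0` is set. Executing them assumes a DVC release that still ships `dvc run`; newer releases replaced it with `dvc stage add`, so the syntax here may not apply to a current installation.

```shell
#!/bin/sh
# Dry-run sketch of defining and reproducing one pipeline stage.
# Commands are printed, not executed, unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

# Print a command and, outside dry-run mode, execute it.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Create a stage file with one dependency and one output,
# then reproduce it and list any stages that changed.
build_stage() {
    stage_file="$1"   # name passed to --file (hypothetical)
    dep="$2"          # dependency file (hypothetical)
    out="$3"          # output file (hypothetical)
    shift 3           # remaining arguments form the stage command
    run dvc run --file "$stage_file" -d "$dep" -o "$out" "$@"
    run dvc repro "$stage_file"
    run dvc status
}

build_stage train.dvc data.csv model.pkl python train.py
```

Calling `dvc repro` directly after `dvc run` should find nothing to redo; it becomes useful once the dependency file has been modified.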