
Data Version Control (software)

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Conscious AI (talk | contribs) at 13:25, 6 October 2022 (Features. Data management and pipelines were added). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.
DVC
Original author(s): Dmitry Petrov
Developer(s): Iterative.ai
Initial release: May 4, 2017; 5 years ago
Stable release: 2.24.0 / September 19, 2022; 16 days ago
Repository: github.com/iterative/dvc
Available in: Python
License: Apache-2.0
Website: https://dvc.org/

DVC is a free and open-source, platform-agnostic version control system for data, ML models, and experiments.[1] It is designed to make ML models shareable, to make experiments reproducible, and to track versions of models, data, and pipelines.[2][3][4][5]

DVC works on top of Git repositories and cloud storage.[6][7]

The first (beta) version of DVC (DVC 0.6) was launched in May 2017. In May 2020, DVC 1.0 was publicly released by Iterative.ai.[8][9]

Overview

DVC is designed to incorporate the best practices of software development into machine learning workflows.[10] It does this by extending Git, a traditional software development tool, with cloud storage for datasets and ML models.[11]

Specifically, DVC makes machine learning operations:

  • Codified: it codifies datasets and models by storing pointers to the data files in cloud storage.[4]
  • Reproducible: it makes it easy for users to reproduce experiments and to rebuild datasets from raw data.[12][13] These features also make it possible to automate the construction of datasets and the training, evaluation, and deployment of ML models.[14]

DVC and Git

DVC stores large files and datasets in separate storage, outside of Git.[4] This storage can be on the user’s computer or hosted on any major cloud storage provider,[15][5] such as AWS S3, Google Cloud Storage, and Microsoft Azure Blob Storage.[16][17][18] DVC users may also set up a remote repository on any server and connect to it remotely.[4]

When a user stores their data and models in the remote repository, a text file is created in their Git repository that points to the actual data in remote storage.[2][19]
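The idea of a small text file standing in for a large data file can be illustrated with a minimal Python sketch. This is a conceptual illustration only, not DVC's implementation: the function names, the `.pointer.json` suffix, the JSON layout, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path: Path) -> Path:
    """Write a small text file that identifies data_path by its content hash.

    The '.pointer.json' suffix and the JSON fields are hypothetical;
    they are not DVC's metafile format.
    """
    pointer = {
        "path": data_path.name,
        "size": data_path.stat().st_size,
        "sha256": file_sha256(data_path),
    }
    pointer_path = data_path.with_name(data_path.name + ".pointer.json")
    pointer_path.write_text(json.dumps(pointer, indent=2) + "\n")
    return pointer_path
```

In such a scheme, only the pointer file would be committed to Git, while the data file itself would be uploaded to separate storage and looked up by its hash.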

Features

DVC's features can be divided into three categories: data management, pipelines, and experiment tracking.[20][21][17]

Data Management

Data and model versioning is the base layer[22] of DVC for large files, datasets, and machine learning models. It allows the use of a standard Git workflow, but without the need to store those files in the repository. Large files, directories and ML models are replaced with small metafiles, which in turn point to the original data. Data is stored separately, allowing data scientists to transfer large datasets or share a model with others.[6]

DVC enables data versioning through codification.[23] When a user creates metafiles describing which datasets, ML artifacts, and other features to track, DVC can capture versions of data and models, create and restore snapshots, record evolving metrics, and switch between versions.[6]

Unique versions of data files and directories are cached[24] in a systematic way (also preventing file duplication). The working datastore is separated from the user’s workspace to keep the project light, but stays connected via file links handled automatically by DVC.[25]
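A cache that stores each unique version once and links it back into the workspace can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not DVC's code: the function name `cache_file`, the directory layout (two-character hash prefix), the use of SHA-256, and the choice of hard links with a copy fallback are all hypothetical.

```python
import hashlib
import os
import shutil
from pathlib import Path


def cache_file(workspace_file: Path, cache_dir: Path) -> Path:
    """Store a file in a content-addressed cache and link it back.

    Files with identical content map to the same cache entry, so a
    second copy does not take additional space in the cache.
    """
    content_hash = hashlib.sha256(workspace_file.read_bytes()).hexdigest()
    cached = cache_dir / content_hash[:2] / content_hash[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)

    if cached.exists():
        # Content is already cached: drop the duplicate from the workspace.
        workspace_file.unlink()
    else:
        shutil.move(str(workspace_file), str(cached))

    try:
        # Re-create the workspace path as a hard link to the cached file.
        os.link(cached, workspace_file)
    except OSError:
        # Fall back to a plain copy where hard links are unavailable.
        shutil.copy2(cached, workspace_file)
    return cached
```

Because the cache path is derived from the file content, caching two identical files yields a single cache entry, which is how this kind of design avoids duplication.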

Pipelines

DVC provides a mechanism to define and execute pipelines.[26][27] Pipelines represent the process of building ML datasets and models, from how data is preprocessed to how models are trained and evaluated.[28] Pipelines can also be used to deploy models into production environments.

A DVC pipeline is focused on the experimentation phase of the ML process. Users can run multiple copies of a DVC pipeline by cloning a Git repository that contains the pipeline or by running ML experiments. They can also record a workflow as a pipeline and reproduce it in the future.

Pipelines are represented in code as YAML configuration files.[29] These files define the stages of the pipeline and how data and information flow from one stage to the next.
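How stage definitions imply an execution order can be shown with a short Python sketch. A Python dictionary stands in here for the YAML file, and the stage names, commands, and file paths are invented for illustration. The sketch assumes that a stage must run after any stage that produces one of its dependencies; it is not DVC's scheduler.

```python
from graphlib import TopologicalSorter

# Hypothetical stages: each has a command, input files, and output files.
STAGES = {
    "prepare": {
        "cmd": "python prepare.py",
        "deps": ["data/raw.csv"],
        "outs": ["data/clean.csv"],
    },
    "train": {
        "cmd": "python train.py",
        "deps": ["data/clean.csv"],
        "outs": ["model.pkl"],
    },
    "evaluate": {
        "cmd": "python evaluate.py",
        "deps": ["model.pkl", "data/clean.csv"],
        "outs": ["metrics.json"],
    },
}


def execution_order(stages: dict) -> list:
    """Order stages so that every producer runs before its consumers."""
    # Map each output file to the stage that produces it.
    producers = {
        out: name for name, stage in stages.items() for out in stage["outs"]
    }
    # For each stage, collect the stages whose outputs it depends on.
    graph = {
        name: {producers[dep] for dep in stage["deps"] if dep in producers}
        for name, stage in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())
```

For the stages above, `execution_order(STAGES)` returns `["prepare", "train", "evaluate"]`, since training needs the cleaned data and evaluation needs the trained model.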

When a pipeline is run, the artifacts produced by that pipeline are registered in a dvc.lock file.[30] The lockfile records the stages that were run and stores a hash of the resulting output for each stage. It serves as a record of the pipeline's execution, and it also helps determine which steps must be rerun on subsequent executions of the pipeline.[28][20]
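The way recorded hashes can be used to decide which stages need rerunning can be sketched in Python. This is a simplified illustration, not the dvc.lock format or DVC's logic: the function names and the in-memory lock structure are hypothetical, and the sketch only compares the hashes of each stage's own input files without propagating changes to downstream stages.

```python
import hashlib
from pathlib import Path
from typing import Optional


def hash_file(path: Path) -> Optional[str]:
    """Return the hex SHA-256 digest of a file, or None if it is missing."""
    if not path.is_file():
        return None
    return hashlib.sha256(path.read_bytes()).hexdigest()


def record_lock(stages: dict, root: Path) -> dict:
    """Record the current hash of every dependency of every stage."""
    return {
        name: {"deps": {dep: hash_file(root / dep) for dep in stage["deps"]}}
        for name, stage in stages.items()
    }


def stale_stages(stages: dict, lock: dict, root: Path) -> list:
    """Return the stages whose dependency hashes differ from the lock."""
    stale = []
    for name, stage in stages.items():
        current = {dep: hash_file(root / dep) for dep in stage["deps"]}
        if lock.get(name, {}).get("deps") != current:
            stale.append(name)
    return stale
```

A stage whose inputs are unchanged since the lock was recorded is skipped, while a stage with a modified input, or with no lock entry at all, is reported as needing a rerun.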

References

  1. ^ Hewage, Nipuni; Meedeniya, Dulani (February 2022). "Machine Learning Operations: A Survey on MLOps Tool Support". ResearchGate.
  2. ^ a b Barrak, Amine; Eghan, Ellis E.; Adams, Bram (March 2021). "On the Co-evolution of ML Pipelines and Source Code - Empirical Study of DVC Projects". IEEE Xplore.
  3. ^ Wiggers, Kyle. "MLOps startup Iterative.ai nabs $20M". VentureBeat.
  4. ^ a b c d Ivancic, Kristijan. "Data Version Control With Python and DVC". Real Python.
  5. ^ a b "MLOps Company Iterative Achieves Significant Customer and Company Growth in 2021". Business Wire.
  6. ^ a b c Hall, Susan. "Iterative.ai: Git-Based Machine Learning Tools for ML Engineers". The New Stack.
  7. ^ "What is DVC?". MLOps Guide.
  8. ^ Petrov, Dmitry. "DVC 3 Years and 1.0 Pre-release". Iterative.ai.
  9. ^ Anadiotis, George. "Streamlining data science with open source: Data version control and continuous machine learning". ZDNET.
  10. ^ Petrov, Dmitry. "The Road to AI Hell Starts with Good MLOps Intentions". The New Stack.
  11. ^ Lardinois, Frederic. "Iterative raises $20M for its MLOps platform". TechCrunch.
  12. ^ "AITech interview with Dmitry Petrov, Co-Founder & CEO at Iterative.ai". AI Tech Park.
  13. ^ "Data Versioning for CD4ML – Part 2". AI Singapore.
  14. ^ Baena, Daniel. "How to build an efficient Machine Learning project workflow using Data Version Control (DVC)". Rappi Tech.
  15. ^ "DVC Documentation. remote add". dvc.org/doc.
  16. ^ Vizard, Michael. "Iterative.ai updates MLOps platform to streamline and support cloud provisioning". VentureBeat.
  17. ^ a b Kulkarni, Amit. "Tracking ML Experiments With Data Version Control". Analytics Vidhya.
  18. ^ Vergara, Ryan. "How to Get Started with Data Version Control (DVC)". HackerNoon.
  19. ^ Tran, Khuyen. "Introduction to DVC: Data Version Control Tool for Machine Learning Projects". Towards Data Science.
  20. ^ a b "Introduction to Data Version Control(DVC)". Kaggle.
  21. ^ Guerrapin, Basile. "Using DVC to create an efficient version control system for data projects". The Qonto Way.
  22. ^ "DVC Documentation. Get Started". dvc.org/doc.
  23. ^ "DVC Documentation. Versioning Data and Models". dvc.org/doc.
  24. ^ "DVC Documentation. Internal Directories and Files". dvc.org/doc.
  25. ^ "DVC Documentation. Large Dataset Optimization". dvc.org/doc.
  26. ^ "Working with Pipelines". MLOps Guide.
  27. ^ "DVC Documentation. Get Started: Data Pipelines". dvc.org/doc.
  28. ^ a b Idowu, Samuel; Strüber, Daniel; Berger, Thorsten. "Asset Management in Machine Learning: A Survey". Astrophysics Data System (ADS).
  29. ^ "DVC Documentation. dvc.yaml". dvc.org/doc.
  30. ^ "DVC Documentation. dvc.lock file". dvc.org/doc.