Data Version Control (software)

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Conscious AI (talk | contribs) at 11:21, 6 October 2022 (DVC and Git were added). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.
DVC
Original author(s): Dmitry Petrov
Developer(s): Iterative.ai
Initial release: May 4, 2017; 5 years ago
Stable release: 2.24.0 / September 19, 2022; 16 days ago
Repository: github.com/iterative/dvc
Available in: Python
License: Apache-2.0
Website: https://dvc.org/

DVC is a free and open-source, platform-agnostic version control system for data, ML models, and experiments.[1] It is designed to make ML models shareable and experiments reproducible, and to track versions of models, data, and pipelines.[2][3][4][5]

DVC works on top of Git repositories and cloud storage.[6][7]

The first (beta) version of DVC (DVC 0.6) was launched in May 2017. In May 2020, DVC 1.0 was publicly released by Iterative.ai.[8][9]

Overview

DVC is designed to bring the best practices of software development into machine learning workflows.[10] It does this by extending Git, a traditional software development tool, with cloud storage for datasets and ML models.[11]

Specifically, DVC makes machine learning operations:

  • Codified: it codifies datasets and models by storing pointers to the data files in cloud storage.[4]
  • Reproducible: it makes it easy for users to reproduce experiments and to rebuild datasets from raw data.[12][13] These features also allow users to automate the construction of datasets and the training, evaluation, and deployment of ML models.[14]

DVC and Git

DVC stores large files and datasets in separate storage, outside of Git.[4] This storage can be on the user's computer or hosted on any major cloud storage provider,[15][5] such as AWS S3, Google Cloud Storage, and Microsoft Azure Blob Storage.[16][17][18] DVC users may also set up a remote repository on any server and connect to it remotely.[4]

When a user stores their data and models in the remote repository, a text file is created in their Git repository that points to the actual data in remote storage.[2][19]
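The pointer-file idea can be illustrated with a minimal Python sketch. This is not DVC's implementation or its real file format: the helper names `track` and `restore`, the `.pointer` suffix, the `key: value` layout, and the use of SHA-256 are all assumptions made for illustration. The sketch copies a data file into a separate storage directory under its content hash and leaves behind a small text file that records where the data can be found.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, storage_dir: Path) -> Path:
    """Copy data_file into storage_dir under its hash and write a pointer file.

    Hypothetical helper: the '.pointer' suffix and the 'key: value' layout
    are invented for this sketch and are not DVC's real file format.
    """
    checksum = file_sha256(data_file)
    storage_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(data_file, storage_dir / checksum)
    pointer = data_file.with_name(data_file.name + ".pointer")
    pointer.write_text(f"sha256: {checksum}\npath: {data_file.name}\n")
    return pointer


def restore(pointer: Path, storage_dir: Path) -> Path:
    """Recreate the data file next to its pointer by copying it from storage."""
    fields = dict(
        line.split(": ", 1) for line in pointer.read_text().splitlines()
    )
    target = pointer.with_name(fields["path"])
    shutil.copy2(storage_dir / fields["sha256"], target)
    return target


def demo() -> bool:
    """Track a file, delete it, restore it, and report whether the bytes match."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp) / "repo"
        storage = Path(tmp) / "storage"
        repo.mkdir()
        data = repo / "train.csv"
        data.write_bytes(b"x,y\n1,2\n")
        pointer = track(data, storage)
        data.unlink()  # simulate a fresh checkout that has only the pointer
        return restore(pointer, storage).read_bytes() == b"x,y\n1,2\n"


if __name__ == "__main__":
    print(demo())
```

In this sketch only the small pointer file would be committed to the Git repository, while the data itself stays in the storage directory, which stands in for local or cloud storage.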

Features

References

  1. ^ Hewage, Nipuni; Meedeniya, Dulani (February 2022). "Machine Learning Operations: A Survey on MLOps Tool Support". ResearchGate.
  2. ^ a b Barrak, Amine; Eghan, Ellis E.; Adams, Bram (March 2021). "On the Co-evolution of ML Pipelines and Source Code - Empirical Study of DVC Projects". IEEE Xplore.
  3. ^ Wiggers, Kyle. "MLOps startup Iterative.ai nabs $20M". VentureBeat.
  4. ^ a b c d Ivancic, Kristijan. "Data Version Control With Python and DVC". Real Python.
  5. ^ a b "MLOps Company Iterative Achieves Significant Customer and Company Growth in 2021". Business Wire.
  6. ^ Hall, Susan. "Iterative.ai: Git-Based Machine Learning Tools for ML Engineers". The New Stack.
  7. ^ "What is DVC?". MLOps Guide.
  8. ^ Petrov, Dmitry. "DVC 3 Years and 1.0 Pre-release". Iterative.ai.
  9. ^ Anadiotis, George. "Streamlining data science with open source: Data version control and continuous machine learning". ZDNET.
  10. ^ Petrov, Dmitry. "The Road to AI Hell Starts with Good MLOps Intentions". The New Stack.
  11. ^ Lardinois, Frederic. "Iterative raises $20M for its MLOps platform". TechCrunch.
  12. ^ "AITech interview with Dmitry Petrov, Co-Founder & CEO at Iterative.ai". AI Tech Park.
  13. ^ "Data Versioning for CD4ML – Part 2". AI Singapore.
  14. ^ Baena, Daniel. "How to build an efficient Machine Learning project workflow using Data Version Control (DVC)". Rappi Tech.
  15. ^ "DVC Documentation. remote add". dvc.org/doc.
  16. ^ Vizard, Michael. "Iterative.ai updates MLOps platform to streamline and support cloud provisioning". VentureBeat.
  17. ^ Kulkarni, Amit. "Tracking ML Experiments With Data Version Control". Analytics Vidhya.
  18. ^ Vergara, Ryan. "How to Get Started with Data Version Control (DVC)". HackerNoon.
  19. ^ Tran, Khuyen. "Introduction to DVC: Data Version Control Tool for Machine Learning Projects". Towards Data Science.