Data Version Control (software)
DVC | |
---|---|
Original author(s) | Dmitry Petrov |
Developer(s) | Iterative.ai |
Initial release | May 4, 2017 |
Stable release | 2.24.0 / September 19, 2022 |
Repository | github.com/iterative/dvc |
Written in | Python |
License | Apache-2.0 |
Website | https://dvc.org/ |
DVC is a free and open-source, platform-agnostic version control system for data, ML models, and experiments.[1] It is designed to make ML models shareable and experiments reproducible, and to track versions of models, data, and pipelines.[2][3][4][5] DVC works on top of Git repositories and cloud storage.[6][7]
The first (beta) version of DVC (DVC 0.6) was launched in May 2017. In May 2020, DVC 1.0 was publicly released by Iterative.ai.[8][9]
Overview
DVC is designed to incorporate the best practices of software development into machine learning workflows.[10] It does this by extending Git, a traditional software development tool, with cloud storage for datasets and ML models.[11]
Specifically, DVC makes machine learning operations:
- Codified: it codifies datasets and models by storing pointers to the data files, which are kept in cloud storage.[4]
- Reproducible: it lets users reproduce experiments and rebuild datasets from raw data.[12][13] These features also make it possible to automate the construction of datasets and the training, evaluation, and deployment of ML models.[14]
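The idea behind reproducibility can be illustrated with a minimal Python sketch. It is not DVC's implementation or file format. The function names `file_digest` and `run_stage`, the JSON lock file, and the use of SHA-256 are all assumptions made for this illustration. The sketch reruns a processing step only when the content of one of its input files has changed since the previous run.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Iterable


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_stage(deps: Iterable[Path], lock_file: Path, command: Callable[[], None]) -> bool:
    """Run `command` only if a dependency changed since the last recorded run.

    Hypothetical helper: returns True if the stage ran, False if it was skipped.
    """
    # Fingerprint every input file by content, not by timestamp.
    current = {str(p): file_digest(p) for p in deps}
    previous = json.loads(lock_file.read_text()) if lock_file.exists() else None
    if current == previous:
        return False  # inputs unchanged, so the earlier outputs are still valid
    command()
    # Record the fingerprints so the next call can detect changes.
    lock_file.write_text(json.dumps(current, indent=2, sort_keys=True))
    return True
```

Calling `run_stage` twice with unchanged inputs executes the command once. Editing any input file causes the next call to execute it again, which is the behavior that lets a pipeline be rebuilt from raw data on demand.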
DVC and Git
DVC stores large files and datasets in separate storage, outside of Git.[4] This storage can be on the user’s computer or hosted on any major cloud storage provider,[15][5] such as AWS S3, Google Cloud Storage, and Microsoft Azure Blob Storage.[16][17][18] DVC users may also set up a remote repository on any server and connect to it remotely.[4]
When a user stores their data and models in the remote repository, a text file is created in their Git repository that points to the actual data in remote storage.[2][19]
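The pointer-file mechanism described above can be sketched in a few lines of Python. This is an illustration only and does not reproduce DVC's actual file format or storage layout. The helper names `track` and `restore`, the `.pointer.json` suffix, and the choice of SHA-256 are assumptions. A local directory stands in for remote storage.

```python
import hashlib
import json
import shutil
from pathlib import Path


def track(data_file: Path, storage_dir: Path) -> Path:
    """Copy data_file into storage_dir under its content hash and write a pointer file.

    Hypothetical helper: the small pointer file is what would be committed to Git.
    """
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    storage_dir.mkdir(parents=True, exist_ok=True)
    # Store the large file under a name derived from its content.
    shutil.copy2(data_file, storage_dir / digest)
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"sha256": digest, "path": data_file.name}, indent=2))
    return pointer


def restore(pointer: Path, storage_dir: Path) -> Path:
    """Recreate the data file next to its pointer from the stored copy."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copy2(storage_dir / meta["sha256"], target)
    return target
```

Because the pointer holds only a hash and a file name, it stays small regardless of the size of the data. Checking out an older revision of the pointer and calling `restore` would bring back the matching version of the data.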
Features
References
- ^ Hewage, Nipuni; Meedeniya, Dulani (February 2022). "Machine Learning Operations: A Survey on MLOps Tool Support". ResearchGate.
- ^ a b Barrak, Amine; Eghan, Ellis E.; Adams, Bram (March 2021). "On the Co-evolution of ML Pipelines and Source Code - Empirical Study of DVC Projects". IEEE Xplore.
- ^ Wiggers, Kyle. "MLOps startup Iterative.ai nabs $20M". VentureBeat.
- ^ a b c d Ivancic, Kristijan. "Data Version Control With Python and DVC". Real Python.
- ^ a b "MLOps Company Iterative Achieves Significant Customer and Company Growth in 2021". Business Wire.
- ^ Hall, Susan. "Iterative.ai: Git-Based Machine Learning Tools for ML Engineers". The New Stack.
- ^ "What is DVC?". MLOps Guide.
- ^ Petrov, Dmitry. "DVC 3 Years and 1.0 Pre-release". Iterative.ai.
- ^ Anadiotis, George. "Streamlining data science with open source: Data version control and continuous machine learning". ZDNET.
- ^ Petrov, Dmitry. "The Road to AI Hell Starts with Good MLOps Intentions". The New Stack.
- ^ Lardinois, Frederic. "Iterative raises $20M for its MLOps platform". TechCrunch.
- ^ "AITech interview with Dmitry Petrov, Co-Founder & CEO at Iterative.ai". AI Tech Park.
- ^ "Data Versioning for CD4ML – Part 2". AI Singapore.
- ^ Baena, Daniel. "How to build an efficient Machine Learning project workflow using Data Version Control (DVC)". Rappi Tech.
- ^ "DVC Documentation. remote add". dvc.org/doc.
- ^ Vizard, Michael. "Iterative.ai updates MLOps platform to streamline and support cloud provisioning". VentureBeat.
- ^ Kulkarni, Amit. "Tracking ML Experiments With Data Version Control". Analytics Vidhya.
- ^ Vergara, Ryan. "How to Get Started with Data Version Control (DVC)". HackerNoon.
- ^ Tran, Khuyen. "Introduction to DVC: Data Version Control Tool for Machine Learning Projects". Towards Data Science.