{"id":269,"date":"2026-04-03T13:22:57","date_gmt":"2026-04-03T13:22:57","guid":{"rendered":"https:\/\/blog.ngocha.biz\/?p=269"},"modified":"2026-04-03T13:22:57","modified_gmt":"2026-04-03T13:22:57","slug":"dvc-tutorial-for-beginners","status":"publish","type":"post","link":"https:\/\/blog.ngocha.biz\/?p=269","title":{"rendered":"Data Version Control (DVC) Tutorial For Beginners"},"content":{"rendered":"<p>One of the key concepts in <a href=\"https:\/\/devopscube.com\/devops-to-mlops\/\" rel=\"noreferrer\">MLOps<\/a> is data versioning, and DVC is one of the key open source tools you should know for managing it.<\/p>\n<p>We created this guide to help beginners understand and learn DVC hands-on. Here is what you will learn from this guide.<\/p>\n<ul>\n<li>What DVC is and how it works.<\/li>\n<li>How to configure remote storage for DVC.<\/li>\n<li>How to push, pull, and switch between different data versions.<\/li>\n<li>How to define an ML pipeline as code using <strong><code>dvc.yaml<\/code><\/strong><\/li>\n<li>How DVC integrates with CI\/CD and Airflow<\/li>\n<li>Garbage collection for DVC, and more.<\/li>\n<\/ul>\n<p>Let&#8217;s get started.<\/p>\n<h2 id=\"what-is-dvc\">What is DVC?<\/h2>\n<p><a href=\"https:\/\/dvc.org\/?ref=devopscube.com\" rel=\"noreferrer\">DVC (Data Version Control)<\/a> is an open source tool that works alongside version control tools like Git to version data. You can call it &#8220;<em>Git for Data<\/em>&#8221;.<\/p>\n<p><em>Why can&#8217;t we use Git for this?<\/em> Well, Git is not designed for large binary files: every clone has to download every version ever committed, and hosts such as GitHub reject files larger than 100 MB. A 2GB training dataset or a 500MB model therefore can&#8217;t live in a Git repository. This is where DVC comes in.<\/p>\n<p>DVC provides <strong>Git-like version control for data<\/strong>, models, and large files without storing the actual files in Git. 
It <strong>stores lightweight pointer files<\/strong> (<code>.dvc<\/code> files) in Git, while the actual data resides in remote storage (e.g., Amazon S3).<\/p>\n<p>Simply put, it is the <strong>bridge between your Git repo and the storage<\/strong> where the data resides. Git tracks your code and the <code>.dvc<\/code> pointer files. Your actual data resides in remote storage such as AWS S3. DVC manages the sync between the two.<\/p>\n<div class=\"kg-card kg-callout-card kg-callout-card-blue\">\n<div class=\"kg-callout-emoji\">\ud83d\udca1<\/div>\n<div class=\"kg-callout-text\">An important thing to understand is that&nbsp;<b><strong style=\"white-space: pre-wrap;\">DVC is just a lightweight CLI tool, <\/strong><\/b>not a service. It runs wherever you work with your Git repo: for example, on your developer workstation, on an Airflow worker, or inside a CI\/CD runner (GitHub Actions, Jenkins, etc.).<\/div>\n<\/div>\n<p>The following image illustrates how DVC fits in with the local workstation, GitHub, and remote storage. 
<\/p>\n<figure class=\"kg-card kg-image-card\"><img decoding=\"async\" src=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/03\/image-84.png\" class=\"kg-image\" alt=\"\" loading=\"lazy\" width=\"1422\" height=\"1406\" srcset=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w600\/2026\/03\/image-84.png 600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1000\/2026\/03\/image-84.png 1000w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/03\/image-84.png 1422w\" sizes=\"auto, (min-width: 720px) 720px\"><\/figure>\n<h2 id=\"setup-prerequisites\">Setup Prerequisites<\/h2>\n<p>Before starting, make sure you have the following set up on your workstation.<\/p>\n<ul>\n<li>Python 3<\/li>\n<li>A GitHub repository<\/li>\n<li><a href=\"https:\/\/devopscube.com\/install-configure-aws-cli-linux\/\" rel=\"noreferrer\"><strong>AWS CLI<\/strong><\/a>&nbsp;configured with credentials<\/li>\n<li>An AWS S3 bucket (e.g.,&nbsp;<code>ml-dvc-store<\/code>)<\/li>\n<li>IAM permissions:&nbsp;<code>s3:GetObject<\/code>,&nbsp;<code>s3:PutObject<\/code>,&nbsp;<code>s3:ListBucket<\/code>,&nbsp;<code>s3:DeleteObject<\/code><\/li>\n<\/ul>\n<div class=\"kg-card kg-callout-card kg-callout-card-yellow\">\n<div class=\"kg-callout-emoji\">\u26a0\ufe0f<\/div>\n<div class=\"kg-callout-text\"><b><strong style=\"white-space: pre-wrap;\">Security note:<\/strong><\/b> In real projects, always create a dedicated IAM user or role for DVC access. Do not use your root account credentials. 
If you are on EC2 or EKS, use an IAM role instead of static credentials.<\/div>\n<\/div>\n<h2 id=\"set-up-dvc-with-a-repository-remote-storage\">Set Up DVC With a Repository &amp; Remote Storage<\/h2>\n<p>To understand how DVC works, we will set up DVC with AWS S3 as the remote object storage, version a dataset with DVC, and push it to S3.<\/p>\n<p>Here is what we are going to do.<\/p>\n<ul>\n<li>Initialize DVC in the Git repository<\/li>\n<li>Configure S3 as the remote storage<\/li>\n<li>Tell DVC to track the dataset (<code>dvc add<\/code>)<\/li>\n<li>Push the dataset to S3 (<code>dvc push<\/code>)<\/li>\n<li>Commit the <code>.dvc<\/code> pointer file to Git<\/li>\n<\/ul>\n<p>The following diagram illustrates this workflow.<\/p>\n<figure class=\"kg-card kg-image-card\"><img decoding=\"async\" src=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/03\/dvc03-1.png\" class=\"kg-image\" alt=\"DVC With Remote s3 Storage workflow\" loading=\"lazy\" width=\"2000\" height=\"1248\" srcset=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w600\/2026\/03\/dvc03-1.png 600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1000\/2026\/03\/dvc03-1.png 1000w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1600\/2026\/03\/dvc03-1.png 1600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w2400\/2026\/03\/dvc03-1.png 2400w\" sizes=\"auto, (min-width: 720px) 720px\"><\/figure>\n<p>Let&#8217;s get started.<\/p>\n<div class=\"kg-card kg-callout-card kg-callout-card-blue\">\n<div class=\"kg-callout-emoji\">\ud83d\udca1<\/div>\n<div class=\"kg-callout-text\">Before moving forward, make sure your local workstation has read and write access to your GitHub repository and AWS S3 bucket.<\/div>\n<\/div>\n<h3 
id=\"step-1-clone-the-repository\">Step 1: Clone The Repository<\/h3>\n<p>You can set up DVC in any Git repository. All you need is permission to pull from it and push changes to it.<\/p>\n<p>For this guide, clone the <a href=\"https:\/\/github.com\/techiescamp\/mlops-for-devops?ref=devopscube.com\" rel=\"noreferrer\">MLOps For DevOps repository<\/a>. The dataset we will version is located at:<\/p>\n<pre><code class=\"language-bash\">phase-1-local-dev\/datasets\/employee_attrition.csv<\/code><\/pre>\n<p><strong>Fork the repository<\/strong> so that you can push DVC configs back to your own repository.<\/p>\n<pre><code class=\"language-bash\">https:\/\/github.com\/techiescamp\/mlops-for-devops.git<\/code><\/pre>\n<p>Let&#8217;s move on to the DVC setup.<\/p>\n<h3 id=\"step-2-install-dvc\">Step 2: Install DVC<\/h3>\n<p>Run the following commands to create a virtual environment named <code>dvc-env<\/code> and activate it.<\/p>\n<pre><code class=\"language-bash\">python3 -m venv dvc-env\n\nsource dvc-env\/bin\/activate<\/code><\/pre>\n<p>Install DVC and the DVC S3 plugin.<\/p>\n<pre><code class=\"language-bash\">pip install dvc dvc-s3<\/code><\/pre>\n<ul>\n<li><strong>dvc: <\/strong>The DVC command-line tool<\/li>\n<li><strong>dvc-s3: <\/strong>The DVC plugin that adds support for AWS S3 as remote storage<\/li>\n<\/ul>\n<p>Run the following command to verify that DVC is installed.<\/p>\n<pre><code class=\"language-bash\">dvc --version<\/code><\/pre>\n<h3 id=\"step-3-initialize-dvc\">Step 3:  Initialize DVC<\/h3>\n<p>Run the following command inside the root of the mlops-for-devops repo folder.<\/p>\n<pre><code class=\"language-bash\">dvc init<\/code><\/pre>\n<p>This will create a <strong><code>.dvc<\/code><\/strong> folder containing the configuration file and a <strong><code>.gitignore<\/code><\/strong> file.<\/p>\n<p>You can see the following folder structure.<\/p>\n<pre><code class=\"language-bash\">.dvc\/\n \u251c\u2500\u2500 config\n \u2514\u2500\u2500 
.gitignore<\/code><\/pre>\n<ul>\n<li><strong><code>.dvc\/config<\/code><\/strong> is the main DVC configuration file.<\/li>\n<li><strong><code>.dvc\/.gitignore<\/code><\/strong> keeps the DVC cache out of Git.<\/li>\n<\/ul>\n<div class=\"kg-card kg-callout-card kg-callout-card-blue\">\n<div class=\"kg-callout-emoji\">\ud83d\udca1<\/div>\n<div class=\"kg-callout-text\">You can also initialize DVC in a subdirectory using the <code spellcheck=\"false\" style=\"white-space: pre-wrap;\">--subdir<\/code> flag, but it is not recommended for standard MLOps workflows. <\/div>\n<\/div>\n<h3 id=\"step-4-set-up-remote-storage-for-dvc\">Step 4: Set Up Remote Storage for DVC<\/h3>\n<p>Now, we need to configure an S3 bucket as the remote storage for DVC.<\/p>\n<p>Run the following command to add the bucket.<\/p>\n<div class=\"kg-card kg-callout-card kg-callout-card-blue\">\n<div class=\"kg-callout-emoji\">\ud83d\udccc<\/div>\n<div class=\"kg-callout-text\">Replace the bucket name in the command below with your own before running it.<\/div>\n<\/div>\n<pre><code class=\"language-bash\">dvc remote add -d ml-dataset s3:\/\/dcube-attrition-data\/datasets\n<\/code><\/pre>\n<p>This command tells DVC to store all versioned data in the given S3 location.<\/p>\n<ul>\n<li><strong>-d<\/strong>: Sets this remote as the default remote store<\/li>\n<li><strong>ml-dataset<\/strong>: Alias for this remote store; you can use any name<\/li>\n<\/ul>\n<div class=\"kg-card kg-callout-card kg-callout-card-blue\">\n<div class=\"kg-callout-emoji\">\ud83d\udca1<\/div>\n<div class=\"kg-callout-text\">Make sure the bucket region is configured in the AWS CLI, because DVC reads the bucket region from your workstation&#8217;s AWS config.<\/div>\n<\/div>\n<p>Now, verify that the remote was added.<\/p>\n<pre><code class=\"language-bash\">dvc remote list<\/code><\/pre>\n<p>You will get output similar to the following.<\/p>\n<pre><code class=\"language-bash\">ml-dataset      s3:\/\/dcube-attrition-data\/datasets  (default)<\/code><\/pre>\n<p>And, if you open the 
<strong><code>.dvc\/config<\/code><\/strong> file, you can see a configuration similar to this.<\/p>\n<pre><code class=\"language-bash\">[core]\n    remote = ml-dataset\n['remote \"ml-dataset\"']\n    url = s3:\/\/dcube-attrition-data\/datasets\n<\/code><\/pre>\n<p>In this:<\/p>\n<ul>\n<li><strong>remote<\/strong>: The alias of the default remote storage.<\/li>\n<li><strong>url<\/strong>: The S3 URL where DVC will store its data<\/li>\n<\/ul>\n<div class=\"kg-card kg-callout-card kg-callout-card-blue\">\n<div class=\"kg-callout-emoji\">\ud83d\udca1<\/div>\n<div class=\"kg-callout-text\">If the <code spellcheck=\"false\" style=\"white-space: pre-wrap;\">dvc-env<\/code> folder is not already listed in the .gitignore file, add it so that the virtual environment is not committed.<\/div>\n<\/div>\n<p>Finally, push the DVC configuration to GitHub.<\/p>\n<pre><code class=\"language-bash\">git add .dvc\/config\ngit commit -m \"Initialize DVC with S3 remote\"\ngit push origin main<\/code><\/pre>\n<p>Now, your Git repository is configured with DVC and its remote storage.<\/p>\n<h3 id=\"step-5-stop-tracking-dataset-in-git\">Step 5: Stop Tracking Dataset in Git<\/h3>\n<p><strong>Git and DVC cannot both track the same file<\/strong>. Git tracks the <strong><code>.dvc<\/code><\/strong> pointer file. DVC tracks the actual data.<\/p>\n<p>Use the following command to remove the dataset from Git&#8217;s tracking.<\/p>\n<pre><code class=\"language-bash\">git rm -r --cached phase-1-local-dev\/datasets\/employee_attrition.csv\n<\/code><\/pre>\n<div class=\"kg-card kg-callout-card kg-callout-card-blue\">\n<div class=\"kg-callout-text\"><b><strong style=\"white-space: pre-wrap;\">Note: <\/strong><\/b>Here we are not deleting the dataset from disk. 
We are only telling Git to stop tracking it.<\/div>\n<\/div>\n<h3 id=\"step-6-version-the-dataset\">Step 6: Version the Dataset<\/h3>\n<p>Let&#8217;s version the dataset and push it to remote storage.<\/p>\n<p>The dataset that needs to be versioned is located at:<\/p>\n<pre><code class=\"language-bash\">phase-1-local-dev\/datasets\/employee_attrition.csv<\/code><\/pre>\n<p>We need to tell DVC to track the dataset using the following command.<\/p>\n<pre><code class=\"language-bash\">dvc add phase-1-local-dev\/datasets\/employee_attrition.csv\n<\/code><\/pre>\n<p>This will:<\/p>\n<ul>\n<li>Hash the <code>employee_attrition.csv<\/code> file and copy it to DVC&#8217;s local cache (<code>.dvc\/cache\/<\/code>).<\/li>\n<li>Create a new <code>employee_attrition.csv.dvc<\/code> file next to the dataset to store its metadata.<\/li>\n<li>Add the <code>employee_attrition.csv<\/code> file name to the <strong><code>.gitignore<\/code><\/strong> file to avoid pushing the raw file to the Git repository.<\/li>\n<\/ul>\n<p>After the command completes, your directory structure will look like the following.<\/p>\n<figure class=\"kg-card kg-image-card\"><img decoding=\"async\" src=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/03\/image-92.png\" class=\"kg-image\" alt=\"dvc directory structure\" loading=\"lazy\" width=\"1341\" height=\"714\" srcset=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w600\/2026\/03\/image-92.png 600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1000\/2026\/03\/image-92.png 1000w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/03\/image-92.png 1341w\" sizes=\"auto, (min-width: 720px) 720px\"><\/figure>\n<p>The <code>employee_attrition.csv.dvc<\/code> pointer file looks like this.<\/p>\n<pre><code class=\"language-bash\">outs:\n- md5: 
5911ebf0fa91033fb323989b7c6d7fbc\n  size: 9983309\n  hash: md5\n  path: employee_attrition.csv\n<\/code><\/pre>\n<p>Git tracks only this small pointer file, not the actual data file. The<strong> MD5 hash acts as the version identifier<\/strong> that DVC uses to fetch the exact file from S3.<\/p>\n<p>Now that tracking is enabled, the next step is to push the data to the remote storage.<\/p>\n<h3 id=\"step-7-push-the-dataset-to-s3\">Step 7: Push the Dataset to S3<\/h3>\n<p>When you run <code>dvc push<\/code>, DVC uploads only the files that are tracked (via <code>.dvc<\/code> files) to the configured remote storage (S3 in our case). Since we added only <code>employee_attrition.csv<\/code> to DVC, only that file will be pushed.<\/p>\n<p>Now, run the following command to push the dataset to the configured S3 bucket.<\/p>\n<pre><code class=\"language-bash\">dvc push\n<\/code><\/pre>\n<p>You will get output similar to the following.<\/p>\n<pre><code class=\"language-bash\">Collecting                                                                             |1.00 [00:00,  214entry\/s]\nPushing\n1 file pushed <\/code><\/pre>\n<p>In S3, DVC stores files using a <strong>content-addressed layout. <\/strong>Each file is stored under a path derived from its MD5 hash: the first two characters of the hash become a directory name and the remaining characters become the file name, as in the following example layout.<\/p>\n<pre><code class=\"language-bash\">s3:\/\/your-bucket\/\n  \u2514\u2500\u2500 files\/\n      \u2514\u2500\u2500 md5\/\n          \u2514\u2500\u2500 8f\/\n              \u2514\u2500\u2500 28b4894c8d5aac17cc23e68127a768<\/code><\/pre>\n<p>The same structure exists in your local cache at <code>.dvc\/cache\/<\/code>.<\/p>\n<div class=\"kg-card kg-callout-card kg-callout-card-blue\">\n<div class=\"kg-callout-emoji\">\ud83d\udca1<\/div>\n<div class=\"kg-callout-text\"><b><strong style=\"white-space: pre-wrap;\">Why hash-based storage?<\/strong><\/b> The primary reason is deduplication. If the content of two files is exactly the same, it is stored only once (same content = same hash). 
The hash also serves as a version ID to track versions.<\/div>\n<\/div>\n<h3 id=\"step-8-commit-the-changes-to-git\">Step 8: Commit the changes to Git<\/h3>\n<p>This is the <strong>most important step<\/strong> in the DVC workflow. After pushing the data to S3, you must commit the changes to Git and push them. Without the commit, the dataset version is not recorded. This means:<\/p>\n<ul>\n<li>Git will not track which version of the data was used<\/li>\n<li>DVC cannot map the data version to your code<\/li>\n<\/ul>\n<p>Let&#8217;s commit the changes.<\/p>\n<pre><code class=\"language-bash\">git add .\ngit commit -m \"Added Dataset version 1\"\ngit push origin main<\/code><\/pre>\n<div class=\"kg-card kg-callout-card kg-callout-card-blue\">\n<div class=\"kg-callout-emoji\">\ud83d\udca1<\/div>\n<div class=\"kg-callout-text\"><b><strong style=\"white-space: pre-wrap;\">Commit Message Is Important Here:<\/strong><\/b> You should use descriptive messages like <code spellcheck=\"false\" style=\"white-space: pre-wrap;\">Dataset: version 2 - added Q3 records<\/code> so you can identify versions at a glance in <code spellcheck=\"false\" style=\"white-space: pre-wrap;\">git log<\/code>. <\/div>\n<\/div>\n<p>Now, anyone can run <code>git checkout<\/code> + <code>dvc pull<\/code> and<strong> reproduce the exact setup<\/strong> in the future. That is, they can recreate the same project state (code, data, and configuration) and rerun the experiment if needed.<\/p>\n<h2 id=\"pull-a-specific-dataset-version\">Pull a Specific Dataset Version<\/h2>\n<p>This is the <strong>core use case<\/strong> for DVC. 
A data scientist, a CI\/CD runner, or an Airflow worker can <strong>pull any historical version of the dataset<\/strong>.<\/p>\n<p>To pull the version of the data referenced by your current checkout (normally the latest), simply run:<\/p>\n<pre><code class=\"language-bash\">dvc pull<\/code><\/pre>\n<p>Now, <em>what if you want to pull a specific version?<\/em> <\/p>\n<p>As discussed earlier, DVC versioning is tied to Git commits. The <code>.dvc<\/code> file stores the metadata for each version. So to get a specific dataset version, you need to:<\/p>\n<ul>\n<li>Check out the corresponding Git commit (or just the <code>.dvc<\/code> file from that commit)<\/li>\n<li>Then run <code>dvc pull<\/code><\/li>\n<\/ul>\n<p>For example, run <code>git log<\/code> to get the commit history in a short format.<\/p>\n<pre><code class=\"language-bash\">git log --oneline\n<\/code><\/pre>\n<p>This will list each commit ID with its commit message as shown below.<\/p>\n<pre><code class=\"language-bash\">adaaee9  Added Dataset version 2\n02d88b7  Added Dataset version 1<\/code><\/pre>\n<p>If you want <strong>version 1<\/strong>, use its commit ID to check out the pointer file as shown below. We are checking out only the .dvc file, not the entire commit. The path is relative to the repository root, where the earlier commands were run.<\/p>\n<pre><code class=\"language-bash\">git checkout 02d88b7 -- phase-1-local-dev\/datasets\/employee_attrition.csv.dvc\n<\/code><\/pre>\n<p>This restores the .dvc file that holds the version 1 details. Then run the pull command, which downloads that exact dataset version (version 1) from S3.<\/p>\n<pre><code class=\"language-bash\">dvc pull<\/code><\/pre>\n<p>If you now inspect the dataset, you will see version 1. 
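<\/p>\n<p>To confirm that the workspace really holds version 1, you can compare the hash recorded in the pointer file with the checksum of the file on disk. The following is a minimal sketch, not part of the original repository. It assumes that you run it from the repository root, that the pointer file uses the DVC 3.x format (it contains <code>hash: md5<\/code>), and that <code>md5sum<\/code> is available on your system.<\/p>\n<pre><code class=\"language-bash\"># Hash recorded in the pointer file that was just checked out\ngrep 'md5:' phase-1-local-dev\/datasets\/employee_attrition.csv.dvc\n\n# Checksum of the file that dvc pull placed in the workspace\nmd5sum phase-1-local-dev\/datasets\/employee_attrition.csv\n<\/code><\/pre>\n<p>If the two values match, the workspace contains exactly the data that the checked-out pointer file describes.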
<\/p>\n<p>And, if you want to switch back to the latest version, run the following commands from the repository root.<\/p>\n<pre><code class=\"language-bash\">git checkout HEAD -- phase-1-local-dev\/datasets\/employee_attrition.csv.dvc\ndvc pull<\/code><\/pre>\n<h2 id=\"creating-reproducible-dvc-pipelines-with-dvcyaml\">Creating Reproducible DVC Pipelines With <code>dvc.yaml<\/code><\/h2>\n<p>In the above steps, we used DVC commands only to version a dataset. In an actual project, more steps such as cleaning and processing are involved before the dataset is versioned.<\/p>\n<p>Each step takes data from the previous step and produces an output file for the next step. Running these steps manually every time is repetitive and wastes time.<\/p>\n<p>This is where <code>dvc.yaml<\/code> helps. Think of it as a <strong>Makefile for your ML pipeline. <\/strong>You define each processing step once, and DVC handles execution, dependency tracking, and output versioning automatically. You can call it your <strong>ML pipeline as code.<\/strong><\/p>\n<p>Given below is an example <code>dvc.yaml<\/code> file based on the first two steps of our data preparation:<\/p>\n<pre><code class=\"language-yaml\">stages:\n  ingest:\n    cmd: python -m data_preparation.01_ingestion\n    wdir: src\n    deps:\n      - ..\/datasets\/employee_attrition.csv\n      - data_preparation\/01_ingestion.py\n    outs:\n      - ..\/datasets\/processed\/raw_ingested.csv\n\n  validate:\n    cmd: python -m data_preparation.02_validation\n    wdir: src\n    deps:\n      - ..\/datasets\/processed\/raw_ingested.csv\n      - data_preparation\/02_validation.py\n    outs:\n      - ..\/datasets\/processed\/validated.csv<\/code><\/pre>\n<p>In this file, you can see the following fields:<\/p>\n<ul>\n<li><strong>stages<\/strong> &#8211; Every step of the pipeline is defined inside this field.<\/li>\n<li><strong>cmd<\/strong> &#8211; The command to run in the step.<\/li>\n<li><strong>wdir<\/strong> &#8211; The working directory from 
which the command runs.<\/li>\n<li><strong>deps<\/strong> &#8211; Files the step depends on.<\/li>\n<li><strong>outs<\/strong> &#8211; Locations of the output files that the stage produces.<\/li>\n<\/ul>\n<p>To run the pipeline defined in <strong><code>dvc.yaml<\/code><\/strong>, use the following command.<\/p>\n<pre><code class=\"language-bash\">dvc repro<\/code><\/pre>\n<p>This will run the stages specified inside the <strong><code>dvc.yaml<\/code><\/strong> file one by one, in dependency order.<\/p>\n<div class=\"kg-card kg-callout-card kg-callout-card-blue\">\n<div class=\"kg-callout-emoji\">\ud83d\udca1<\/div>\n<div class=\"kg-callout-text\">A stage runs only if one of its deps (or the stage definition itself) has changed since the last run. Otherwise, DVC skips that stage.<\/div>\n<\/div>\n<p>After the stages in the <code>dvc.yaml<\/code> file have run, DVC creates a <code>dvc.lock<\/code> file that records the hashes of the dependencies and outputs of each stage so it can track them. Then, push the generated data to S3 using <code>dvc push<\/code> and commit the changes (including <code>dvc.yaml<\/code> and <code>dvc.lock<\/code>) to Git.<\/p>\n<h2 id=\"quick-reference-common-dvc-commands\">Quick Reference: Common DVC Commands<\/h2>\n<p>Here is a quick reference of DVC commands.<\/p>\n<pre><code class=\"language-bash\"># Setup\ndvc init                        # initialize DVC in a git repo\ndvc remote add -d remote s3:\/\/  # set S3 as default remote\n\n# Tracking files\ndvc add data\/dataset.csv        # start tracking a file\ndvc add data\/                   # track an entire directory\n\n# Sync with remote\ndvc push                        # upload tracked files to S3\ndvc pull                        # download files from S3\ndvc fetch                       # download to cache without checkout\n\n# Pipelines\ndvc repro                       # run\/update the pipeline\ndvc dag                         # visualize the pipeline graph\ndvc status                      # check what's out of date\n\n# Experiment tracking\ndvc params diff                 # compare params across 
commits\ndvc metrics show                # display pipeline metrics\n<\/code><\/pre>\n<h2 id=\"how-dvc-works-in-a-real-mlops-pipeline\">How DVC Works in a Real MLOps Pipeline<\/h2>\n<p>What we have seen so far is<strong> mostly the developer side of things<\/strong>. We manually ran <code>dvc add<\/code> and <code>dvc push<\/code>, and committed to Git to version the data.<\/p>\n<p>One common question that comes up when working with DVC is,<strong><em> where does DVC actually run in real projects? What does the workflow look like?<\/em><\/strong><\/p>\n<p>Well, in actual production workflows, <strong>data is usually managed by CI\/CD pipelines<\/strong> or workflow orchestrators like Apache Airflow.<\/p>\n<h3 id=\"dvc-in-a-cicd-pipeline\">DVC in a CI\/CD Pipeline<\/h3>\n<p>The diagram given below shows how DVC works with a typical CI\/CD workflow.<\/p>\n<figure class=\"kg-card kg-image-card\"><img decoding=\"async\" src=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/03\/dvc02.png\" class=\"kg-image\" alt=\"DVC working in a CI\/CD pipeline\" loading=\"lazy\" width=\"2000\" height=\"1215\" srcset=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w600\/2026\/03\/dvc02.png 600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1000\/2026\/03\/dvc02.png 1000w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1600\/2026\/03\/dvc02.png 1600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w2400\/2026\/03\/dvc02.png 2400w\" sizes=\"auto, (min-width: 720px) 720px\"><\/figure>\n<p>A typical automated workflow with DVC looks like this.<\/p>\n<ul>\n<li>A pipeline gets triggered (for example, by a scheduled job, a PR merge, or a new raw data upload)<\/li>\n<li>The workflow starts and the runner clones the repository into its working 
directory<\/li>\n<li>It pulls raw data from the configured source storage<\/li>\n<li>It runs the data preparation steps<\/li>\n<li>It then <strong>runs dvc add and dvc push<\/strong> to version the new data and push it to the remote DVC storage.<\/li>\n<li>The <code>.dvc<\/code> pointer files are committed back to Git.<\/li>\n<\/ul>\n<div class=\"kg-card kg-callout-card kg-callout-card-blue\">\n<div class=\"kg-callout-emoji\">\ud83d\udca1<\/div>\n<div class=\"kg-callout-text\">For this to work, your CI\/CD runners need read\/write access to both GitHub and the S3 bucket.<\/div>\n<\/div>\n<h3 id=\"dvc-in-an-airflow-dag\">DVC in an Airflow DAG<\/h3>\n<p>The following image illustrates the high-level workflow from an Airflow perspective.<\/p>\n<figure class=\"kg-card kg-image-card kg-card-hascaption\"><img decoding=\"async\" src=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/03\/image-89.png\" class=\"kg-image\" alt=\"\" loading=\"lazy\" width=\"1332\" height=\"1542\" srcset=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w600\/2026\/03\/image-89.png 600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1000\/2026\/03\/image-89.png 1000w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/03\/image-89.png 1332w\" sizes=\"auto, (min-width: 720px) 720px\"><figcaption><span style=\"white-space: pre-wrap;\">Using DVC with Airflow DAGs<\/span><\/figcaption><\/figure>\n<p>For all of this to work, your <strong>CI\/CD runners or Airflow workers need pull and push access to the Git repository<\/strong> and the object storage you are going to use.<\/p>\n<h2 id=\"dvc-garbage-collection\">DVC Garbage Collection<\/h2>\n<p>Every <code>dvc push<\/code> of changed data adds a new version to S3. 
This means the data in your S3 DVC store keeps growing over time, and old versions of the data just sit there without any further use. To solve this, <strong>DVC provides a garbage collection command<\/strong> that removes files no longer referenced by the commits you choose to keep.<\/p>\n<div class=\"kg-card kg-callout-card kg-callout-card-blue\">\n<div class=\"kg-callout-emoji\">\ud83d\udca1<\/div>\n<div class=\"kg-callout-text\">To let DVC delete old versions from S3, you need to add the <code spellcheck=\"false\" style=\"white-space: pre-wrap;\">s3:DeleteObject<\/code> permission to your IAM role.<\/div>\n<\/div>\n<p><strong>Always do a dry run first<\/strong> to see what will be deleted. The following command lists the files that garbage collection would delete, without deleting anything.<\/p>\n<pre><code class=\"language-bash\">dvc gc --remote &lt;your_remote_name&gt; --all-commits --dry\n<\/code><\/pre>\n<p>Once you have confirmed the output, run the actual command.<\/p>\n<pre><code class=\"language-bash\">dvc gc --remote &lt;your_remote_name&gt; --all-commits\n<\/code><\/pre>\n<p>With <code>--all-commits<\/code>, DVC keeps every object that is referenced by any commit in the repository and removes the rest. The command removes these unreferenced objects not only from the remote storage but also from the local cache.<\/p>\n<div class=\"kg-card kg-callout-card kg-callout-card-blue\">\n<div class=\"kg-callout-emoji\">\ud83d\udca1<\/div>\n<div class=\"kg-callout-text\"><b><strong style=\"white-space: pre-wrap;\">DVC Best Practice: <\/strong><\/b>Always keep at least 4-6 versions, and run the garbage collection command on a schedule (for example, with a cron job) to clean up older versions. It is similar to using image retention policies in ECR.<\/div>\n<\/div>\n<h2 id=\"dvc-alternatives\">DVC Alternatives<\/h2>\n<p>DVC is not the only tool available for versioning data. 
Below are some of the alternatives to DVC.<\/p>\n<ul>\n<li><a href=\"https:\/\/lakefs.io\/?ref=devopscube.com\" rel=\"noreferrer\">LakeFS<\/a> &#8211; Versions the whole object storage instead of individual files. It is mostly suited to large-scale datasets and data lake governance.<\/li>\n<li><a href=\"https:\/\/git-lfs.com\/?ref=devopscube.com\" rel=\"noreferrer\">Git LFS<\/a> &#8211; Saves the versioned dataset in the hosting provider&#8217;s (for example, GitHub&#8217;s) own LFS storage and saves the pointer file in Git. It is used for simpler setups.<\/li>\n<\/ul>\n<h2 id=\"clean-up\">Clean Up<\/h2>\n<p>If you no longer need the DVC configuration, run the following command.<\/p>\n<pre><code class=\"language-bash\">dvc destroy<\/code><\/pre>\n<p>This removes the <code>.dvc\/<\/code> folder and all DVC configuration from the project. Data already pushed to S3 is not deleted. Then commit and push the changes to clean up Git as well.<\/p>\n<h2 id=\"conclusion\">Conclusion<\/h2>\n<p>You now have a complete picture of how DVC works, from initial setup to production integration. We looked at both the developer experience and how DVC integrates with CI\/CD and workflow automation tools.<\/p>\n<p>In the next guide, we will build a full Airflow DAG on EKS that automates this entire workflow.<\/p>\n<p>Over to you!<\/p>\n<p>How are you versioning data now? Are you planning to use DVC in your projects? 
Comment below.<\/p>\n<hr>\n<p><strong>Source:<\/strong> <a href=\"https:\/\/devopscube.com\/dvc-tutorial-for-beginners\/\" target=\"_blank\" rel=\"noopener noreferrer\">Data Version Control (DVC) Tutorial For Beginners - DevOpsCube<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Source: https:\/\/devopscube.com\/dvc-tutorial-for-beginners\/<\/p>\n","protected":false},"author":1,"featured_media":270,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-269","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-devops"],"_links":{"self":[{"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=\/wp\/v2\/posts\/269","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=269"}],"version-history":[{"count":0,"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=\/wp\/v2\/posts\/269\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=\/wp\/v2\/media\/270"}],"wp:attachment":[{"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=269"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=269"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=269"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}