Skip to content

Commit

Permalink
Scipy poster (iterative#1503)
Browse files Browse the repository at this point in the history
* Scipy poster

* more descriptive url

* add fake link to exclude list
  • Loading branch information
elleobrien authored Jun 27, 2020
1 parent b87679a commit ea09b81
Show file tree
Hide file tree
Showing 4 changed files with 223 additions and 0 deletions.
222 changes: 222 additions & 0 deletions content/blog/2020-06-26-scipy-2020-dvc-poster.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,222 @@
---
title: 'Packaging data and machine learning models for sharing'
date: 2020-06-26
description: |
A virtual poster for SciPy 2020 about sharing versioned datasets and ML models with DVC.
descriptionLong: |
A virtual poster for SciPy 2020 about sharing versioned datasets and ML models with DVC.
picture: 2020-06-26/SciPy_2020.png
author: elle_obrien
commentsUrl: https://discuss.dvc.org/t/dvc-1-0-release/412
tags:
- Import
- SciPy
- Python
---

When I was doing my Ph.D., every time I published a paper I shared a public
GitHub repository with my dataset and scripts to reproduce my statistical
analyses. While it took a bit of work to get the repository in good shape for
sharing (cleaning up code, adding documentation), the process was
straightforward: upload everything to the repo!

But when I started working on deep learning projects, things got considerably
more complicated. For example, in a data journalism project I did with The
Pudding, I wanted to understand how hair style (particularly size!) changed over
the years. There were a lot of moving parts:

- A public dataset of yearbook photos released and maintained by Ginosar et al.
- A deep learning model I trained to segment the hair in yearbook photos
- A derivative dataset of "hair maps" for each photo in the original datasetr
- All the code to train the deep learning model and analyse the derivative
dataset

![](/uploads/images/2020-06-26/hairflow.png) _The parts of my big-hair-data
project: an original public dataset, a model for segmenting the images, a
derivative dataset of segment maps, and analysis scripts._

How would you share this with a collaborator, or open it up to the public?
Throwing it all in a GitHub repository was not an option. My model wouldn't fit
on GitHub because it was over the 100 MB size limit. I also wanted to preserve a
clear link between my derived dataset and the original- it should be obvious
exactly how I got the public dataset. And if that public dataset were to ever
change, I would ideally want it to be clear what version I used for my analyses.

This blog is about several different ways of "releasing" data science projects,
with an emphasis on preserving meaningful links about the origins of derived
data and models. I'm not making any strong assumptions about whether project
materials are relased within an organization (only to teammates, for example) or
to the whole internet.

Let's look at a few methods.

# Method One: artifacts in the cloud

When you work with big models and datasets, you often can't host them in a
GitHub repo. But you can put them in cloud storage, and then provide a script in
your GitHub repo to download them. For example, in the fantastic `gpt-2-simple`
project by Max Woolf, Max stores huge GPT-2 models in Google Drive and provides
a script to download a specified model to a user's local workspace if it isn't
already there.

Likewise, the Nvidia StyleGAN release provides a hardcoded URL to their model in
Google Drive storage. Both the `gpt-2-simple` and StyleGAN projects have custom
scripts to handle these big downloads, and largely thanks to the work of the
project maintainers, users only interact with the downloading process at a very
high level.

Considering some pros and cons of this approach:

| **Pros** | **Cons** |
| :-----------------------------------: | :----------------------------: |
| It's easy to put a model in a bucket | Hardcoded links are brittle |
| Works for pip packages | Need to write custom functions |
| No extra tools, just Python scripting | Downloads aren't versioned |

# Method Two: Hubs, Catalogs & Zoos

There are a (growing) number of websites willing to long-term host big models
and datasets, plus relevant meta-data, code, and publications. Some even allow
you to upload several versions of a project- it's not Git, for sure, but even
basic version control is something.

For example, [PyTorch Hub](https://pytorch.org/hub/) lets researchers publish
trained models developed in the PyTorch framework, along with code and papers.
It's easily searched and browsed, which makes projects discoverable.

For a dataset analog, Kaggle is similar- they host user-submitted datasets and
help other users find them. Both PyTorch Hub and Kaggle have APIs for
programmatically downloading artifacts.

| **Pros** | **Cons** |
| :----------------------: | :---------------------: |
| Browsable & discoverable | Centrally managed |
| Public | Public (no granularity) |
| Good with big models | Weak versioning support |

# Method Three: Packaging with DVC

DVC, or Data Version Control, is a Python project for extending Git version
control to large project artifacts like datasets and models. It's not a
replacement for Git- DVC works _with_ Git!

The basic idea is that your datasets and models are stored in a DVC repository,
which can be any cloud storage or server of your choice. DVC creates metadata
about file versions that can be tracked by Git and hosted on GitHub- so you can
share your datasets and models like any GitHub project, with all the benefits of
versioning. Let's look at a case study.

## Creating a DVC project

Say I have a project containing a dataset, model training code, and model.

```dvc
$ ls
data.csv
train.py
model.pkl
```

Say our model and dataset are large and we want to track them with DVC. For
remote storage, we want to use a personal S3 bucket. We would run:

```dvc
$ git init
$ dvc init
$ dvc remote add myremote s3://mybucket/myproject
$ dvc add data.csv model.pkl
$ dvc push
```

When I run these commands, I've initialized Git and DVC tracking. Next, I've set
a DVC repository- my S3 bucket. Then I've added `data.csv` and `model.pkl` to
DVC tracking. Finally, when I run `dvc push`, the model and dataset are pushed
to the S3 bucket. On my local machine, two meta-files are created:
`data.csv.dvc` and `model.pkl.dvc`. These can be tracked with Git!

```dvc
$ ls
data.csv.dvc
train.py
model.pkl.dvc
```

So after setting a remote Git repository, `git add`, `commit` and `push` like
usual (assuming you are a regualr Git user, that is):

```dvc
$ git remote add origin [email protected]:elle/myproject
$ git add . && git commit -m "first commit"
$ git push origin master
```

## Package management with DVC

Now let's say one of my teammates wants to access my work so far- specifically,
they want to see if another method for constructing features from raw data will
help model accuracy. I've given them permission to access my GitHub repository.
On their local machine, they'll run:

```dvc
$ dvc import https://github.com/elle/myproject data.csv model.pkl
```

This will download the latest version of the `data.csv` and `model.pkl`
artifacts to their local machine, as well as the DVC metafiles `data.csv.dvc`
and `model.pkl.csv` indicating the precise version and source.

Collaborators can also download artifacts from previous versions, releases, or
parallel feature branches of a project. For example, if I released a new version
of my project with a Git tag (say `v.2.0.1`), collaborators can run

```dvc
$ dvc get --rev v.2.0.1 \
https://github.com/elle/myproject data.csv
```

Lastly, because `dvc import` maintains a link between the downloaded artifacts
and my repository, collaborators can check for project updates with

```dvc
$ dvc update data.csv model.pkl
```

If new versions are detected, DVC automatically syncs the local workspace with
those versions.

## When should you do this?

In my own experience releasing a large public dataset with DVC, I've seen
several benefits:

- Within an hour, someone found data points I'd been missing. It was
straightforward to make a new release after patching this error.
- Several people modeled my dataset! Highly rewarding.
- Since GitHub is a widely used platform for code sharing, it's a natural fit
for open source scientific projects and has little overhead for potential
collaborators

To return to the pros and cons table:

| **Pros** | **Cons** |
| :------------------------------------------------: | :-----------------------------------------------------: |
| Git version your dataset | No GUI access to files in DVC remote |
| Granular sharing permissions | Collaborators need to use DVC |
| DVC abstracts away download scripts/hardcoded URLs | Can be serverless, but you need to manage cloud storage |

# The bottom line

Packaging models and datasets is a non-trivial part of the machine learning
workflow. DVC provides a method for giving users a Git-centric experience of
cloning or forking these artifacts, with an emphasis on _versioning artifacts_
and _abstracting away the processes of uploading, downloading, and storing
artifacts_. For projects with high complexity- like my hair project, which had
some gnarly dependencies and big artifacts- this kind of source control pays
off. If you don't know where your data came from or how it's been transformed,
it's impossible to be scientific.

Thanks for stopping by our virtual poster! I'm happy to take questions or
comments about how version control fits into the scientific workflow. Leave a
comment, reach out on Twitter, or send an email.
1 change: 1 addition & 0 deletions scripts/exclude-links.txt
Original file line number Diff line number Diff line change
Expand Up @@ -33,3 +33,4 @@ https://www.reddit.com/r/MachineLearning/comments/bx0apm/d_how_do_you_manage_you
https://www.youtube.com/embed/$
http://[email protected]/path
http://www.reddit.com/r/MachineLearning
https://github.com/elle/myproject
Binary file added static/uploads/images/2020-06-26/SciPy_2020.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added static/uploads/images/2020-06-26/hairflow.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

0 comments on commit ea09b81

Please sign in to comment.