Replicate: Version control for machine learning

You can store your experiment data in the cloud on Google Cloud Storage or Amazon S3.

This means you can store results from multiple training machines in one place and collaborate with other people.

When you list experiments on your local machine, Replicate will list all of the experiments that have been run anywhere by anyone, so you can easily compare them and download their results.

Log in to Google Cloud

To store data on Google Cloud Storage, you first need to install the Google Cloud SDK if you haven't already. Then, run gcloud init and follow its instructions:

  • Log in to your Google account when prompted.
  • If it asks you to choose a project, pick the default or first option.
  • If it asks whether you want to set a default region, hit enter to accept, then select any region from the list (you want a default region, but it doesn't matter which one).

Next, run the following command. The Cloud SDK needs you to log in a second time so that other applications can use your Google Cloud account:

gcloud auth application-default login

Point Replicate at Google Cloud

Then, create replicate.yaml with this content, replacing [your model name] with your model's name, or some other unique string:

repository: "gs://replicate-repository-[your model name]"

Log in to Amazon Web Services

First, you need to install and configure the Amazon Web Services CLI if you haven't already. The easiest way to do this on macOS is to install it with Homebrew:

  1. If you haven't already, install Homebrew.
  2. Run brew install awscli.
  3. If you don't already have an Amazon Web Services account, sign up for Amazon Web Services.
  4. Run aws configure and follow these instructions to configure it (see the example below). You will need to get an access key ID and a secret access key, and set the region to us-east-1 (unless you want to use a different region). You don't need to set an output format or profile.

If you're on another platform, follow these instructions to install the Amazon Web Services CLI.
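
For reference, an aws configure session looks roughly like this; the key values below are placeholders, and you can leave the output format blank:

    $ aws configure
    AWS Access Key ID [None]: AKIAXXXXXXXXXXXXXXXX
    AWS Secret Access Key [None]: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    Default region name [None]: us-east-1
    Default output format [None]: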

Point Replicate at Amazon S3

Then, create replicate.yaml with this content, replacing [your model name] with your model's name, or some other unique string:

repository: "s3://replicate-repository-[your model name]"

Now, when you run your training script, calling experiment.checkpoint() will upload your working directory and checkpoint metadata to this bucket.
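
To make this concrete, here is a minimal sketch of what a training script using this API might look like. It follows the general shape of the tutorial's train.py, but the parameters, metrics, and training loop below are placeholders rather than the tutorial's actual code:

    import replicate

    def train():
        # Create an experiment; its params are saved to the repository
        experiment = replicate.init(
            path=".", params={"learning_rate": 0.01, "num_epochs": 10}
        )

        for epoch in range(10):
            # ... run one epoch of training here ...
            loss = 1.0 / (epoch + 1)  # placeholder metric

            # Save the working directory and these metrics as a checkpoint.
            # With replicate.yaml pointing at a bucket, this is the call that
            # uploads to cloud storage.
            experiment.checkpoint(path=".", step=epoch, metrics={"loss": loss})

    if __name__ == "__main__":
        train()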

If you're following along from the tutorial, run python train.py again. This time it will save your experiments to cloud storage. (It takes a second to save each checkpoint, so press Ctrl-C after a few epochs if you don't want to wait.)

Now, when you run replicate ls, it will list experiments from the bucket.
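
If you want to double-check that data is actually landing in the bucket, you can also list it directly with your cloud provider's CLI, using whichever bucket name you put in replicate.yaml. For example:

    gsutil ls gs://replicate-repository-[your model name]

or, if you're using Amazon S3:

    aws s3 ls s3://replicate-repository-[your model name]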

Migrate data

If you switch from local file storage to remote cloud storage, you will no longer see your local experiments. Unlike Git, which stores data both locally and remotely, Replicate only stores data in one location.

It is easy to migrate locations, if you ever need to, because Replicate just stores its data as plain files. To migrate from local file storage to the cloud, just copy the .replicate directory, or wherever you store your data, to your Amazon S3 or Google Cloud Storage bucket.
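
For example, assuming your local data is in the default .replicate directory and your bucket matches the replicate.yaml examples above, something like one of these commands should do it:

    # Google Cloud Storage
    gsutil -m rsync -r .replicate gs://replicate-repository-[your model name]

    # Amazon S3
    aws s3 sync .replicate s3://replicate-repository-[your model name]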

What's next

You might want to take a look at:

Let’s build this together

Everyone uses version control for software, but it’s much less common in machine learning.

This causes all sorts of problems: people are manually keeping track of things in spreadsheets, model weights are scattered on S3, and results can’t be reproduced. Somebody who wrote a model has left the team? Bad luck – nothing’s written down and you’ve probably got to start from scratch.

So why isn’t everyone using Git? Git doesn’t work well with machine learning. It can’t handle large files, it can’t handle key/value metadata like metrics, and it can’t record information automatically from inside a training script. There are some solutions for these things, but they feel like band-aids.

We spent a year talking to people in the ML community about this, and this is what we found out:

  • We need a native version control system for ML. It’s sufficiently different to normal software that we can’t just put band-aids on existing systems.
  • It needs to be small, easy to use, and extensible. We found people struggling to migrate to “AI Platforms”. We believe tools should do one thing well and combine with other tools to produce the system you need.
  • It needs to be open source. There are a number of proprietary solutions, but something so foundational needs to be built by and for the ML community.

We need your help to make this a reality. If you’ve built this for yourself, or are just interested in this problem, join us to help build a better system for everyone.

Join our Discord chat, or get involved on GitHub.



Core team

Ben Firshman

Product at Docker, creator of Docker Compose.

Andreas Jansson

ML infrastructure and research at Spotify.

We also built arXiv Vanity, which lets you read arXiv papers as responsive web pages.
