added Country211 and Rendered SST2 dataset info
jongwook committed Sep 24, 2021
1 parent c13005f commit efe8cbb
Showing 3 changed files with 24 additions and 1 deletion.
12 changes: 12 additions & 0 deletions data/country211.md
@@ -0,0 +1,12 @@
# The Country211 Dataset

In the paper, we used an image classification dataset called Country211 to evaluate the model's geolocation capability. To build it, we filtered the YFCC100m dataset for images that have GPS coordinates corresponding to an [ISO-3166 country code](https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes) and created a balanced dataset by sampling 150 train images, 50 validation images, and 100 test images for each country.

The following commands download an 11GB archive containing the images and extract it into a subdirectory `country211`:

```bash
wget https://openaipublic.azureedge.net/clip/data/country211.tgz
tar zxvf country211.tgz
```
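
As a sanity check after extraction, the per-country balance described above can be verified with a short script. This is only a sketch: the directory layout `country211/<split>/<country code>/<image file>` and the split names `train`, `valid`, and `test` are assumptions not stated in this document, and the helper names `count_images` and `unbalanced_countries` are hypothetical.

```python
from collections import Counter
from pathlib import Path

# Assumed layout (not confirmed here): country211/<split>/<country code>/<image file>
# The expected counts come from the sampling described in the text.
EXPECTED_PER_COUNTRY = {"train": 150, "valid": 50, "test": 100}


def count_images(root, split):
    """Return a Counter mapping each country directory name to its number of files."""
    counts = Counter()
    split_dir = Path(root) / split
    for country_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        counts[country_dir.name] = sum(1 for f in country_dir.iterdir() if f.is_file())
    return counts


def unbalanced_countries(root, expected=EXPECTED_PER_COUNTRY):
    """Return (split, country, count) triples whose count differs from the expected value."""
    problems = []
    for split, want in expected.items():
        for country, n in count_images(root, split).items():
            if n != want:
                problems.append((split, country, n))
    return problems


if __name__ == "__main__":
    root = Path("country211")
    if root.is_dir():
        print(unbalanced_countries(root) or "all splits balanced")
```

If the archive uses different split names, adjust `EXPECTED_PER_COUNTRY` accordingly; the script raises `FileNotFoundError` when a split directory is missing, which makes a layout mismatch visible.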

These images are a subset of the YFCC100m dataset. Use of the underlying media files is subject to the Creative Commons licenses chosen by their creators/uploaders. For more information about the YFCC100M dataset, visit [the official website](https://multimediacommons.wordpress.com/yfcc100m-core-dataset/).
11 changes: 11 additions & 0 deletions data/rendered-sst2.md
@@ -0,0 +1,11 @@
# The Rendered SST2 Dataset

In the paper, we used an image classification dataset called Rendered SST2 to evaluate the model's optical character recognition capability. To build it, we rendered the sentences in the [Stanford Sentiment Treebank v2](https://nlp.stanford.edu/sentiment/treebank.html) dataset as images and used those as the input to the CLIP image encoder.

The following commands download a 131MB archive containing the images and extract it into a subdirectory `rendered-sst2`:

```bash
wget https://openaipublic.azureedge.net/clip/data/rendered-sst2.tgz
tar zxvf rendered-sst2.tgz
```
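
Once extracted, the images can be paired with their sentiment labels by walking the directory tree. The sketch below assumes a layout of `rendered-sst2/<split>/<label>/<image file>`, where the label is the name of the parent directory; that layout is an assumption, not something this document states, and `list_examples` is a hypothetical helper.

```python
from pathlib import Path


def list_examples(root, split):
    """Return sorted (image path, label) pairs for one split.

    Assumes <root>/<split>/<label>/<image file>; the label is taken
    from the name of the directory that contains the image.
    """
    split_dir = Path(root) / split
    return sorted(
        (path, path.parent.name)
        for path in split_dir.glob("*/*")
        if path.is_file()
    )


if __name__ == "__main__":
    root = Path("rendered-sst2")
    if root.is_dir():
        # "train" is an assumed split name.
        for path, label in list_examples(root, "train")[:5]:
            print(label, path)
```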

2 changes: 1 addition & 1 deletion data/yfcc100m.md
@@ -6,7 +6,7 @@ The subset contains 14,829,396 images, about 15% of the full dataset, which have

We provide a list of (line number, photo identifier, photo hash) entries, one for each image contained in this subset. These correspond to the first three columns in the dataset's metadata TSV file.

```bash
wget https://openaipublic.azureedge.net/clip/data/yfcc100m_subset_data.tsv.bz2
bunzip2 yfcc100m_subset_data.tsv.bz2
```
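
The list can also be read programmatically. The following is a minimal sketch that assumes the file is tab-separated with exactly the three columns described above, has no header row, and stores the line number as an integer; `iter_subset_rows` is a hypothetical helper. It accepts either the compressed `.bz2` file or the decompressed `.tsv`.

```python
import bz2
import csv
import os
from itertools import islice


def iter_subset_rows(path):
    """Yield (line_number, photo_id, photo_hash) tuples from the subset list.

    Assumes tab-separated rows with at least three columns and no header.
    Files ending in .bz2 are decompressed on the fly.
    """
    opener = bz2.open if str(path).endswith(".bz2") else open
    with opener(path, "rt", encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) < 3:
                continue  # skip blank or malformed lines
            line_number, photo_id, photo_hash = row[:3]
            yield int(line_number), photo_id, photo_hash


if __name__ == "__main__":
    path = "yfcc100m_subset_data.tsv"
    if os.path.exists(path):
        # Print only the first few entries; the full list has millions of rows.
        for entry in islice(iter_subset_rows(path), 5):
            print(entry)
```

Streaming the rows with a generator avoids loading all entries into memory at once.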