
Merge branch 'OFA-Sys:main' into feature/vqa
yangapku authored Sep 20, 2022
2 parents c8740d7 + bb8e2c8 commit cf0faff
Showing 19 changed files with 1,031 additions and 81 deletions.
54 changes: 17 additions & 37 deletions README.md
@@ -10,24 +10,11 @@ This source code is licensed under the Apache 2.0 license found in the LICENSE f
<br>
<p>
<br>

<p align="center">
<a href="https://github.com/huggingface/transformers/blob/master/LICENSE">
<img alt="GitHub" src="https://img.shields.io/github/license/huggingface/transformers.svg?color=blue">
</a>
<a href="https://huggingface.co/ofa-sys">
<img alt="spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
</a>
<a href="colab.md"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="DOI"></a>
<a href="modelscope.md">ModelScope</a>&nbsp | &nbsp<a href="checkpoints.md">Checkpoints</a>&nbsp | &nbsp<a href="colab.md">Colab</a>&nbsp | &nbsp<a href="https://huggingface.co/ofa-sys">Demo</a>&nbsp | &nbsp<a href="http://arxiv.org/abs/2202.03052">Paper </a>&nbsp | &nbspBlog
</p>

<h4 align="center">
<p>
<a href="http://arxiv.org/abs/2202.03052">Paper</a> |
<b>Blog</b>
<p>
</h4>
<br></br>

<p align="center">
<br>
<img src="examples/demo.gif" width="800" />
@@ -36,49 +23,42 @@ This source code is licensed under the Apache 2.0 license found in the LICENSE f

[colab]: <https://colab.research.google.com/assets/colab-badge.svg>

OFA is a unified sequence-to-sequence pretrained model that unifies modalities (i.e., cross-modality, vision, language) and tasks (**finetuning** and **prompt tuning** are supported):
* **Image Captioning** (e.g., Microsoft COCO Caption, see [Leaderboard](https://competitions.codalab.org/competitions/3221#results))
* **Visual Question Answering** (e.g., [VQA 2.0](https://eval.ai/web/challenges/challenge-page/830/leaderboard/2278))
* **Referring Expression Comprehension** (e.g., [RefCOCO](https://paperswithcode.com/sota/referring-expression-comprehension-on-refcoco), [RefCOCO+](https://paperswithcode.com/sota/referring-expression-comprehension-on-refcoco-1), and [RefCOCOg](https://paperswithcode.com/sota/referring-expression-comprehension-on-1))
* **Visual Entailment** (e.g., [SNLI-VE](https://paperswithcode.com/sota/visual-entailment-on-snli-ve-test))
* **Text-to-Image Generation** (e.g., MSCOCO)
* **Text Classification** (e.g., GLUE) and **Text Generation** (e.g., [text summarization](https://paperswithcode.com/sota/text-summarization-on-gigaword))
* **Image Classification** (e.g., [ImageNet](https://paperswithcode.com/sota/self-supervised-image-classification-on-1))
* ......

In this doc, we provide:
* **Step-by-step** instructions for **pretraining** and **finetuning** (including almost **all tasks** presented in the paper);
* **Pretrained** and **finetuned** checkpoints (check [official ckpt](checkpoints.md) or [huggingface ckpt](https://huggingface.co/OFA-Sys) for what you need), and model cards with experimental results;
* ......
OFA is a unified sequence-to-sequence pretrained model (supporting **English** and **Chinese**) that unifies modalities (i.e., cross-modality, vision, language) and tasks (**finetuning** and **prompt tuning** are supported): image captioning (1st at the [MSCOCO Leaderboard](https://competitions.codalab.org/competitions/3221#results)), VQA ([link](https://eval.ai/web/challenges/challenge-page/830/leaderboard/2278)), visual grounding, text-to-image generation, text classification, text generation, image classification, etc. We provide **step-by-step** instructions for pretraining and finetuning and corresponding checkpoints (check official ckpt \[[EN](checkpoints.md)|[CN](checkpoints_cn.md)\] or [huggingface ckpt](https://huggingface.co/OFA-Sys)).

We sincerely welcome contributions to our project. Feel free to contact us or send us issues / PRs!
<br></br>


# Online Demos
We provide online demos via ModelScope and Hugging Face Spaces for you to interact with our pretrained and finetuned models. Below are the links to the demos:
* [Image Captioning](https://huggingface.co/spaces/OFA-Sys/OFA-Image_Caption)
* [Visual Grounding](https://huggingface.co/spaces/OFA-Sys/OFA-Visual_Grounding)
* [Visual Question Answering](https://huggingface.co/spaces/OFA-Sys/OFA-Visual_Question_Answering)
* [Text-to-Image Generation](https://huggingface.co/spaces/OFA-Sys/OFA-Text2Image_Generation)
* [Generic Interface](https://huggingface.co/spaces/OFA-Sys/OFA-Generic_Interface)
* Image Captioning \[[ModelScope](https://modelscope.cn/#/models/damo/ofa_image-caption_coco_large_en/summary) | [Spaces](https://huggingface.co/spaces/OFA-Sys/OFA-Image_Caption)\]
* Visual Grounding \[[ModelScope](https://modelscope.cn/#/models/damo/ofa_visual-grounding_refcoco_large_en/summary) | [Spaces](https://huggingface.co/spaces/OFA-Sys/OFA-Visual_Grounding)\]
* Visual Question Answering \[[ModelScope](https://modelscope.cn/#/models/damo/ofa_visual-question-answering_pretrain_large_en/summary) | [Spaces](https://huggingface.co/spaces/OFA-Sys/OFA-Visual_Question_Answering)\]
* Text-to-Image Generation \[[ModelScope](https://modelscope.cn/#/models/damo/ofa_text-to-image-synthesis_coco_large_en/summary) | [Spaces](https://huggingface.co/spaces/OFA-Sys/OFA-Text2Image_Generation)\]
* Generic Interface \[[Spaces](https://huggingface.co/spaces/OFA-Sys/OFA-Generic_Interface)\]

We also provide Colab notebooks so you can get a better sense of the procedures. Click [here](colab.md) to check them out!
<br></br>

# Use in Huggingface Transformers
We support inference with OFA in Huggingface Transformers. Check the [README](transformers.md) and [Colab Notebook](https://colab.research.google.com/drive/1Ho81RBV8jysZ7e0FhsSCk_v938QeDuy3?usp=sharing) for more information. The code is released in the branch https://github.com/OFA-Sys/OFA/tree/feature/add_transformers.
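
For orientation, here is a minimal captioning sketch in the spirit of that branch; the `OFATokenizer`/`OFAModel` classes come from the `feature/add_transformers` branch, while the checkpoint directory, image path, and 480×480 resolution below are placeholder assumptions you should adapt to your model size and setup:

```python
from PIL import Image
from torchvision import transforms
from transformers import OFATokenizer, OFAModel  # available on the feature/add_transformers branch

ckpt_dir = "OFA-large"  # placeholder: local directory of a converted checkpoint
tokenizer = OFATokenizer.from_pretrained(ckpt_dir)
model = OFAModel.from_pretrained(ckpt_dir, use_cache=False)

# Image preprocessing; resolution and normalization follow the public demos and may differ per model size.
patch_transform = transforms.Compose([
    transforms.Resize((480, 480), interpolation=Image.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
patch_img = patch_transform(Image.open("example.jpg").convert("RGB")).unsqueeze(0)

# Instruction-style prompt for captioning.
inputs = tokenizer([" what does the image describe?"], return_tensors="pt").input_ids

gen = model.generate(inputs, patch_images=patch_img, num_beams=5, no_repeat_ngram_size=3)
print(tokenizer.batch_decode(gen, skip_special_tokens=True))
```

If the output looks off, double-check that the checkpoint you load matches the image resolution and instruction you use.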
<br><br>


# News
* 2022.8.5: Released support of **prompt tuning** for OFA (temporarily maintained at `feature/prompt_tuning`). Check our paper [here](https://arxiv.org/abs/2208.02532)!
# News
* 2022.8.16: Released the **Chinese** version of OFA. To use **OFA-CN**, simply switch to `bpe_dir=../../utils/BERT_CN_dict` and `bpe=bert` and use our Chinese checkpoints in [checkpoints_cn.md](checkpoints_cn.md). For now, we only provide base-size and large-size pretrained checkpoints, as well as checkpoints finetuned on [MUGE Caption](https://tianchi.aliyun.com/muge) and the Chinese version of RefCOCO(-/+/g) (to be released soon).
* 2022.8.5: Released support of **prompt tuning** for OFA. Check our paper [here](https://arxiv.org/abs/2208.02532)! See [prompt_tuning.md](prompt_tuning.md) for further details.
* 2022.7.7: Updated support of OFA on **huggingface transformers** (fixed bugs in forward, added the sequence generator from Fairseq to ensure performance, etc.). Refer to the doc [transformers.md](transformers.md) and the branch `feature/add_transformers`.
* 2022.6.17: Released the pretrained checkpoint of **OFA-Huge**. To use it, set `--arch=ofa_huge` in the script.
* 2022.5.15: OFA was accepted by **ICML 2022**.
* 2022.4.28: Added support for inference with **huggingface transformers**. For how to use it, please refer to the doc [transformers.md](transformers.md) and our [huggingface models](https://huggingface.co/OFA-Sys).
* 2022.4.16: Released lightweight pretrained models **OFA-Medium** (~93M params) and **OFA-Tiny** (~33M params) in [checkpoints.md](checkpoints.md). To use them, you just need to load the corresponding checkpoint and set `--arch=ofa_medium` or `--arch=ofa_tiny` in the scripts.
* 2022.3.23: Added [Encouraging Loss](https://arxiv.org/pdf/2110.06537.pdf) as a feature. See [README_EncouragingLoss.md](README_EncouragingLoss.md). Leveraging this feature, OFA-Large has achieved improved results in both VQA (**test-std acc: 80.67**) and Image Classification (**test acc: 85.6**) recently.
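
To give a rough sense of how Encouraging Loss differs from plain cross-entropy, below is a hedged sketch of the basic form as we read the paper: cross-entropy plus a `log(1 - p)` bonus on the gold class, so gradients do not vanish for well-classified tokens. The exact formulation and its conservative (bounded-bonus) variant are described in [README_EncouragingLoss.md](README_EncouragingLoss.md); treat that document as authoritative.

```python
import torch
import torch.nn.functional as F

def encouraging_loss(logits: torch.Tensor, target: torch.Tensor, bonus_weight: float = 1.0) -> torch.Tensor:
    """Sketch: cross-entropy plus an 'encouraging' bonus log(1 - p) on the gold class.

    The bonus gradient w.r.t. the gold logit is -p, so confident correct predictions
    keep receiving signal instead of saturating. The clamp bounds the bonus near p = 1,
    loosely playing the role of the paper's conservative variant.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_probs, target, reduction="none")                # standard cross-entropy
    p_gold = log_probs.exp().gather(1, target.unsqueeze(1)).squeeze(1)  # probability of the gold class
    bonus = torch.log((1.0 - p_gold).clamp(min=1e-6))                   # negative "reward" term
    return (ce + bonus_weight * bonus).mean()

# Toy usage
loss = encouraging_loss(torch.randn(8, 100), torch.randint(0, 100, (8,)))
```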

<details>
<summary><b>More News</b></summary>
<p>
<ul>
<li>2022.3.23: Added <a href="https://arxiv.org/pdf/2110.06537.pdf">Encouraging Loss</a> as a feature. See <a href="README_EncouragingLoss.md">README_EncouragingLoss.md</a>. Leveraging this feature, OFA-Large has achieved improved results in both VQA (<b>test-std acc: 80.67</b>) and Image Classification (<b>test acc: 85.6</b>) recently.</li>
<li>2022.3.21: Released codes for pretraining OFA.</li>
<li>2022.3.18: Released the finetuned <b>OFA-Base</b> (~180M parameters) checkpoints and running scripts for vision & language tasks, including: <b>Caption (146.4 CIDEr), VQA (78.07 on test-std), SNLI-VE (89.3 on dev), RefCOCO (90.67 on testA), RefCOCO+ (87.15 on testA) and RefCOCOg (82.31 on test-u)</b>.</li>
<li>2022.3.11: Released the finetuning & inference code/checkpoints for <b>Gigaword</b>.</li>
82 changes: 82 additions & 0 deletions checkpoints_cn.md
@@ -0,0 +1,82 @@
# Checkpoints (OFA-CN)

We provide checkpoints of OFA-CN, the Chinese version of OFA, in Base and Large sizes, including both pretrained models and models finetuned on image captioning and referring expression comprehension. Note that we translated the texts of the RefCOCO(-/+/g) datasets and finetuned OFA-CN on them. We plan to release these new datasets in the near future.
<br>

## Checkpoints
Below we provide the links for downloading the Chinese OFA checkpoints. A minimal loading sketch follows the lists.

### Pretraining
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/ofa_cn_large.pt"> Pretrained checkpoint (OFA-CN-Large) </a> (~443M parameters)
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/ofa_cn_base.pt "> Pretrained checkpoint (OFA-CN-Base) </a> (~160M parameters)

### Finetuning (OFA-Large)
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/caption_cn_large.pt"> Finetuned checkpoint for MUGE Caption (Stage 1) </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/refcoco_cn_large.pt"> Finetuned checkpoint for RefCOCO-CN </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/refcocoplus_cn_large.pt"> Finetuned checkpoint for RefCOCO+-CN </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/refcocog_cn_large.pt"> Finetuned checkpoint for RefCOCOg-CN </a>

### Finetuning (OFA-Base)
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/caption_cn_base.pt"> Finetuned checkpoint for MUGE Caption (Stage 1) </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/refcoco_cn_base.pt"> Finetuned checkpoint for RefCOCO-CN </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/refcocoplus_cn_base.pt"> Finetuned checkpoint for RefCOCO+-CN </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/refcocog_cn_base.pt"> Finetuned checkpoint for RefCOCOg-CN </a>
<br>
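
Here is a minimal sketch of loading one of the checkpoints above with the repository's fairseq-style `checkpoint_utils` (the same module imported by `evaluate.py`); the local path and the OFA-CN overrides (`bpe=bert`, `bpe_dir` under `utils/BERT_CN_dict`) are assumptions to adapt to your setup:

```python
import torch
from utils import checkpoint_utils  # OFA's fairseq-style checkpoint utilities

ckpt_path = "checkpoints/ofa_cn_large.pt"                     # placeholder local path to a downloaded checkpoint
overrides = {"bpe": "bert", "bpe_dir": "utils/BERT_CN_dict"}  # Chinese BPE, per the README note on OFA-CN

# Returns the model ensemble, the saved config, and the task built from the checkpoint.
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task([ckpt_path], arg_overrides=overrides)
model = models[0].eval()
if torch.cuda.is_available():
    model = model.cuda()
print(f"loaded {sum(p.numel() for p in model.parameters()):,} parameters")
```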

## Model Card
Below we provide the basic information of the base-size and large-size OFA-CN.

<table border="1" width="100%">
<tr align="center">
<th>Model</th><th>#Params</th><th>Backbone</th><th>Hidden Size</th><th>Intermediate Size</th><th>#Heads</th><th>#Enc. Layers</th><th>#Dec. Layers</th>
</tr>
<tr align="center">
<td>OFA<sub>Base</sub></td><td>160M</td><td>ResNet101</td><td>768</td><td>3072</td><td>12</td><td>6</td><td>6</td>
</tr>
<tr align="center">
<td>OFA<sub>Large</sub></td><td>443M</td><td>ResNet152</td><td>1024</td><td>4096</td><td>16</td><td>12</td><td>12</td>
</tr>
</table>
<br>

## Results
Below we provide the results of OFA-CN and the baselines for comparison.

### [MUGE Caption](https://tianchi.aliyun.com/muge)
<table border="1" width="100%">
<tr align="center">
<td>Model</td><td>BLEU@4</td><td>ROUGE-L</td><td>CIDEr-D</td>
</tr>
<tr align="center">
<td>Trm </td><td>7.33</td><td>51.51</td><td>11.00</td>
</tr>
<tr align="center">
<td>M6</td><td>16.19</td><td>55.06</td><td>30.75</td>
</tr>
<tr align="center">
<td>OFA<sub>Base</sub></td><td>26.23</td><td>58.95</td><td>50.70</td>
</tr>
<tr align="center">
<td>OFA<sub>Large</sub></td><td><b>27.32</b></td><td><b>59.20</b></td><td><b>53.51</b></td>
</tr>
</table>

### RefCOCO-CN Series
<table border="1" width="100%">
<tr align="center">
<td>Model</td><td>RefCOCO(val/testA/testB)</td><td>RefCOCO+(val/testA/testB)</td><td>RefCOCOg(val/test-u)</td>
</tr>
<tr align="center">
<td>OFA<sub>Base</sub>(random-init)</td><td>30.13/35.07/25.03</td><td>17.89/20.90/15.83</td><td>20.30/20.45</td>
</tr>
<tr align="center">
<td>OFA<sub>Base</sub></td><td>82.18/86.07/<b>76.68</b></td><td>69.38/77.26/60.14</td><td><b>73.57/72.53</b></td>
</tr>
<tr align="center">
<td>OFA<sub>Large</sub></td><td><b>82.84/86.54</b>/76.50</td><td><b>71.30/78.56/61.85</b></td><td>71.96/71.30</td>
</tr>
</table>
<br>


3 changes: 2 additions & 1 deletion colab.md
@@ -2,7 +2,8 @@

We provide Colab notebooks for different downstream tasks so you can try out OFA. See below.

* [Image Captioning in Huggingface Transformers](https://colab.research.google.com/drive/1Ho81RBV8jysZ7e0FhsSCk_v938QeDuy3?usp=sharing)
* [Generic Interface](https://colab.research.google.com/drive/1jogyZ-2rdHU3XxZOf3TBfhex1XHqX-1m?usp=sharing#scrollTo=s9Vni6YUZOpC) (using different instructions to perform various tasks with just one model)
* [Image Captioning](https://colab.research.google.com/drive/1Q4eNhhhLcgOP4hHqwZwU1ijOlabgve1W?usp=sharing)
* [Referring Expression Comprehension](https://colab.research.google.com/drive/1AHQNRdaUpRTgr3XySHSlba8aXwBAjwPB?usp=sharing)
* [Open-Domain Visual Question Answering](https://colab.research.google.com/drive/14v6OQe_MxV_HMnsiKfnEeMR1UMqhzZNb?usp=sharing)
* [Open-Domain Visual Question Answering](https://colab.research.google.com/drive/1lsMsF-Vum3MVyXwSVF5E-Y23rHFvj_3y?usp=sharing)
6 changes: 5 additions & 1 deletion evaluate.py
@@ -18,6 +18,7 @@

from utils import checkpoint_utils
from utils.eval_utils import eval_step, merge_results
from utils.zero_shot_utils import zero_shot_step

logging.basicConfig(
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
@@ -131,7 +132,10 @@ def main(cfg: DictConfig, **kwargs):
        sample = utils.move_to_cuda(sample) if use_cuda else sample
        sample = utils.apply_to_sample(apply_half, sample) if cfg.common.fp16 else sample
        with torch.no_grad():
            result, scores = eval_step(task, generator, models, sample, **kwargs)
            if kwargs["zero_shot"]:
                result, scores = zero_shot_step(task, generator, models, sample)
            else:
                result, scores = eval_step(task, generator, models, sample, **kwargs)
        results += result
        score_sum += sum(scores) if scores is not None else 0
        score_cnt += len(scores) if scores is not None else 0
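
The new branch can be read as a small dispatch helper; the sketch below is illustrative only (the wrapper name `run_inference_step` is hypothetical) and reuses the imports and call signatures shown in this hunk:

```python
from utils.eval_utils import eval_step
from utils.zero_shot_utils import zero_shot_step

def run_inference_step(task, generator, models, sample, zero_shot=False, **kwargs):
    """Route a batch through zero-shot inference or the standard task-specific eval step."""
    if zero_shot:
        return zero_shot_step(task, generator, models, sample)
    return eval_step(task, generator, models, sample, **kwargs)
```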
2 changes: 1 addition & 1 deletion models/ofa/ofa.py
@@ -425,7 +425,7 @@ def ofa_medium_architecture(args):


@register_model_architecture("ofa", "ofa_tiny")
def ofa_medium_architecture(args):
def ofa_tiny_architecture(args):
    args.encoder_embed_dim = getattr(args, "encoder_embed_dim", 256)
    args.encoder_ffn_embed_dim = getattr(args, "encoder_ffn_embed_dim", 4 * 256)
    args.encoder_layers = getattr(args, "encoder_layers", 4)
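
The architecture functions above rely on the `getattr`-with-default idiom: every hyperparameter falls back to the architecture's default unless the user already set it on `args`. A tiny self-contained sketch (hypothetical names, standard library only) of how that composes:

```python
from argparse import Namespace

def toy_tiny_architecture(args: Namespace) -> None:
    # Mirrors the registered architecture functions: defaults apply only when unset.
    args.encoder_embed_dim = getattr(args, "encoder_embed_dim", 256)
    args.encoder_ffn_embed_dim = getattr(args, "encoder_ffn_embed_dim", 4 * 256)
    args.encoder_layers = getattr(args, "encoder_layers", 4)

args = Namespace(encoder_layers=6)  # user override survives
toy_tiny_architecture(args)
print(args.encoder_embed_dim, args.encoder_ffn_embed_dim, args.encoder_layers)  # 256 1024 6
```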
28 changes: 17 additions & 11 deletions models/ofa/unify_multihead_attention.py
@@ -127,7 +127,8 @@ def forward(
        self_attn_mask: Optional[Tensor] = None,
        before_softmax: bool = False,
        need_head_weights: bool = False,
        attn_bias: Optional[Tensor] = None
        attn_bias: Optional[Tensor] = None,
        prompt_kv: Optional[Tensor] = None
    ) -> Tuple[Tensor, Optional[Tensor]]:
        """Input shape: Time x Batch x Channel
@@ -314,7 +315,7 @@ def forward(

        if key_padding_mask is not None:
            assert key_padding_mask.size(0) == bsz
            assert key_padding_mask.size(1) == src_len
            assert key_padding_mask.size(1) == k.size(1)

        if self.add_zero_attn:
            assert v is not None
@@ -335,14 +336,19 @@
                    ],
                    dim=1,
                )

        if prompt_kv is not None:
            prompt_k, prompt_v = prompt_kv.split(1)
            prompt_k = prompt_k.squeeze(0).reshape(k.size(0), -1, k.size(2))
            prompt_v = prompt_v.squeeze(0).reshape(v.size(0), -1, v.size(2))
            k = torch.cat([prompt_k, k], dim=1)
            v = torch.cat([prompt_v, v], dim=1)
        attn_weights = torch.bmm(q, k.transpose(1, 2))
        attn_weights = self.apply_sparse_mask(attn_weights, tgt_len, src_len, bsz)
        attn_weights = self.apply_sparse_mask(attn_weights, tgt_len, k.size(1), bsz)

        assert list(attn_weights.size()) == [bsz * self.num_heads, tgt_len, src_len]
        assert list(attn_weights.size()) == [bsz * self.num_heads, tgt_len, k.size(1)]

        if attn_bias is not None:
            attn_weights += attn_bias
            attn_weights[:, :, -src_len:] += attn_bias[:, :, -src_len:]

        if attn_mask is not None:
            attn_mask = attn_mask.unsqueeze(0)
@@ -351,12 +357,12 @@
            attn_weights += attn_mask

        if self_attn_mask is not None:
            self_attn_mask = self_attn_mask.unsqueeze(1).expand(bsz, self.num_heads, tgt_len, src_len)
            attn_weights += self_attn_mask.contiguous().view(bsz * self.num_heads, tgt_len, src_len)
            self_attn_mask = self_attn_mask.unsqueeze(1).expand(bsz, self.num_heads, tgt_len, k.size(1))
            attn_weights += self_attn_mask.contiguous().view(bsz * self.num_heads, tgt_len, k.size(1))

        if key_padding_mask is not None:
            # don't attend to padding symbols
            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, k.size(1))
            if not is_tpu:
                attn_weights = attn_weights.masked_fill(
                    key_padding_mask.unsqueeze(1).unsqueeze(2).to(torch.bool),
@@ -366,7 +372,7 @@
                attn_weights = attn_weights.transpose(0, 2)
                attn_weights = attn_weights.masked_fill(key_padding_mask, float("-inf"))
                attn_weights = attn_weights.transpose(0, 2)
            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)
            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, k.size(1))

        if before_softmax:
            return attn_weights, v
@@ -394,7 +400,7 @@ def forward(
        attn_weights: Optional[Tensor] = None
        if need_weights:
            attn_weights = attn_weights_float.view(
                bsz, self.num_heads, tgt_len, src_len
                bsz, self.num_heads, tgt_len, k.size(1)
            ).transpose(1, 0)
            if not need_head_weights:
                # average attention weights over heads
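
The `prompt_kv` additions above prepend learned prompt keys/values to the projected `k`/`v` (prefix-style prompt tuning), which is why the sparse mask, the shape assert, the self-attention/padding masks, and the weight views all switch from `src_len` to `k.size(1)`, and why `attn_bias` is added only to the trailing `src_len` positions. A standalone sketch of that shape bookkeeping with plain `torch` and made-up dimensions:

```python
import torch

bsz_heads, tgt_len, src_len, head_dim, prompt_len = 16, 5, 7, 64, 4

q = torch.randn(bsz_heads, tgt_len, head_dim)
k = torch.randn(bsz_heads, src_len, head_dim)
v = torch.randn(bsz_heads, src_len, head_dim)

# Stacked prompt tensor: dim 0 holds (key, value), mirroring prompt_kv.split(1) above.
prompt_kv = torch.randn(2, bsz_heads, prompt_len, head_dim)
prompt_k, prompt_v = prompt_kv.split(1)
prompt_k = prompt_k.squeeze(0).reshape(k.size(0), -1, k.size(2))
prompt_v = prompt_v.squeeze(0).reshape(v.size(0), -1, v.size(2))

# Prepend prompts: the key/value length grows from src_len to prompt_len + src_len.
k = torch.cat([prompt_k, k], dim=1)
v = torch.cat([prompt_v, v], dim=1)

attn_weights = torch.bmm(q, k.transpose(1, 2))
assert list(attn_weights.size()) == [bsz_heads, tgt_len, k.size(1)]

# A bias computed for the original source positions applies only to the trailing src_len slots.
attn_bias = torch.zeros(bsz_heads, tgt_len, src_len)
attn_weights[:, :, -src_len:] += attn_bias

out = torch.bmm(attn_weights.softmax(dim=-1), v)  # (bsz_heads, tgt_len, head_dim)
print(out.shape)
```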