
Merge branch 'OFA-Sys:main' into feature/vqa
yangapku authored Sep 20, 2022
2 parents c8740d7 + bb8e2c8 commit cf0faff
Showing 19 changed files with 1,031 additions and 81 deletions.
54 changes: 17 additions & 37 deletions README.md
@@ -10,24 +10,11 @@ This source code is licensed under the Apache 2.0 license found in the LICENSE f
<br>
<p>
<br>

<p align="center">
<a href="https://github.com/huggingface/transformers/blob/master/LICENSE">
<img alt="GitHub" src="https://img.shields.io/github/license/huggingface/transformers.svg?color=blue">
</a>
<a href="https://huggingface.co/ofa-sys">
<img alt="spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
</a>
<a href="colab.md"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="DOI"></a>
<a href="modelscope.md">ModelScope</a>&nbsp | &nbsp<a href="checkpoints.md">Checkpoints</a>&nbsp | &nbsp<a href="colab.md">Colab</a>&nbsp | &nbsp<a href="https://huggingface.co/ofa-sys">Demo</a>&nbsp | &nbsp<a href="http://arxiv.org/abs/2202.03052">Paper </a>&nbsp | &nbspBlog
</p>

<h4 align="center">
<p>
<a href="http://arxiv.org/abs/2202.03052">Paper</a> |
<b>Blog</b>
<p>
</h4>
<br></br>

<p align="center">
<br>
<img src="examples/demo.gif" width="800" />
@@ -36,49 +23,42 @@ This source code is licensed under the Apache 2.0 license found in the LICENSE f

[colab]: <https://colab.research.google.com/assets/colab-badge.svg>

OFA is a unified sequence-to-sequence pretrained model that unifies modalities (i.e., cross-modality, vision, language) and tasks (**finetuning** and **prompt tuning** are supported):
* **Image Captioning** (e.g., Microsoft COCO Caption, see [Leaderboard](https://competitions.codalab.org/competitions/3221#results))
* **Visual Question Answering** (e.g., [VQA 2.0](https://eval.ai/web/challenges/challenge-page/830/leaderboard/2278))
* **Referring Expression Comprehension** (e.g., [RefCOCO](https://paperswithcode.com/sota/referring-expression-comprehension-on-refcoco), [RefCOCO+](https://paperswithcode.com/sota/referring-expression-comprehension-on-refcoco-1), and [RefCOCOg](https://paperswithcode.com/sota/referring-expression-comprehension-on-1))
* **Visual Entailment** (e.g., [SNLI-VE](https://paperswithcode.com/sota/visual-entailment-on-snli-ve-test))
* **Text-to-Image Generation** (e.g., MSCOCO)
* **Text Classification** (e.g., GLUE) and **Text Generation** (e.g., [text summarization](https://paperswithcode.com/sota/text-summarization-on-gigaword))
* **Image Classification** (e.g., [ImageNet](https://paperswithcode.com/sota/self-supervised-image-classification-on-1))
* ......

In this doc, we provide:
* **Step-by-step** instructions for **pretraining** and **finetuning** (including almost **all tasks** presented in the paper);
* **Pretrained** and **finetuned** checkpoints (check [official ckpt](checkpoints.md) or [huggingface ckpt](https://huggingface.co/OFA-Sys) for what you need), and model cards with experimental results;
* ......
OFA is a unified sequence-to-sequence pretrained model (supporting **English** and **Chinese**) that unifies modalities (i.e., cross-modality, vision, language) and tasks (**finetuning** and **prompt tuning** are supported): image captioning (1st at the [MSCOCO Leaderboard](https://competitions.codalab.org/competitions/3221#results)), VQA ([link](https://eval.ai/web/challenges/challenge-page/830/leaderboard/2278)), visual grounding, text-to-image generation, text classification, text generation, image classification, etc. We provide **step-by-step** instructions for pretraining and finetuning and corresponding checkpoints (check official ckpt \[[EN](checkpoints.md)|[CN](checkpoints_cn.md)\] or [huggingface ckpt](https://huggingface.co/OFA-Sys)).

We sincerely welcome contributions to our project. Feel free to contact us or send us issues / PRs!
<br></br>


# Online Demos
We provide online demos via ModelScope and Hugging Face Spaces for you to interact with our pretrained and finetuned models. Below are the links to the demos:
* [Image Captioning](https://huggingface.co/spaces/OFA-Sys/OFA-Image_Caption)
* [Visual Grounding](https://huggingface.co/spaces/OFA-Sys/OFA-Visual_Grounding)
* [Visual Question Answering](https://huggingface.co/spaces/OFA-Sys/OFA-Visual_Question_Answering)
* [Text-to-Image Generation](https://huggingface.co/spaces/OFA-Sys/OFA-Text2Image_Generation)
* [Generic Interface](https://huggingface.co/spaces/OFA-Sys/OFA-Generic_Interface)
* Image Captioning \[[ModelScope](https://modelscope.cn/#/models/damo/ofa_image-caption_coco_large_en/summary) | [Spaces](https://huggingface.co/spaces/OFA-Sys/OFA-Image_Caption)\]
* Visual Grounding \[[ModelScope](https://modelscope.cn/#/models/damo/ofa_visual-grounding_refcoco_large_en/summary) | [Spaces](https://huggingface.co/spaces/OFA-Sys/OFA-Visual_Grounding)\]
* Visual Question Answering \[[ModelScope](https://modelscope.cn/#/models/damo/ofa_visual-question-answering_pretrain_large_en/summary) | [Spaces](https://huggingface.co/spaces/OFA-Sys/OFA-Visual_Question_Answering)\]
* Text-to-Image Generation \[[ModelScope](https://modelscope.cn/#/models/damo/ofa_text-to-image-synthesis_coco_large_en/summary) | [Spaces](https://huggingface.co/spaces/OFA-Sys/OFA-Text2Image_Generation)\]
* Generic Interface \[[Spaces](https://huggingface.co/spaces/OFA-Sys/OFA-Generic_Interface)\]

We also provide Colab notebooks so you can get a better sense of the procedures. Click [here](colab.md) to check them out!
<br></br>

# Use in Huggingface Transformers
We support inference with OFA in Huggingface Transformers. Check the [README](transformers.md) and [Colab Notebook](https://colab.research.google.com/drive/1Ho81RBV8jysZ7e0FhsSCk_v938QeDuy3?usp=sharing) for more information. The code is released in the branch https://github.com/OFA-Sys/OFA/tree/feature/add_transformers.
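
For orientation, here is a minimal captioning sketch in the spirit of that branch; the `OFATokenizer`/`OFAModel` classes come from the `feature/add_transformers` branch, while the checkpoint directory, image path, and 480×480 resolution below are placeholder assumptions you should adapt to your model size and setup:

```python
from PIL import Image
from torchvision import transforms
from transformers import OFATokenizer, OFAModel  # available on the feature/add_transformers branch

ckpt_dir = "OFA-large"  # placeholder: local directory of a converted checkpoint
tokenizer = OFATokenizer.from_pretrained(ckpt_dir)
model = OFAModel.from_pretrained(ckpt_dir, use_cache=False)

# Image preprocessing; resolution and normalization follow the public demos and may differ per model size.
patch_transform = transforms.Compose([
    transforms.Resize((480, 480), interpolation=Image.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
patch_img = patch_transform(Image.open("example.jpg").convert("RGB")).unsqueeze(0)

# Instruction-style prompt for captioning.
inputs = tokenizer([" what does the image describe?"], return_tensors="pt").input_ids

gen = model.generate(inputs, patch_images=patch_img, num_beams=5, no_repeat_ngram_size=3)
print(tokenizer.batch_decode(gen, skip_special_tokens=True))
```

If the output looks off, double-check that the checkpoint you load matches the image resolution and instruction you use.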
<br><br>


# News
* 2022.8.5: Released support of **prompt tuning** for OFA (temporarily maintained at `feature/prompt_tuning`). Check our paper [here](https://arxiv.org/abs/2208.02532)!
# News
* 2022.8.16: Released the **Chinese** version of OFA. To use **OFA-CN**, simply switch to `bpe_dir=../../utils/BERT_CN_dict` and `bpe=bert` and use our Chinese checkpoints in [checkpoints_cn.md](checkpoints_cn.md). For now, we only provide base-size and large-size pretrained checkpoints, as well as checkpoints finetuned on [MUGE Caption](https://tianchi.aliyun.com/muge) and the Chinese version of RefCOCO(-/+/g) (to be released soon).
* 2022.8.5: Released support of **prompt tuning** for OFA. Check our paper [here](https://arxiv.org/abs/2208.02532)! See [prompt_tuning.md](prompt_tuning.md) for further details.
* 2022.7.7: Updated support of OFA on **huggingface transformers** (fixed bugs in forward, added the sequence generator from Fairseq to ensure performance, etc.). Refer to the doc [transformers.md](transformers.md) and the branch `feature/add_transformers`.
* 2022.6.17: Released the pretrained checkpoint of **OFA-Huge**. To use it, set `--arch=ofa_huge` in the script.
* 2022.5.15: OFA was accepted by **ICML 2022**.
* 2022.4.28: Added support for inference with **huggingface transformers**. For how to use it, please refer to the doc [transformers.md](transformers.md) and our [huggingface models](https://huggingface.co/OFA-Sys).
* 2022.4.16: Released lightweight pretrained models **OFA-Medium** (~93M params) and **OFA-Tiny** (~33M params) in [checkpoints.md](checkpoints.md). To use them, you just need to load the corresponding checkpoint and set `--arch=ofa_medium` or `--arch=ofa_tiny` in the scripts.
* 2022.3.23: Added [Encouraging Loss](https://arxiv.org/pdf/2110.06537.pdf) as a feature. See [README_EncouragingLoss.md](README_EncouragingLoss.md). Leveraging this feature, OFA-Large has achieved improved results in both VQA (**test-std acc: 80.67**) and Image Classification (**test acc: 85.6**) recently.
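
To give a rough sense of how Encouraging Loss differs from plain cross-entropy, below is a hedged sketch of the basic form as we read the paper: cross-entropy plus a `log(1 - p)` bonus on the gold class, so gradients do not vanish for well-classified tokens. The exact formulation and its conservative (bounded-bonus) variant are described in [README_EncouragingLoss.md](README_EncouragingLoss.md); treat that document as authoritative.

```python
import torch
import torch.nn.functional as F

def encouraging_loss(logits: torch.Tensor, target: torch.Tensor, bonus_weight: float = 1.0) -> torch.Tensor:
    """Sketch: cross-entropy plus an 'encouraging' bonus log(1 - p) on the gold class.

    The bonus gradient w.r.t. the gold logit is -p, so confident correct predictions
    keep receiving signal instead of saturating. The clamp bounds the bonus near p = 1,
    loosely playing the role of the paper's conservative variant.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_probs, target, reduction="none")                # standard cross-entropy
    p_gold = log_probs.exp().gather(1, target.unsqueeze(1)).squeeze(1)  # probability of the gold class
    bonus = torch.log((1.0 - p_gold).clamp(min=1e-6))                   # negative "reward" term
    return (ce + bonus_weight * bonus).mean()

# Toy usage
loss = encouraging_loss(torch.randn(8, 100), torch.randint(0, 100, (8,)))
```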

<details>
<summary><b>More News</b></summary>
<p>
<ul>
<li>2022.3.23: Added <a href="https://arxiv.org/pdf/2110.06537.pdf">Encouraging Loss</a> as a feature. See <a href="README_EncouragingLoss.md">README_EncouragingLoss.md</a>. Leveraging this feature, OFA-Large has achieved improved results in both VQA (<b>test-std acc: 80.67</b>) and Image Classification (<b>test acc: 85.6</b>) recently.</li>
<li>2022.3.21: Released codes for pretraining OFA.</li>
<li>2022.3.18: Released the finetuned <b>OFA-Base</b> (~180M parameters) checkpoints and running scripts for vision & language tasks, including: <b>Caption (146.4 CIDEr), VQA (78.07 on test-std), SNLI-VE (89.3 on dev), RefCOCO (90.67 on testA), RefCOCO+ (87.15 on testA) and RefCOCOg (82.31 on test-u)</b>.</li>
<li>2022.3.11: Released the finetuning & inference code/checkpoints for <b>Gigaword</b>.</li>
82 changes: 82 additions & 0 deletions checkpoints_cn.md
@@ -0,0 +1,82 @@
# Checkpoints (OFA-CN)

We provide checkpoints of OFA-CN, the Chinese version of OFA, in Base and Large sizes, including both pretrained models and models finetuned on image captioning and referring expression comprehension. Note that we translated the texts of the RefCOCO(-/+/g) datasets and finetuned OFA-CN on them. We plan to release these new datasets in the near future.
<br>

## Checkpoints
Below we provide the links for downloading the Chinese OFA checkpoints. A minimal loading sketch follows the lists.

### Pretraining
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/ofa_cn_large.pt"> Pretrained checkpoint (OFA-CN-Large) </a> (~443M parameters)
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/ofa_cn_base.pt "> Pretrained checkpoint (OFA-CN-Base) </a> (~160M parameters)

### Finetuning (OFA-Large)
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/caption_cn_large.pt"> Finetuned checkpoint for MUGE Caption (Stage 1) </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/refcoco_cn_large.pt"> Finetuned checkpoint for RefCOCO-CN </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/refcocoplus_cn_large.pt"> Finetuned checkpoint for RefCOCO+-CN </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/refcocog_cn_large.pt"> Finetuned checkpoint for RefCOCOg-CN </a>

### Finetuning (OFA-Base)
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/caption_cn_base.pt"> Finetuned checkpoint for MUGE Caption (Stage 1) </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/refcoco_cn_base.pt"> Finetuned checkpoint for RefCOCO-CN </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/refcocoplus_cn_base.pt"> Finetuned checkpoint for RefCOCO+-CN </a>
* <a href="https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/refcocog_cn_base.pt"> Finetuned checkpoint for RefCOCOg-CN </a>
<br>
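
Here is a minimal sketch of loading one of the checkpoints above with the repository's fairseq-style `checkpoint_utils` (the same module imported by `evaluate.py`); the local path and the OFA-CN overrides (`bpe=bert`, `bpe_dir` under `utils/BERT_CN_dict`) are assumptions to adapt to your setup:

```python
import torch
from utils import checkpoint_utils  # OFA's fairseq-style checkpoint utilities

ckpt_path = "checkpoints/ofa_cn_large.pt"                     # placeholder local path to a downloaded checkpoint
overrides = {"bpe": "bert", "bpe_dir": "utils/BERT_CN_dict"}  # Chinese BPE, per the README note on OFA-CN

# Returns the model ensemble, the saved config, and the task built from the checkpoint.
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task([ckpt_path], arg_overrides=overrides)
model = models[0].eval()
if torch.cuda.is_available():
    model = model.cuda()
print(f"loaded {sum(p.numel() for p in model.parameters()):,} parameters")
```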

## Model Card
Below we provide the basic information of the base-size and large-size OFA-CN.

<table border="1" width="100%">
<tr align="center">
<th>Model</th><th>#Params</th><th>Backbone</th><th>Hidden Size</th><th>Intermediate Size</th><th>#Heads</th><th>#Enc. Layers</th><th>#Dec. Layers</th>
</tr>
<tr align="center">
<td>OFA<sub>Base</sub></td><td>160M</td><td>ResNet101</td><td>768</td><td>3072</td><td>12</td><td>6</td><td>6</td>
</tr>
<tr align="center">
<td>OFA<sub>Large</sub></td><td>443M</td><td>ResNet152</td><td>1024</td><td>4096</td><td>16</td><td>12</td><td>12</td>
</tr>
</table>
<br>

## Results
Below we provide the results of OFA-CN and the baselines for comparison.

### [MUGE Caption](https://tianchi.aliyun.com/muge)
<table border="1" width="100%">
<tr align="center">
<td>Model</td><td>BLEU@4</td><td>ROUGE-L</td><td>CIDEr-D</td>
</tr>
<tr align="center">
<td>Trm </td><td>7.33</td><td>51.51</td><td>11.00</td>
</tr>
<tr align="center">
<td>M6</td><td>16.19</td><td>55.06</td><td>30.75</td>
</tr>
<tr align="center">
<td>OFA<sub>Base</sub></td><td>26.23</td><td>58.95</td><td>50.70</td>
</tr>
<tr align="center">
<td>OFA<sub>Large</sub></td><td><b>27.32</b></td><td><b>59.20</b></td><td><b>53.51</b></td>
</tr>
</table>

### RefCOCO-CN Series
<table border="1" width="100%">
<tr align="center">
<td>Model</td><td>RefCOCO(val/testA/testB)</td><td>RefCOCO+(val/testA/testB)</td><td>RefCOCOg(val/test-u)</td>
</tr>
<tr align="center">
<td>OFA<sub>Base</sub>(random-init)</td><td>30.13/35.07/25.03</td><td>17.89/20.90/15.83</td><td>20.30/20.45</td>
</tr>
<tr align="center">
<td>OFA<sub>Base</sub></td><td>82.18/86.07/<b>76.68</b></td><td>69.38/77.26/60.14</td><td><b>73.57/72.53</b></td>
</tr>
<tr align="center">
<td>OFA<sub>Large</sub></td><td><b>82.84/86.54</b>/76.50</td><td><b>71.30/78.56/61.85</b></td><td>71.96/71.30</td>
</tr>
</table>
<br>


3 changes: 2 additions & 1 deletion colab.md
@@ -2,7 +2,8 @@

We provide Colab notebooks for different downstream tasks so you can try out OFA. See below.

* [Image Captioning in Huggingface Transformers](https://colab.research.google.com/drive/1Ho81RBV8jysZ7e0FhsSCk_v938QeDuy3?usp=sharing)
* [Generic Interface](https://colab.research.google.com/drive/1jogyZ-2rdHU3XxZOf3TBfhex1XHqX-1m?usp=sharing#scrollTo=s9Vni6YUZOpC) (using different instructions to perform various tasks with just one model)
* [Image Captioning](https://colab.research.google.com/drive/1Q4eNhhhLcgOP4hHqwZwU1ijOlabgve1W?usp=sharing)
* [Referring Expression Comprehension](https://colab.research.google.com/drive/1AHQNRdaUpRTgr3XySHSlba8aXwBAjwPB?usp=sharing)
* [Open-Domain Visual Question Answering](https://colab.research.google.com/drive/14v6OQe_MxV_HMnsiKfnEeMR1UMqhzZNb?usp=sharing)
* [Open-Domain Visual Question Answering](https://colab.research.google.com/drive/1lsMsF-Vum3MVyXwSVF5E-Y23rHFvj_3y?usp=sharing)
6 changes: 5 additions & 1 deletion evaluate.py
@@ -18,6 +18,7 @@

from utils import checkpoint_utils
from utils.eval_utils import eval_step, merge_results
from utils.zero_shot_utils import zero_shot_step

logging.basicConfig(
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
@@ -131,7 +132,10 @@ def main(cfg: DictConfig, **kwargs):
        sample = utils.move_to_cuda(sample) if use_cuda else sample
        sample = utils.apply_to_sample(apply_half, sample) if cfg.common.fp16 else sample
        with torch.no_grad():
            result, scores = eval_step(task, generator, models, sample, **kwargs)
            if kwargs["zero_shot"]:
                result, scores = zero_shot_step(task, generator, models, sample)
            else:
                result, scores = eval_step(task, generator, models, sample, **kwargs)
        results += result
        score_sum += sum(scores) if scores is not None else 0
        score_cnt += len(scores) if scores is not None else 0
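
The new branch can be read as a small dispatch helper; the sketch below is illustrative only (the wrapper name `run_inference_step` is hypothetical) and reuses the imports and call signatures shown in this hunk:

```python
from utils.eval_utils import eval_step
from utils.zero_shot_utils import zero_shot_step

def run_inference_step(task, generator, models, sample, zero_shot=False, **kwargs):
    """Route a batch through zero-shot inference or the standard task-specific eval step."""
    if zero_shot:
        return zero_shot_step(task, generator, models, sample)
    return eval_step(task, generator, models, sample, **kwargs)
```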
2 changes: 1 addition & 1 deletion models/ofa/ofa.py
@@ -425,7 +425,7 @@ def ofa_medium_architecture(args):


@register_model_architecture("ofa", "ofa_tiny")
def ofa_medium_architecture(args):
def ofa_tiny_architecture(args):
    args.encoder_embed_dim = getattr(args, "encoder_embed_dim", 256)
    args.encoder_ffn_embed_dim = getattr(args, "encoder_ffn_embed_dim", 4 * 256)
    args.encoder_layers = getattr(args, "encoder_layers", 4)
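
The architecture functions above rely on the `getattr`-with-default idiom: every hyperparameter falls back to the architecture's default unless the user already set it on `args`. A tiny self-contained sketch (hypothetical names, standard library only) of how that composes:

```python
from argparse import Namespace

def toy_tiny_architecture(args: Namespace) -> None:
    # Mirrors the registered architecture functions: defaults apply only when unset.
    args.encoder_embed_dim = getattr(args, "encoder_embed_dim", 256)
    args.encoder_ffn_embed_dim = getattr(args, "encoder_ffn_embed_dim", 4 * 256)
    args.encoder_layers = getattr(args, "encoder_layers", 4)

args = Namespace(encoder_layers=6)  # user override survives
toy_tiny_architecture(args)
print(args.encoder_embed_dim, args.encoder_ffn_embed_dim, args.encoder_layers)  # 256 1024 6
```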
28 changes: 17 additions & 11 deletions models/ofa/unify_multihead_attention.py
@@ -127,7 +127,8 @@ def forward(
        self_attn_mask: Optional[Tensor] = None,
        before_softmax: bool = False,
        need_head_weights: bool = False,
        attn_bias: Optional[Tensor] = None
        attn_bias: Optional[Tensor] = None,
        prompt_kv: Optional[Tensor] = None
    ) -> Tuple[Tensor, Optional[Tensor]]:
        """Input shape: Time x Batch x Channel
@@ -314,7 +315,7 @@ def forward(

        if key_padding_mask is not None:
            assert key_padding_mask.size(0) == bsz
            assert key_padding_mask.size(1) == src_len
            assert key_padding_mask.size(1) == k.size(1)

        if self.add_zero_attn:
            assert v is not None
@@ -335,14 +336,19 @@
                    ],
                    dim=1,
                )

        if prompt_kv is not None:
            prompt_k, prompt_v = prompt_kv.split(1)
            prompt_k = prompt_k.squeeze(0).reshape(k.size(0), -1, k.size(2))
            prompt_v = prompt_v.squeeze(0).reshape(v.size(0), -1, v.size(2))
            k = torch.cat([prompt_k, k], dim=1)
            v = torch.cat([prompt_v, v], dim=1)
        attn_weights = torch.bmm(q, k.transpose(1, 2))
        attn_weights = self.apply_sparse_mask(attn_weights, tgt_len, src_len, bsz)
        attn_weights = self.apply_sparse_mask(attn_weights, tgt_len, k.size(1), bsz)

        assert list(attn_weights.size()) == [bsz * self.num_heads, tgt_len, src_len]
        assert list(attn_weights.size()) == [bsz * self.num_heads, tgt_len, k.size(1)]

        if attn_bias is not None:
            attn_weights += attn_bias
            attn_weights[:, :, -src_len:] += attn_bias[:, :, -src_len:]

        if attn_mask is not None:
            attn_mask = attn_mask.unsqueeze(0)
@@ -351,12 +357,12 @@
            attn_weights += attn_mask

        if self_attn_mask is not None:
            self_attn_mask = self_attn_mask.unsqueeze(1).expand(bsz, self.num_heads, tgt_len, src_len)
            attn_weights += self_attn_mask.contiguous().view(bsz * self.num_heads, tgt_len, src_len)
            self_attn_mask = self_attn_mask.unsqueeze(1).expand(bsz, self.num_heads, tgt_len, k.size(1))
            attn_weights += self_attn_mask.contiguous().view(bsz * self.num_heads, tgt_len, k.size(1))

        if key_padding_mask is not None:
            # don't attend to padding symbols
            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, k.size(1))
            if not is_tpu:
                attn_weights = attn_weights.masked_fill(
                    key_padding_mask.unsqueeze(1).unsqueeze(2).to(torch.bool),
@@ -366,7 +372,7 @@
                attn_weights = attn_weights.transpose(0, 2)
                attn_weights = attn_weights.masked_fill(key_padding_mask, float("-inf"))
                attn_weights = attn_weights.transpose(0, 2)
            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)
            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, k.size(1))

        if before_softmax:
            return attn_weights, v
@@ -394,7 +400,7 @@ def forward(
        attn_weights: Optional[Tensor] = None
        if need_weights:
            attn_weights = attn_weights_float.view(
                bsz, self.num_heads, tgt_len, src_len
                bsz, self.num_heads, tgt_len, k.size(1)
            ).transpose(1, 0)
            if not need_head_weights:
                # average attention weights over heads
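
The `prompt_kv` additions above prepend learned prompt keys/values to the projected `k`/`v` (prefix-style prompt tuning), which is why the sparse mask, the shape assert, the self-attention/padding masks, and the weight views all switch from `src_len` to `k.size(1)`, and why `attn_bias` is added only to the trailing `src_len` positions. A standalone sketch of that shape bookkeeping with plain `torch` and made-up dimensions:

```python
import torch

bsz_heads, tgt_len, src_len, head_dim, prompt_len = 16, 5, 7, 64, 4

q = torch.randn(bsz_heads, tgt_len, head_dim)
k = torch.randn(bsz_heads, src_len, head_dim)
v = torch.randn(bsz_heads, src_len, head_dim)

# Stacked prompt tensor: dim 0 holds (key, value), mirroring prompt_kv.split(1) above.
prompt_kv = torch.randn(2, bsz_heads, prompt_len, head_dim)
prompt_k, prompt_v = prompt_kv.split(1)
prompt_k = prompt_k.squeeze(0).reshape(k.size(0), -1, k.size(2))
prompt_v = prompt_v.squeeze(0).reshape(v.size(0), -1, v.size(2))

# Prepend prompts: the key/value length grows from src_len to prompt_len + src_len.
k = torch.cat([prompt_k, k], dim=1)
v = torch.cat([prompt_v, v], dim=1)

attn_weights = torch.bmm(q, k.transpose(1, 2))
assert list(attn_weights.size()) == [bsz_heads, tgt_len, k.size(1)]

# A bias computed for the original source positions applies only to the trailing src_len slots.
attn_bias = torch.zeros(bsz_heads, tgt_len, src_len)
attn_weights[:, :, -src_len:] += attn_bias

out = torch.bmm(attn_weights.softmax(dim=-1), v)  # (bsz_heads, tgt_len, head_dim)
print(out.shape)
```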