Commit 46eb235: merge changes from the simplification of AltDiffusion loading

Zac Liu committed Nov 26, 2022
2 parents 54ac82d + 25b49a6
Showing 18 changed files with 938 additions and 283 deletions.
8 changes: 5 additions & 3 deletions README.md
100755 → 100644
@@ -34,9 +34,8 @@ The code is partially based on [GLM](https://github.com/THUDM/GLM), [Transformer
- [Quick Start](#quick-start)
- [Load model and tokenizer](#load-model-and-tokenizer)
- [Predictor](#predictor)
- [NER task](#ner-task)
- [Title generation task](#title-generation-task)
- [Semantic matching task](#semantic-matching-task)
- [Text-to-image generation task](/examples/AltDiffusion/README.md)

- [Pretrained Models and examples](#pretrained-models-and-examples)
- [Tutorials](#tutorials)
- [Contributing](#contributing)
@@ -123,6 +122,9 @@ for text in test_data:

## Pretrained Models and examples

* [Text_image_matching with AltCLIP](/examples/AltCLIP/README.md)
* [Text-to-image generation with AltDiffusion](/examples/AltDiffusion/README.md)
* [Blank_Filling_QA with GLM ](/docs/TUTORIAL_11_GLM_BLANK_FILLING_QA.md)
* [Title Generation with GLM ](/docs/TUTORIAL_12_GLM_EXAMPLE_TITLE_GENERATION.md)
* [Poetry generation with GLM-large-ch](docs/TUTORIAL_13_GLM_EXAMPLE_PEOTRY_GENERATION.md)
7 changes: 4 additions & 3 deletions README_zh.md
100755 → 100644
@@ -34,9 +34,8 @@
- [Quick Start](#快速上手)
- [Load model and tokenizer](#加载模型和分词器)
- [Use the predictor](#使用预测器)
- [NER task example](#命名实体识别任务示例 )
- [Title generation task example](#标题生成任务示例)
- [Semantic matching task example](#语义相似度匹配任务示例)
- [Text-to-image generation task example](/examples/AltDiffusion/README.md)

- [Pretrained models and examples](#预训练模型以及样例)
- [Tutorials](#教程)
- [Contributing](#贡献代码)
@@ -190,6 +189,8 @@ for text_pair in test_data:
```

# Pretrained models and examples
* [Text-image matching with AltCLIP](/examples/AltCLIP/README.md)
* [Text-to-image generation with AltDiffusion](/examples/AltDiffusion/README.md)
* [Blank-filling QA with GLM-large-ch](/doc_zh/TUTORIAL_11_GLM_BLANK_FILLING_QA.md)
* [Poetry generation with GLM-large-ch](doc_zh/TUTORIAL_13_GLM_EXAMPLE_PEOTRY_GENERATION.md)
* [Title generation with GLM-large-ch](doc_zh/TUTORIAL_12_GLM_EXAMPLE_TITLE_GENERATION.md)
192 changes: 0 additions & 192 deletions examples/AltCLIP/hf_altclip/modeling_kd.py

This file was deleted.

130 changes: 130 additions & 0 deletions examples/EVA_CLIP/README.md
@@ -0,0 +1,130 @@
# Contrastive Language-Image Pre-Training with [EVA](https://github.com/baaivision/EVA) (EVA-CLIP)

## Model Card

| model name | #param. | precision | data | batch size | IN-1K zero-shot top-1 |
|:-----------:|:------:|:------:|:------:|:------:|:------:|
| `eva-clip` | 1.3B | `fp16` | [LAION-400M](https://laion.ai/laion-400-open-dataset/) | 41K | 78.5 |


To our knowledge, EVA-CLIP is the largest performant open-source CLIP model, as measured by zero-shot classification performance.

For more details on EVA-CLIP, please refer to Section 2.3.5 of the [paper](https://arxiv.org/pdf/2211.07636.pdf).

## Performance

| dataset | acc1 | acc5 | mean_per_class_recall |
|:-----------:|:------:|:------:|:------:|
| `imagenet1k` | 78.53 | 95.51 | 78.51 |
| `imagenet-a` | 73.59 | 90.93 | 69.97 |
| `imagenet-r` | 92.5 | 98.24 | 91.19 |
| `imagenet-sketch` | 67.31 | 89.07 | 67.31 |
| `imagenetv2` | 71.52 | 92.11 | 71.56 |
| `objectnet` | 72.33 | 89.37 | 70.88 |
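
Here `acc1`/`acc5` are top-1/top-5 accuracy and `mean_per_class_recall` is per-class recall averaged over all classes. As a minimal sketch of how these metrics could be computed from a matrix of image-text similarity logits (illustrative only; the names `logits`, `labels`, and `zero_shot_metrics` are assumptions, not part of this repository):

```python
import torch

def zero_shot_metrics(logits: torch.Tensor, labels: torch.Tensor, num_classes: int):
    """Illustrative sketch, assuming `logits` has shape [num_images, num_classes]
    (similarity against one text prompt per class) and `labels` holds the
    ground-truth class index of each image."""
    # top-1 / top-5 accuracy
    top5 = logits.topk(5, dim=-1).indices                              # [N, 5]
    acc1 = (top5[:, 0] == labels).float().mean().item()
    acc5 = (top5 == labels.unsqueeze(1)).any(dim=-1).float().mean().item()

    # mean per-class recall: recall of each class, averaged over classes
    preds = top5[:, 0]
    recalls = []
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            recalls.append((preds[mask] == c).float().mean())
    mean_per_class_recall = torch.stack(recalls).mean().item()
    return acc1, acc5, mean_per_class_recall
```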

## Usage

```python
import io
import urllib.request

import torch
from PIL import Image
from flagai.auto_model.auto_loader import AutoLoader
from flagai.data.dataset.mm.clip_dataset import clip_transform

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

loader = AutoLoader(task_name="txt_img_matching",  # contrastive learning
                    model_name="eva-clip")

model = loader.get_model()
model.eval()
model.to(device)
tokenizer = loader.get_tokenizer()
transform = clip_transform(img_size=model.visual.image_size)

def download_image(url):
    # Download an image over HTTP and return it as an in-memory byte stream.
    urllib_request = urllib.request.Request(
        url,
        data=None,
        headers={"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0"},
    )
    with urllib.request.urlopen(urllib_request, timeout=10) as r:
        img_stream = io.BytesIO(r.read())
    return img_stream

def inference():
    # local image
    # image = Image.open("/path/to/image")
    # online image
    image = Image.open(download_image("https://bkimg.cdn.bcebos.com/pic/4610b912c8fcc3ce2d02315d9d45d688d53f209a?x-bce-process=image/watermark,image_d2F0ZXIvYmFpa2UxMTY=,g_7,xp_5,yp_5"))
    image = transform(image).unsqueeze(0).to(device)
    text = tokenizer.tokenize_as_tensor(["a tomato", "a cat"]).to(device)

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        text_probs = (image_features @ text_features.T).softmax(dim=-1)

    print(text_probs.cpu().numpy()[0].tolist())  # [1.0, 0.0]

if __name__ == "__main__":
    inference()
```
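
Note that the snippet above applies softmax to raw (unnormalized) dot products, whereas the zero-shot example in the next section first L2-normalizes both feature sets and scales the cosine similarities by 100 before the softmax. Below is a minimal sketch of that normalized variant, reusing `model`, `image`, and `text` from the snippet above (the helper name `cosine_probs` is an illustrative assumption):

```python
import torch

def cosine_probs(model, image, text):
    # Sketch: softmax over scaled cosine similarities, mirroring the
    # normalization used in the zero-shot CIFAR-100 example below.
    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        # L2-normalize so the dot product becomes cosine similarity
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        return (100.0 * image_features @ text_features.T).softmax(dim=-1)
```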

## Zero-Shot Prediction
The code below performs zero-shot prediction using EVA-CLIP. This example takes an image from the CIFAR-100 dataset and predicts the most likely labels among the dataset's 100 text labels.

```python
import os
import torch
from torchvision.datasets import CIFAR100
from flagai.auto_model.auto_loader import AutoLoader
from flagai.data.dataset.mm.clip_dataset import clip_transform

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

loader = AutoLoader(task_name="txt_img_matching",  # contrastive learning
                    model_name="eva-clip")

model = loader.get_model()
model.eval()
model.to(device)
tokenizer = loader.get_tokenizer()
transform = clip_transform(img_size=model.visual.image_size)

# Download the dataset
cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False)

# Prepare the inputs
image, class_id = cifar100[3637]
image_input = transform(image).unsqueeze(0).to(device)
text_inputs = torch.cat([tokenizer.tokenize_as_tensor(f"a photo of a {c}") for c in cifar100.classes]).to(device)

# Calculate features
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)

# Pick the top 5 most similar labels for the image
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(5)

# Print the result
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{cifar100.classes[index]:>16s}: {100 * value.item():.2f}%")

```
The output will look like the following (the exact numbers may be slightly different depending on the compute device):
```bash
Top predictions:

snake: 100.00%
turtle: 0.00%
caterpillar: 0.00%
worm: 0.00%
leopard: 0.00%
```

## Acknowledgement

EVA-CLIP is built with [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip) and [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark).
Thanks for their awesome work!