Commit 46eb235: merge changes from the simplification of AltDiffusion loading

Zac Liu committed Nov 26, 2022
2 parents 54ac82d + 25b49a6
Showing 18 changed files with 938 additions and 283 deletions.
8 changes: 5 additions & 3 deletions README.md
100755 → 100644
@@ -34,9 +34,8 @@ The code is partially based on [GLM](https://github.com/THUDM/GLM), [Transformer
- [Quick Start](#quick-start)
- [Load model and tokenizer](#load-model-and-tokenizer)
- [Predictor](#predictor)
- [NER task](#ner-task)
- [Title generation task](#title-generation-task)
- [Semantic matching task](#semantic-matching-task)
- [Text-to-image generation task](/examples/AltDiffusion/README.md)

- [Pretrained Models and examples](#pretrained-models-and-examples)
- [Tutorials](#tutorials)
- [Contributing](#contributing)
@@ -123,6 +122,9 @@ for text in test_data:

## Pretrained Models and examples

* [Text_image_matching with AltCLIP](/examples/AltCLIP/README.md)
* [Text-to-image generation with AltDiffusion](/examples/AltDiffusion/README.md)
* [Blank_Filling_QA with GLM ](/docs/TUTORIAL_11_GLM_BLANK_FILLING_QA.md)
* [Title Generation with GLM ](/docs/TUTORIAL_12_GLM_EXAMPLE_TITLE_GENERATION.md)
* [Poetry generation with GLM-large-ch](docs/TUTORIAL_13_GLM_EXAMPLE_PEOTRY_GENERATION.md)
7 changes: 4 additions & 3 deletions README_zh.md
100755 → 100644
@@ -34,9 +34,8 @@
- [Quick Start](#快速上手)
- [Load model and tokenizer](#加载模型和分词器)
- [Use the predictor](#使用预测器)
- [NER task example](#命名实体识别任务示例 )
- [Title generation task example](#标题生成任务示例)
- [Semantic matching task example](#语义相似度匹配任务示例)
- [Text-to-image generation task example](/examples/AltDiffusion/README.md)

- [Pretrained models and examples](#预训练模型以及样例)
- [Tutorials](#教程)
- [Contributing](#贡献代码)
@@ -190,6 +189,8 @@ for text_pair in test_data:
```

# Pretrained models and examples
* [Text-image matching with AltCLIP](/examples/AltCLIP/README.md)
* [Text-to-image generation with AltDiffusion](/examples/AltDiffusion/README.md)
* [Blank-filling QA with GLM-large-ch](/doc_zh/TUTORIAL_11_GLM_BLANK_FILLING_QA.md)
* [Poetry generation with GLM-large-ch](doc_zh/TUTORIAL_13_GLM_EXAMPLE_PEOTRY_GENERATION.md)
* [Title generation with GLM-large-ch](doc_zh/TUTORIAL_12_GLM_EXAMPLE_TITLE_GENERATION.md)
192 changes: 0 additions & 192 deletions examples/AltCLIP/hf_altclip/modeling_kd.py

This file was deleted.

130 changes: 130 additions & 0 deletions examples/EVA_CLIP/README.md
@@ -0,0 +1,130 @@
# Contrastive Language-Image Pre-Training with [EVA](https://github.com/baaivision/EVA) (EVA-CLIP)

## Model Card

| model name | #param. | precision | data | batch size | IN-1K zero-shot top-1 |
|:-----------:|:------:|:------:|:------:|:------:|:------:|
| `eva-clip` | 1.3B | `fp16` | [LAION-400M](https://laion.ai/laion-400-open-dataset/) | 41K | 78.5 |


To our knowledge, EVA-CLIP is the largest performant open-source CLIP model, as measured by zero-shot classification performance.

For more details on EVA-CLIP, please refer to Section 2.3.5 of the [paper](https://arxiv.org/pdf/2211.07636.pdf).

## Performance

| dataset | acc1 | acc5 | mean_per_class_recall |
|:-----------:|:------:|:------:|:------:|
| `imagenet1k` | 78.53 | 95.51 | 78.51 |
| `imagenet-a` | 73.59 | 90.93 | 69.97 |
| `imagenet-r` | 92.5 | 98.24 | 91.19 |
| `imagenet-sketch` | 67.31 | 89.07 | 67.31 |
| `imagenetv2` | 71.52 | 92.11 | 71.56 |
| `objectnet` | 72.33 | 89.37 | 70.88 |
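
Here `acc1`/`acc5` are top-1/top-5 accuracy and `mean_per_class_recall` is per-class recall averaged over all classes. As a minimal sketch of how these metrics could be computed from a matrix of image-text similarity logits (illustrative only; the names `logits`, `labels`, and `zero_shot_metrics` are assumptions, not part of this repository):

```python
import torch

def zero_shot_metrics(logits: torch.Tensor, labels: torch.Tensor, num_classes: int):
    """Illustrative sketch, assuming `logits` has shape [num_images, num_classes]
    (similarity against one text prompt per class) and `labels` holds the
    ground-truth class index of each image."""
    # top-1 / top-5 accuracy
    top5 = logits.topk(5, dim=-1).indices                              # [N, 5]
    acc1 = (top5[:, 0] == labels).float().mean().item()
    acc5 = (top5 == labels.unsqueeze(1)).any(dim=-1).float().mean().item()

    # mean per-class recall: recall of each class, averaged over classes
    preds = top5[:, 0]
    recalls = []
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            recalls.append((preds[mask] == c).float().mean())
    mean_per_class_recall = torch.stack(recalls).mean().item()
    return acc1, acc5, mean_per_class_recall
```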

## Usage

```python
import io
import urllib.request

import torch
from PIL import Image
from flagai.auto_model.auto_loader import AutoLoader
from flagai.data.dataset.mm.clip_dataset import clip_transform

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

loader = AutoLoader(task_name="txt_img_matching",  # contrastive learning
                    model_name="eva-clip")

model = loader.get_model()
model.eval()
model.to(device)
tokenizer = loader.get_tokenizer()
transform = clip_transform(img_size=model.visual.image_size)

def download_image(url):
    # Download an image over HTTP and return it as an in-memory byte stream.
    urllib_request = urllib.request.Request(
        url,
        data=None,
        headers={"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0"},
    )
    with urllib.request.urlopen(urllib_request, timeout=10) as r:
        img_stream = io.BytesIO(r.read())
    return img_stream

def inference():
    # local image
    # image = Image.open("/path/to/image")
    # online image
    image = Image.open(download_image("https://bkimg.cdn.bcebos.com/pic/4610b912c8fcc3ce2d02315d9d45d688d53f209a?x-bce-process=image/watermark,image_d2F0ZXIvYmFpa2UxMTY=,g_7,xp_5,yp_5"))
    image = transform(image).unsqueeze(0).to(device)
    text = tokenizer.tokenize_as_tensor(["a tomato", "a cat"]).to(device)

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        text_probs = (image_features @ text_features.T).softmax(dim=-1)

    print(text_probs.cpu().numpy()[0].tolist())  # [1.0, 0.0]

if __name__ == "__main__":
    inference()
```
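
Note that the snippet above applies softmax to raw (unnormalized) dot products, whereas the zero-shot example in the next section first L2-normalizes both feature sets and scales the cosine similarities by 100 before the softmax. Below is a minimal sketch of that normalized variant, reusing `model`, `image`, and `text` from the snippet above (the helper name `cosine_probs` is an illustrative assumption):

```python
import torch

def cosine_probs(model, image, text):
    # Sketch: softmax over scaled cosine similarities, mirroring the
    # normalization used in the zero-shot CIFAR-100 example below.
    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        # L2-normalize so the dot product becomes cosine similarity
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        return (100.0 * image_features @ text_features.T).softmax(dim=-1)
```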

## Zero-Shot Prediction
The code below performs zero-shot prediction using EVA-CLIP. This example takes an image from the CIFAR-100 dataset and predicts the most likely labels among the dataset's 100 text labels.

```python
import os
import torch
from torchvision.datasets import CIFAR100
from flagai.auto_model.auto_loader import AutoLoader
from flagai.data.dataset.mm.clip_dataset import clip_transform

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

loader = AutoLoader(task_name="txt_img_matching",  # contrastive learning
                    model_name="eva-clip")

model = loader.get_model()
model.eval()
model.to(device)
tokenizer = loader.get_tokenizer()
transform = clip_transform(img_size=model.visual.image_size)

# Download the dataset
cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False)

# Prepare the inputs
image, class_id = cifar100[3637]
image_input = transform(image).unsqueeze(0).to(device)
text_inputs = torch.cat([tokenizer.tokenize_as_tensor(f"a photo of a {c}") for c in cifar100.classes]).to(device)

# Calculate features
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)

# Pick the top 5 most similar labels for the image
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(5)

# Print the result
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{cifar100.classes[index]:>16s}: {100 * value.item():.2f}%")

```
The output will look like the following (the exact numbers may be slightly different depending on the compute device):
```bash
Top predictions:

snake: 100.00%
turtle: 0.00%
caterpillar: 0.00%
worm: 0.00%
leopard: 0.00%
```

## Acknowledgement

EVA-CLIP is built with [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip) and [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark).
Thanks for their awesome work!