Commit

First commit
nxf1111 committed Mar 25, 2024
1 parent fb4293e commit 02117ef
Showing 53 changed files with 7,151 additions and 1 deletion.
Empty file modified .gitignore
100644 → 100755
Empty file.
Empty file modified LICENSE
100644 → 100755
Empty file.
85 changes: 84 additions & 1 deletion README.md
100644 → 100755
@@ -1,2 +1,85 @@
# OmDet
# OmDet-Turbo

<p align="center">
<a href="https://arxiv.org/abs/2403.06892"><strong> [Paper 📄] </strong></a> <a href="https://arxiv.org/abs/2403.06892"><strong> [Model 🗂️] </strong></a>
</p>
<p align="center">
Fast and accurate open-vocabulary end-to-end object detection
</p>

***
## 🗓️ Updates
* 03/25/2024: Inference code and a pretrained OmDet-Turbo-Tiny model released.
* 03/12/2024: GitHub open-source project created.

***
## 🔗 Related Works
If you are interested in our research, we welcome you to explore our other wonderful projects.

🔆 [How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection](https://arxiv.org/abs/2308.13177)(AAAI24)

🔆 [OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network](https://ietresearch.onlinelibrary.wiley.com/doi/full/10.1049/cvi2.12268)(IET Computer Vision)

***
## 📖 Introduction
This repository is the official PyTorch implementation for **OmDet-Turbo**, a fast transformer-based open-vocabulary object detection model.

**⭐️Highlights**
1. **OmDet-Turbo** is a transformer-based real-time open-vocabulary
detector that combines strong OVD capabilities with fast inference speed.
This model addresses the challenges of efficient detection in open-vocabulary
scenarios while maintaining high detection performance.
2. We introduce the **Efficient Fusion Head**, a swift multimodal fusion module
designed to alleviate the computational burden on the encoder and reduce
the time consumption of the head with ROI.
3. The OmDet-Turbo-Base model achieves state-of-the-art zero-shot performance on the ODinW and OVDEval datasets, with AP scores
of **30.1** and **26.86**, respectively.
4. The inference speed of OmDet-Turbo-Base on the COCO val2017 dataset reaches **100.2** FPS on an A100 GPU.

For more details, check out our paper **[Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head](https://arxiv.org/abs/2403.06892)**.
<img src="docs/turbo_model.jpeg" alt="model_structure" width="100%">


***
## ⚡️ Inference Speed
Comparison of inference speeds for each component in the tiny-size model.
<img src="docs/speed_compare.jpeg" alt="speed" width="100%">

***
## 🛠️ How To Install
Follow the [Installation Instructions](install.md) to set up the environment for OmDet-Turbo.

***
## 🚀 How To Run
1. Download our pretrained model and the [CLIP](https://huggingface.co/omlab/OmDet-Turbo_tiny_SWIN_T/resolve/main/ViT-B-16.pt?download=true) checkpoint.
2. Create a folder named **resources** and put the downloaded models into it.
3. Run **run_demo.py**; the images with predicted results will be saved in the **./outputs** folder.

A language cache is already enabled when running inference with **run_demo.py**. For more details, see the **run_demo.py** script.
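
A language cache avoids re-encoding label text that has already been seen. The sketch below shows one way such a cache could work, assuming a least-recently-used eviction policy. The class name `LabelEmbeddingCache`, its `get_or_compute` method, and the `encode` callback are hypothetical and are not taken from **run_demo.py**.

```python
from collections import OrderedDict
from typing import Callable


class LabelEmbeddingCache:
    """Hypothetical LRU cache for text embeddings (illustrative only)."""

    def __init__(self, capacity: int = 128):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._store = OrderedDict()  # label -> embedding, oldest first

    def get_or_compute(self, label: str, encode: Callable[[str], object]):
        # Cache hit: mark the entry as most recently used and return it.
        if label in self._store:
            self._store.move_to_end(label)
            return self._store[label]
        # Cache miss: run the (assumed expensive) text encoder once.
        value = encode(label)
        self._store[label] = value
        # Evict the least recently used entry when over capacity.
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)
        return value
```

With a cache like this, repeated requests for the same label set call the text encoder only on the first request, which is where the saving comes from when the same labels are used across many images.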


***
## 📦 Model Zoo
Performance on COCO and LVIS is evaluated under the zero-shot setting.

Model | Backbone | Pre-Train Data | COCO | LVIS | FPS (pytorch/trt) |Weight
-- |--------|-----------------| -- | -- |-------------------| --
OmDet-Turbo-Tiny| Swin-T | O365,GoldG | 42.5 | 30.3 | 21.5/140.0 | [weight](https://huggingface.co/omlab/OmDet-Turbo_tiny_SWIN_T/tree/main)

***
## 📝 Main Results
<img src="docs/main_results.png" alt="main_result" width="100%">


***
## Citation
Please consider citing our papers if you use our projects:

```
@article{zhao2024real,
title={Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head},
author={Zhao, Tiancheng and Liu, Peng and He, Xuan and Zhang, Lu and Lee, Kyusong},
journal={arXiv preprint arXiv:2403.06892},
year={2024}
}
```
80 changes: 80 additions & 0 deletions configs/OmDet-Turbo_tiny_SWIN_T.yaml
@@ -0,0 +1,80 @@
MODEL:
  META_ARCHITECTURE: OmDetV2Turbo
  DEPLOY_MODE: true
  SWIN:
    OUT_FEATURES:
      - 1
      - 2
      - 3
    SIZE: T
    USE_CHECKPOINT: false
  BACKBONE:
    NAME: build_swintransformer_backbone
  LANGUAGE_BACKBONE:
    MODEL_TYPE: "clip"
    LANG_DIM: 512
  DEVICE: cuda
  FUSE_TYPE: merged_attn
  TRANSFORMER_DECODER: ELADecoder
  TRANSFORMER_ENCODER: ELAEncoder
  HEAD: DINOHead
  ELAEncoder:
    act: gelu
    depth_mult: 1.0
    dim_feedforward: 2048
    encoder_layer: TransformerLayer
    eval_size: null
    expansion: 1.0
    feat_strides:
      - 8
      - 16
      - 32
    hidden_dim: 256
    in_channels:
      - 192
      - 384
      - 768
    num_encoder_layers: 1
    pe_temperature: 10000
    use_encoder_idx:
      - 2
  PIXEL_MEAN:
    - 123.675
    - 116.28
    - 103.53
  PIXEL_STD:
    - 58.395
    - 57.12
    - 57.375
  ELADecoder:
    activation: relu
    backbone_feat_channels:
      - 256
      - 256
      - 256
    box_noise_scale: 1.0
    cls_type: cosine
    dim_feedforward: 2048
    dropout: 0.0
    eps: 0.01
    eval_idx: -1
    eval_size: null
    feat_strides:
      - 8
      - 16
      - 32
    hidden_dim: 256
    label_noise_ratio: 0.5
    learnt_init_query: false
    nhead: 8
    num_decoder_layers: 6
    num_decoder_points: 4
    num_denoising: 100
    num_levels: 3
    num_queries: 900
    position_embed_type: sine
  WEIGHTS: resources/swin_tiny_patch4_window7_224.pkl
INPUT:
  FORMAT: RGB
  MAX_SIZE_TEST: 640
  MIN_SIZE_TEST: 640
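
The config above sets both test sizes to 640 and lists `feat_strides` of 8, 16, and 32. The sketch below works out the spatial size of each feature level, assuming each level's side equals the input side divided by its stride (exact here because 640 is a multiple of 32). `feature_map_sizes` is an illustrative helper, not part of the repository.

```python
def feature_map_sizes(input_size: int, strides):
    """Return (height, width) of each feature level for a square input."""
    return [(input_size // s, input_size // s) for s in strides]


# Values taken from the config: 640x640 input, strides 8/16/32.
sizes = feature_map_sizes(640, [8, 16, 32])

# Total number of spatial positions across the three levels.
total_positions = sum(h * w for h, w in sizes)
```

Under this assumption the three levels are 80x80, 40x40, and 20x20, for 8400 positions in total, against which the 900 decoder queries (`num_queries`) are selected.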
Binary file added docs/main_results.png
Binary file added docs/speed_compare.jpeg
Binary file added docs/turbo_model.jpeg
34 changes: 34 additions & 0 deletions install.md
@@ -0,0 +1,34 @@
# Install
## Requirements

* CUDA>=11.8

* Python>=3.9

Create a Python environment:
```bash
conda create -n omdet python=3.9
```
Activate the environment:
```bash
conda activate omdet
```

* PyTorch>=2.1.0, torchvision>=0.17.1

If your CUDA version is 11.8, you can install PyTorch as follows:
```bash
conda install pytorch==2.2.1 torchvision==0.17.1 pytorch-cuda=11.8 -c pytorch -c nvidia
```

* detectron2>=0.6.0:

Install detectron2:
```bash
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
```

* Other requirements
```bash
pip install -r requirements.txt
```
Empty file added omdet/__init__.py
Empty file.
Empty file added omdet/infernece/__init__.py
Empty file.
57 changes: 57 additions & 0 deletions omdet/infernece/base_engine.py
@@ -0,0 +1,57 @@
import torch
from PIL import Image
import requests
import io
import base64
from detectron2.data.detection_utils import _apply_exif_orientation, convert_PIL_to_numpy
import numpy as np


def get_output_shape(oldh: int, oldw: int, short_edge_length: int, max_size: int):
    """
    Compute the output size given input size and target short edge length.
    """
    h, w = oldh, oldw
    size = short_edge_length * 1.0
    scale = size / min(h, w)
    if h < w:
        newh, neww = size, scale * w
    else:
        newh, neww = scale * h, size
    if max(newh, neww) > max_size:
        scale = max_size * 1.0 / max(newh, neww)
        newh = newh * scale
        neww = neww * scale
    neww = int(neww + 0.5)
    newh = int(newh + 0.5)
    return (newh, neww)


class BaseEngine(object):
    def _load_data(self, src_type, cfg, data, return_transform=False):
        if src_type == 'local':
            image_data = [Image.open(x) for x in data]

        elif src_type == 'url':
            image_data = []
            for x in data:
                temp = Image.open(io.BytesIO(requests.get(x).content))
                image_data.append(temp)

        else:
            raise Exception("Unknown mode {}.".format(src_type))

        input_data = []
        transforms = []
        for x in image_data:
            width, height = x.size
            pil_image = x.resize((cfg.INPUT.MIN_SIZE_TEST, cfg.INPUT.MIN_SIZE_TEST), Image.BILINEAR)
            image = convert_PIL_to_numpy(pil_image, cfg.INPUT.FORMAT)

            image = torch.as_tensor(np.ascontiguousarray(image.transpose(2, 0, 1)))
            input_data.append({"image": image, "height": height, "width": width})

        if return_transform:
            return input_data, transforms
        else:
            return input_data
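
Note on `base_engine.py`: `get_output_shape` computes an aspect-preserving size, while `_load_data` resizes every image to a square of side `cfg.INPUT.MIN_SIZE_TEST` and does not call it. The sketch below restates the arithmetic of `get_output_shape` as a standalone function so the behaviour can be checked without detectron2 installed. The name `aspect_preserving_shape` is illustrative and does not appear in the repository.

```python
def aspect_preserving_shape(h: int, w: int, short_edge: int, max_size: int):
    """Restates get_output_shape: scale the short edge, then cap the long edge."""
    size = float(short_edge)
    scale = size / min(h, w)
    # Scale so that the shorter side equals short_edge.
    if h < w:
        newh, neww = size, scale * w
    else:
        newh, neww = scale * h, size
    # If the longer side now exceeds max_size, shrink both sides again.
    if max(newh, neww) > max_size:
        shrink = float(max_size) / max(newh, neww)
        newh, neww = newh * shrink, neww * shrink
    # Round to the nearest integer pixel.
    return int(newh + 0.5), int(neww + 0.5)


landscape = aspect_preserving_shape(480, 640, short_edge=640, max_size=640)
portrait = aspect_preserving_shape(1000, 500, short_edge=640, max_size=640)
```

With both limits at 640, a 480x640 image keeps its size and a 1000x500 image becomes 640x320, whereas the square resize in `_load_data` would map both to 640x640 and change their aspect ratios.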
95 changes: 95 additions & 0 deletions omdet/infernece/det_engine.py
@@ -0,0 +1,95 @@
import os
import torch
from typing import List, Union, Dict
from omdet.utils.tools import chunks
from detectron2.checkpoint import DetectionCheckpointer
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer as Trainer
from omdet.utils.cache import LRUCache
from omdet.infernece.base_engine import BaseEngine
from detectron2.utils.logger import setup_logger
from omdet.omdet_v2_turbo.config import add_omdet_v2_turbo_config


class DetEngine(BaseEngine):
    def __init__(self, model_dir='resources/', device='cpu', batch_size=10):
        self.model_dir = model_dir
        self._models = LRUCache(10)
        self.device = device
        self.batch_size = batch_size
        self.logger = setup_logger(name=__name__)

    def _init_cfg(self, cfg, model_id):
        cfg.MODEL.WEIGHTS = os.path.join(self.model_dir, model_id+'.pth')
        cfg.MODEL.DEVICE = self.device
        cfg.INPUT.MAX_SIZE_TEST = 640
        cfg.INPUT.MIN_SIZE_TEST = 640
        cfg.MODEL.DEPLOY_MODE = True
        cfg.freeze()
        return cfg

    def count_parameters(self, model):
        return sum(p.numel() for p in model.parameters())

    def _load_model(self, model_id):
        if not self._models.has(model_id):
            cfg = get_cfg()
            add_omdet_v2_turbo_config(cfg)
            cfg.merge_from_file(os.path.join('configs', model_id+'.yaml'))
            cfg = self._init_cfg(cfg, model_id)
            model = Trainer.build_model(cfg)
            self.logger.info("Model:\n{}".format(model))
            DetectionCheckpointer(model).load(cfg.MODEL.WEIGHTS)
            print("Loading a OmDet model {}".format(cfg.MODEL.WEIGHTS))
            model.eval()
            model.to(cfg.MODEL.DEVICE)
            print("Total parameters: {}".format(self.count_parameters(model)))
            self._models.put(model_id, (model, cfg))

        return self._models.get(model_id)

    def inf_predict(self, model_id,
                    data: List,
                    task: Union[str, List],
                    labels: List[str],
                    src_type: str = 'local',
                    conf_threshold: float = 0.5,
                    nms_threshold: float = 0.5
                    ):

        if len(task) == 0:
            raise Exception("Task cannot be empty.")

        model, cfg = self._load_model(model_id)

        resp = []
        flat_labels = labels

        with torch.no_grad():
            for batch in chunks(data, self.batch_size):
                batch_image = self._load_data(src_type, cfg, batch)
                for img in batch_image:
                    img['label_set'] = labels
                    img['tasks'] = task

                batch_y = model(batch_image, score_thresh=conf_threshold, nms_thresh=nms_threshold)

                for z in batch_y:
                    temp = []
                    instances = z['instances'].to('cpu')
                    instances = instances[instances.scores > conf_threshold]

                    for idx, pred in enumerate(zip(instances.pred_boxes, instances.scores, instances.pred_classes)):
                        (x, y, xx, yy), conf, cls = pred
                        conf = float(conf)
                        cls = flat_labels[int(cls)]

                        temp.append({'xmin': int(x),
                                     'ymin': int(y),
                                     'xmax': int(xx),
                                     'ymax': int(yy),
                                     'conf': conf,
                                     'label': cls})
                    resp.append(temp)

        return resp
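
Note on `det_engine.py`: `inf_predict` walks the input list in batches through `chunks`, imported from `omdet.utils.tools`, a file that does not appear in this part of the commit. The sketch below shows what such a helper might look like, assuming it yields consecutive slices of at most `n` items. The actual implementation may differ.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunks(items: Sequence[T], n: int) -> Iterator[Sequence[T]]:
    """Assumed behaviour: yield consecutive slices of at most n items."""
    if n <= 0:
        raise ValueError("n must be positive")
    for start in range(0, len(items), n):
        # The final slice may be shorter than n.
        yield items[start:start + n]


# Five image paths with a batch size of 2 give batches of 2, 2, and 1.
batches = list(chunks(["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"], 2))
```

Because `inf_predict` appends one list of detections per image, the returned `resp` has the same length and order as `data`, however the inputs were split into batches.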
Empty file added omdet/modeling/__init__.py
Empty file.
1 change: 1 addition & 0 deletions omdet/modeling/backbone/__init__.py
@@ -0,0 +1 @@
from omdet.modeling.backbone import (convnext, swint)