forked from om-ai-lab/OmDet
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing 53 changed files with 7,151 additions and 1 deletion.
Empty file.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,85 @@ | ||
# OmDet | ||
# OmDet-Turbo | ||
|
||
<p align="center"> | ||
<a href="https://arxiv.org/abs/2403.06892"><strong> [Paper 📄] </strong></a> <a href="https://arxiv.org/abs/2403.06892"><strong> [Model 🗂️] </strong></a> | |
</p> | ||
<p align="center"> | ||
Fast and accurate open-vocabulary end-to-end object detection | ||
</p> | ||
|
||
*** | ||
## 🗓️ Updates | ||
* 03/25/2024: Inference code and a pretrained OmDet-Turbo-Tiny model released. | ||
* 03/12/2024: GitHub open-source project created. | |
|
||
*** | ||
## 🔗 Related Works | ||
If you are interested in our research, we welcome you to explore our other wonderful projects. | ||
|
||
🔆 [How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection](https://arxiv.org/abs/2308.13177)(AAAI24) | ||
|
||
🔆 [OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network](https://ietresearch.onlinelibrary.wiley.com/doi/full/10.1049/cvi2.12268)(IET Computer Vision) | ||
|
||
*** | ||
## 📖 Introduction | ||
This repository is the official PyTorch implementation for **OmDet-Turbo**, a fast transformer-based open-vocabulary object detection model. | ||
|
||
**⭐️Highlights** | ||
1. **OmDet-Turbo** is a transformer-based real-time open-vocabulary | ||
detector that combines strong OVD capabilities with fast inference speed. | ||
This model addresses the challenges of efficient detection in open-vocabulary | ||
scenarios while maintaining high detection performance. | ||
2. We introduce the **Efficient Fusion Head**, a swift multimodal fusion module | ||
designed to alleviate the computational burden on the encoder and reduce | ||
the time consumption of the head with ROI. | ||
3. The OmDet-Turbo-Base model achieves state-of-the-art zero-shot performance on the ODinW and OVDEval datasets, with AP scores | |
of **30.1** and **26.86**, respectively. | |
4. The inference speed of OmDet-Turbo-Base on the COCO val2017 dataset reaches **100.2** FPS on an A100 GPU. | |
|
||
For more details, check out our paper **[Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head](https://arxiv.org/abs/2403.06892)** | ||
<img src="docs/turbo_model.jpeg" alt="model_structure" width="100%"> | ||
|
||
|
||
*** | ||
## ⚡️ Inference Speed | ||
Comparison of inference speeds for each component in the tiny-size model. | |
<img src="docs/speed_compare.jpeg" alt="speed" width="100%"> | ||
|
||
*** | ||
## 🛠️ How To Install | ||
Follow the [Installation Instructions](install.md) to set up the environment for OmDet-Turbo. | |
|
||
*** | ||
## 🚀 How To Run | ||
1. Download our pretrained model and the [CLIP](https://huggingface.co/omlab/OmDet-Turbo_tiny_SWIN_T/resolve/main/ViT-B-16.pt?download=true) checkpoint. | |
2. Create a folder named **resources** and put the downloaded models into it. | |
3. Run **run_demo.py**; the images with predicted results will be saved to the **./outputs** folder. | |
|
||
Language caching is already enabled when running inference with **run_demo.py**. For more details, see the **run_demo.py** script. | |
|
||
|
||
*** | ||
## 📦 Model Zoo | ||
Performance on COCO and LVIS is evaluated in the zero-shot setting. | |
|
||
Model | Backbone | Pre-Train Data | COCO | LVIS | FPS (pytorch/trt) |Weight | ||
-- |--------|-----------------| -- | -- |-------------------| -- | ||
OmDet-Turbo-Tiny| Swin-T | O365,GoldG | 42.5 | 30.3 | 21.5/140.0 | [weight](https://huggingface.co/omlab/OmDet-Turbo_tiny_SWIN_T/tree/main) | ||
|
||
*** | ||
## 📝 Main Results | ||
<img src="docs/main_results.png" alt="main_result" width="100%"> | ||
|
||
|
||
*** | ||
## Citation | ||
Please consider citing our papers if you use our projects: | ||
|
||
``` | ||
@article{zhao2024real, | ||
title={Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head}, | ||
author={Zhao, Tiancheng and Liu, Peng and He, Xuan and Zhang, Lu and Lee, Kyusong}, | ||
journal={arXiv preprint arXiv:2403.06892}, | ||
year={2024} | ||
} | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,80 @@ | ||
MODEL: | |
  META_ARCHITECTURE: OmDetV2Turbo | |
  DEPLOY_MODE: true | |
  SWIN: | |
    OUT_FEATURES: | |
      - 1 | |
      - 2 | |
      - 3 | |
    SIZE: T | |
    USE_CHECKPOINT: false | |
  BACKBONE: | |
    NAME: build_swintransformer_backbone | |
  LANGUAGE_BACKBONE: | |
    MODEL_TYPE: "clip" | |
    LANG_DIM: 512 | |
  DEVICE: cuda | |
  FUSE_TYPE: merged_attn | |
  TRANSFORMER_DECODER: ELADecoder | |
  TRANSFORMER_ENCODER: ELAEncoder | |
  HEAD: DINOHead | |
  ELAEncoder: | |
    act: gelu | |
    depth_mult: 1.0 | |
    dim_feedforward: 2048 | |
    encoder_layer: TransformerLayer | |
    eval_size: null | |
    expansion: 1.0 | |
    feat_strides: | |
      - 8 | |
      - 16 | |
      - 32 | |
    hidden_dim: 256 | |
    in_channels: | |
      - 192 | |
      - 384 | |
      - 768 | |
    num_encoder_layers: 1 | |
    pe_temperature: 10000 | |
    use_encoder_idx: | |
      - 2 | |
  PIXEL_MEAN: | |
    - 123.675 | |
    - 116.28 | |
    - 103.53 | |
  PIXEL_STD: | |
    - 58.395 | |
    - 57.12 | |
    - 57.375 | |
  ELADecoder: | |
    activation: relu | |
    backbone_feat_channels: | |
      - 256 | |
      - 256 | |
      - 256 | |
    box_noise_scale: 1.0 | |
    cls_type: cosine | |
    dim_feedforward: 2048 | |
    dropout: 0.0 | |
    eps: 0.01 | |
    eval_idx: -1 | |
    eval_size: null | |
    feat_strides: | |
      - 8 | |
      - 16 | |
      - 32 | |
    hidden_dim: 256 | |
    label_noise_ratio: 0.5 | |
    learnt_init_query: false | |
    nhead: 8 | |
    num_decoder_layers: 6 | |
    num_decoder_points: 4 | |
    num_denoising: 100 | |
    num_levels: 3 | |
    num_queries: 900 | |
    position_embed_type: sine | |
  WEIGHTS: resources/swin_tiny_patch4_window7_224.pkl | |
INPUT: | |
  FORMAT: RGB | |
  MAX_SIZE_TEST: 640 | |
  MIN_SIZE_TEST: 640 |
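
To make the relationship between `MIN_SIZE_TEST`/`MAX_SIZE_TEST` and the `feat_strides` entries concrete, the following is a small arithmetic sketch, not code from this commit. It assumes a square 640×640 input and that each stride produces a feature map of side `input_size // stride`; the helper name `feature_map_sizes` is hypothetical.

```python
def feature_map_sizes(input_size, strides):
    """Return {stride: (side, tokens)} for a square input of side `input_size`."""
    sizes = {}
    for stride in strides:
        side = input_size // stride          # spatial side of the feature map
        sizes[stride] = (side, side * side)  # tokens = number of spatial positions
    return sizes


# Values taken from the config above: 640x640 test input, strides 8/16/32.
sizes = feature_map_sizes(640, [8, 16, 32])
for stride, (side, tokens) in sizes.items():
    print(f"stride {stride:>2}: {side}x{side} map, {tokens} tokens")

total_tokens = sum(tokens for _, tokens in sizes.values())
print("total tokens over all levels:", total_tokens)
```

Under these assumptions the coarsest level (stride 32) is by far the smallest, which is one plausible reason a config would restrict an expensive layer to a single level, as `use_encoder_idx` appears to do here.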
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
# Install | ||
## Requirements | ||
|
||
* CUDA>=11.8 | ||
|
||
* Python>=3.9 | ||
|
||
Create a Python environment: | |
```bash | ||
conda create -n omdet python=3.9 | ||
``` | ||
Activate the environment: | ||
```bash | ||
conda activate omdet | ||
``` | ||
|
||
* Pytorch>=2.1.0, Torchvision>=0.17.1 | ||
|
||
If your CUDA version is 11.8, you can install PyTorch as follows: | |
```bash | ||
conda install pytorch==2.2.1 torchvision==0.17.1 pytorch-cuda=11.8 -c pytorch -c nvidia | |
``` | ||
|
||
* detectron2>=0.6.0: | ||
|
||
Install detectron2: | ||
```bash | ||
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git' | ||
``` | ||
|
||
* Other requirements | ||
```bash | ||
pip install -r requirements.txt | ||
``` |
Empty file.
Empty file.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,57 @@ | ||
import torch | ||
from PIL import Image | ||
import requests | ||
import io | ||
import base64 | ||
from detectron2.data.detection_utils import _apply_exif_orientation, convert_PIL_to_numpy | ||
import numpy as np | ||
|
||
|
||
def get_output_shape(oldh: int, oldw: int, short_edge_length: int, max_size: int): | |
    """ | |
    Compute the output size given input size and target short edge length. | |
    """ | |
    h, w = oldh, oldw | |
    size = short_edge_length * 1.0 | |
    scale = size / min(h, w) | |
    if h < w: | |
        newh, neww = size, scale * w | |
    else: | |
        newh, neww = scale * h, size | |
    if max(newh, neww) > max_size: | |
        scale = max_size * 1.0 / max(newh, neww) | |
        newh = newh * scale | |
        neww = neww * scale | |
    neww = int(neww + 0.5) | |
    newh = int(newh + 0.5) | |
    return (newh, neww) | |
|
||
|
||
class BaseEngine(object): | |
    def _load_data(self, src_type, cfg, data, return_transform=False): | |
        if src_type == 'local': | |
            image_data = [Image.open(x) for x in data] | |
|
||
        elif src_type == 'url': | |
            image_data = [] | |
            for x in data: | |
                temp = Image.open(io.BytesIO(requests.get(x).content)) | |
                image_data.append(temp) | |
|
||
        else: | |
            raise Exception("Unknown mode {}.".format(src_type)) | |
|
||
        input_data = [] | |
        transforms = [] | |
        for x in image_data: | |
            width, height = x.size | |
            pil_image = x.resize((cfg.INPUT.MIN_SIZE_TEST, cfg.INPUT.MIN_SIZE_TEST), Image.BILINEAR) | |
            image = convert_PIL_to_numpy(pil_image, cfg.INPUT.FORMAT) | |
|
||
            image = torch.as_tensor(np.ascontiguousarray(image.transpose(2, 0, 1))) | |
            input_data.append({"image": image, "height": height, "width": width}) | |
|
||
        if return_transform: | |
            return input_data, transforms | |
        else: | |
            return input_data |
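
The file defines `get_output_shape`, while `_load_data` resizes every image directly to a `MIN_SIZE_TEST` × `MIN_SIZE_TEST` square. To illustrate what the helper computes, the sketch below restates it as a standalone function with the same arithmetic and evaluates it on two example image sizes; the example sizes are made up for illustration.

```python
def get_output_shape(oldh: int, oldw: int, short_edge_length: int, max_size: int):
    """Aspect-preserving output size: scale the short edge, then cap the long edge."""
    h, w = oldh, oldw
    size = short_edge_length * 1.0
    scale = size / min(h, w)
    if h < w:
        newh, neww = size, scale * w
    else:
        newh, neww = scale * h, size
    if max(newh, neww) > max_size:
        # The long edge exceeds the cap, so shrink both sides by the same factor.
        scale = max_size * 1.0 / max(newh, neww)
        newh = newh * scale
        neww = neww * scale
    return (int(newh + 0.5), int(neww + 0.5))


landscape = get_output_shape(480, 640, short_edge_length=640, max_size=640)
portrait = get_output_shape(1000, 500, short_edge_length=640, max_size=640)
print("480x640 (h x w) ->", landscape)   # (480, 640)
print("1000x500 (h x w) ->", portrait)   # (640, 320)
```

With both limits set to 640, the aspect-preserving result never exceeds 640 on either side, whereas the square resize in `_load_data` always produces 640×640 and therefore changes the aspect ratio of non-square inputs.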
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,95 @@ | ||
import os | ||
import torch | ||
from typing import List, Union, Dict | ||
from omdet.utils.tools import chunks | ||
from detectron2.checkpoint import DetectionCheckpointer | ||
from detectron2.config import get_cfg | ||
from detectron2.engine import DefaultTrainer as Trainer | ||
from omdet.utils.cache import LRUCache | ||
from omdet.infernece.base_engine import BaseEngine | ||
from detectron2.utils.logger import setup_logger | ||
from omdet.omdet_v2_turbo.config import add_omdet_v2_turbo_config | ||
|
||
|
||
class DetEngine(BaseEngine): | |
    def __init__(self, model_dir='resources/', device='cpu', batch_size=10): | |
        self.model_dir = model_dir | |
        self._models = LRUCache(10) | |
        self.device = device | |
        self.batch_size = batch_size | |
        self.logger = setup_logger(name=__name__) | |
|
||
    def _init_cfg(self, cfg, model_id): | |
        cfg.MODEL.WEIGHTS = os.path.join(self.model_dir, model_id+'.pth') | |
        cfg.MODEL.DEVICE = self.device | |
        cfg.INPUT.MAX_SIZE_TEST = 640 | |
        cfg.INPUT.MIN_SIZE_TEST = 640 | |
        cfg.MODEL.DEPLOY_MODE = True | |
        cfg.freeze() | |
        return cfg | |
|
||
    def count_parameters(self, model): | |
        return sum(p.numel() for p in model.parameters()) | |
|
||
    def _load_model(self, model_id): | |
        if not self._models.has(model_id): | |
            cfg = get_cfg() | |
            add_omdet_v2_turbo_config(cfg) | |
            cfg.merge_from_file(os.path.join('configs', model_id+'.yaml')) | |
            cfg = self._init_cfg(cfg, model_id) | |
            model = Trainer.build_model(cfg) | |
            self.logger.info("Model:\n{}".format(model)) | |
            DetectionCheckpointer(model).load(cfg.MODEL.WEIGHTS) | |
            print("Loading a OmDet model {}".format(cfg.MODEL.WEIGHTS)) | |
            model.eval() | |
            model.to(cfg.MODEL.DEVICE) | |
            print("Total parameters: {}".format(self.count_parameters(model))) | |
            self._models.put(model_id, (model, cfg)) | |
|
||
        return self._models.get(model_id) | |
|
||
    def inf_predict(self, model_id, | |
                    data: List, | |
                    task: Union[str, List], | |
                    labels: List[str], | |
                    src_type: str = 'local', | |
                    conf_threshold: float = 0.5, | |
                    nms_threshold: float = 0.5 | |
                    ): | |
|
||
        if len(task) == 0: | |
            raise Exception("Task cannot be empty.") | |
|
||
        model, cfg = self._load_model(model_id) | |
|
||
        resp = [] | |
        flat_labels = labels | |
|
||
        with torch.no_grad(): | |
            for batch in chunks(data, self.batch_size): | |
                batch_image = self._load_data(src_type, cfg, batch) | |
                for img in batch_image: | |
                    img['label_set'] = labels | |
                    img['tasks'] = task | |
|
||
                batch_y = model(batch_image, score_thresh=conf_threshold, nms_thresh=nms_threshold) | |
|
||
                for z in batch_y: | |
                    temp = [] | |
                    instances = z['instances'].to('cpu') | |
                    instances = instances[instances.scores > conf_threshold] | |
|
||
                    for idx, pred in enumerate(zip(instances.pred_boxes, instances.scores, instances.pred_classes)): | |
                        (x, y, xx, yy), conf, cls = pred | |
                        conf = float(conf) | |
                        cls = flat_labels[int(cls)] | |
|
||
                        temp.append({'xmin': int(x), | |
                                     'ymin': int(y), | |
                                     'xmax': int(xx), | |
                                     'ymax': int(yy), | |
                                     'conf': conf, | |
                                     'label': cls}) | |
                    resp.append(temp) | |
|
||
        return resp |
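
`DetEngine` relies on two helpers that are imported but not shown in this part of the diff: `LRUCache` (used through `has`, `put`, and `get`) and `chunks` (used to split `data` into batches). The sketch below shows one way such helpers could behave. Both definitions are hypothetical stand-ins written for illustration, assuming a least-recently-used eviction policy and fixed-size slicing; the actual implementations in `omdet.utils` may differ.

```python
from collections import OrderedDict


class LRUCache:
    """Hypothetical stand-in exposing the has/put/get interface used by DetEngine."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def has(self, key):
        return key in self._data

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used entry


def chunks(items, size):
    """Hypothetical stand-in: yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


cache = LRUCache(2)
cache.put("model_a", "A")
cache.put("model_b", "B")
cache.get("model_a")       # touch model_a, so model_b becomes least recently used
cache.put("model_c", "C")  # capacity exceeded: model_b is evicted
print(cache.has("model_a"), cache.has("model_b"), cache.has("model_c"))

batches = list(chunks(["1.jpg", "2.jpg", "3.jpg"], 2))
print(batches)
```

With a cache like this, repeated calls to `inf_predict` with the same `model_id` reuse the already loaded `(model, cfg)` pair instead of rebuilding the model and reloading its weights.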
Empty file.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
from omdet.modeling.backbone import (convnext, swint) |