First commit

CaramelMario · Mar 25, 2024 · 02117ef
1 parent fb4293e
Showing 53 changed files with 7,151 additions and 1 deletion.
diff --git a/.gitignore b/.gitignore
diff --git a/LICENSE b/LICENSE
diff --git a/README.md b/README.md
@@ -1,2 +1,85 @@
-# OmDet
+# OmDet-Turbo
+
+<p align="center">
+ <a href="https://arxiv.org/abs/2403.06892"><strong> [Paper 📄] </strong></a> <a href="https://arxiv.org/abs/2403.06892"><strong> [Model 🗂️] </strong></a>
+</p>
+<p align="center">
 Fast and accurate open-vocabulary end-to-end object detection
+</p>
+
+***
+## 🗓️ Updates
+* 03/25/2024: Inference code and a pretrained OmDet-Turbo-Tiny model released.
+* 03/12/2024: GitHub open-source project created.
+
+***
+## 🔗 Related Works
+If you are interested in our research, we welcome you to explore our other wonderful projects.
+
+🔆 [How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection](https://arxiv.org/abs/2308.13177)(AAAI24)
+
+🔆 [OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network](https://ietresearch.onlinelibrary.wiley.com/doi/full/10.1049/cvi2.12268)(IET Computer Vision)
+
+***
+## 📖 Introduction
+This repository is the official PyTorch implementation for **OmDet-Turbo**, a fast transformer-based open-vocabulary object detection model.
+
+**⭐️Highlights**
+1. **OmDet-Turbo** is a transformer-based real-time open-vocabulary
+detector that combines strong OVD capabilities with fast inference speed.
+This model addresses the challenges of efficient detection in open-vocabulary
+scenarios while maintaining high detection performance.
+2. We introduce the **Efficient Fusion Head**, a swift multimodal fusion module
+designed to alleviate the computational burden on the encoder and reduce
+the time consumption of the head with ROI. 
+3. The OmDet-Turbo-Base model achieves state-of-the-art zero-shot performance on the ODinW and OVDEval datasets, with AP scores
+of **30.1** and **26.86**, respectively. 
+4. The inference speed of OmDet-Turbo-Base on the COCO val2017 dataset reaches **100.2** FPS on an A100 GPU.
+
+For more details, check out our paper **[Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head](https://arxiv.org/abs/2403.06892)**.
+<img src="docs/turbo_model.jpeg" alt="model_structure" width="100%">
+
+
+***
+## ⚡️ Inference Speed
+Comparison of inference speeds for each component in the tiny-size model.
+<img src="docs/speed_compare.jpeg" alt="speed" width="100%">
+
+***
+## 🛠️ How To Install 
+Follow the [Installation Instructions](install.md) to set up the environment for OmDet-Turbo.
+
+***
+## 🚀 How To Run
+1. Download our pretrained model and the [CLIP](https://huggingface.co/omlab/OmDet-Turbo_tiny_SWIN_T/resolve/main/ViT-B-16.pt?download=true) checkpoints.
+2. Create a folder named **resources** and put the downloaded models into it.
+3. Run **run_demo.py**; the images with predicted results will be saved in the **./outputs** folder.
+
+Language caching is already enabled when running inference with **run_demo.py**. For more details, see the **run_demo.py** script.
+
+
+***
+## 📦 Model Zoo
+Performance on COCO and LVIS is evaluated under the zero-shot setting.
+
+Model | Backbone | Pre-Train Data | COCO | LVIS | FPS (PyTorch/TensorRT) | Weight
+-- | -- | -- | -- | -- | -- | --
+OmDet-Turbo-Tiny | Swin-T | O365, GoldG | 42.5 | 30.3 | 21.5/140.0 | [weight](https://huggingface.co/omlab/OmDet-Turbo_tiny_SWIN_T/tree/main)
+
+***
+## 📝 Main Results
+<img src="docs/main_results.png" alt="main_result" width="100%">
+
+
+***
+## Citation
+Please consider citing our papers if you use our projects:
+
+```
+@article{zhao2024real,
+  title={Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head},
+  author={Zhao, Tiancheng and Liu, Peng and He, Xuan and Zhang, Lu and Lee, Kyusong},
+  journal={arXiv preprint arXiv:2403.06892},
+  year={2024}
+}
+```
diff --git a/configs/OmDet-Turbo_tiny_SWIN_T.yaml b/configs/OmDet-Turbo_tiny_SWIN_T.yaml
@@ -0,0 +1,80 @@
+MODEL:
+  META_ARCHITECTURE: OmDetV2Turbo
+  DEPLOY_MODE: true
+  SWIN:
+    OUT_FEATURES:
+      - 1
+      - 2
+      - 3
+    SIZE: T
+    USE_CHECKPOINT: false
+  BACKBONE:
+    NAME: build_swintransformer_backbone
+  LANGUAGE_BACKBONE:
+    MODEL_TYPE: "clip"
+    LANG_DIM: 512
+  DEVICE: cuda
+  FUSE_TYPE: merged_attn
+  TRANSFORMER_DECODER: ELADecoder
+  TRANSFORMER_ENCODER: ELAEncoder
+  HEAD: DINOHead
+  ELAEncoder:
+    act: gelu
+    depth_mult: 1.0
+    dim_feedforward: 2048
+    encoder_layer: TransformerLayer
+    eval_size: null
+    expansion: 1.0
+    feat_strides:
+    - 8
+    - 16
+    - 32
+    hidden_dim: 256
+    in_channels:
+    - 192
+    - 384
+    - 768
+    num_encoder_layers: 1
+    pe_temperature: 10000
+    use_encoder_idx:
+    - 2
+  PIXEL_MEAN:
+  - 123.675
+  - 116.28
+  - 103.53
+  PIXEL_STD:
+  - 58.395
+  - 57.12
+  - 57.375
+  ELADecoder:
+    activation: relu
+    backbone_feat_channels:
+    - 256
+    - 256
+    - 256
+    box_noise_scale: 1.0
+    cls_type: cosine
+    dim_feedforward: 2048
+    dropout: 0.0
+    eps: 0.01
+    eval_idx: -1
+    eval_size: null
+    feat_strides:
+    - 8
+    - 16
+    - 32
+    hidden_dim: 256
+    label_noise_ratio: 0.5
+    learnt_init_query: false
+    nhead: 8
+    num_decoder_layers: 6
+    num_decoder_points: 4
+    num_denoising: 100
+    num_levels: 3
+    num_queries: 900
+    position_embed_type: sine
+  WEIGHTS: resources/swin_tiny_patch4_window7_224.pkl
+INPUT:
+  FORMAT: RGB
+  MAX_SIZE_TEST: 640
+  MIN_SIZE_TEST: 640
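The shape-related values in this config are mutually consistent, which a few lines of arithmetic can confirm. The sketch below is not part of the commit. It assumes that Swin-T uses a base embedding width of 96 that doubles at each stage, that `OUT_FEATURES` indexes those stages from 0, and that each entry in `feat_strides` is the downsampling factor of one feature level. The helper names are hypothetical.

```python
# Values copied from configs/OmDet-Turbo_tiny_SWIN_T.yaml.
OUT_FEATURES = [1, 2, 3]
IN_CHANNELS = [192, 384, 768]
FEAT_STRIDES = [8, 16, 32]
NUM_LEVELS = 3
INPUT_SIZE = 640  # MIN_SIZE_TEST == MAX_SIZE_TEST

# Assumption: Swin-T stage i outputs 96 * 2**i channels.
SWIN_T_EMBED_DIM = 96


def stage_channels(stage_indices, embed_dim=SWIN_T_EMBED_DIM):
    """Channel width of each selected backbone stage."""
    return [embed_dim * 2 ** i for i in stage_indices]


def feature_map_sizes(input_size, strides):
    """Side length of each (square) feature map for a square input."""
    return [input_size // s for s in strides]


channels = stage_channels(OUT_FEATURES)
sizes = feature_map_sizes(INPUT_SIZE, FEAT_STRIDES)
total_positions = sum(s * s for s in sizes)

print("backbone channels:", channels)
print("feature map sides:", sizes)
print("positions across levels:", total_positions)
```

Under these assumptions the computed channels match `in_channels`, and a 640×640 input yields feature maps of 80, 40, and 20 cells per side, or 8,400 positions across the three levels.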
diff --git a/docs/main_results.png b/docs/main_results.png
diff --git a/docs/speed_compare.jpeg b/docs/speed_compare.jpeg
diff --git a/docs/turbo_model.jpeg b/docs/turbo_model.jpeg
diff --git a/install.md b/install.md
@@ -0,0 +1,34 @@
+# Install
+## Requirements
+
+* CUDA>=11.8
+
+* Python>=3.9
+
+  Create a Python environment:
+  ```bash
+  conda create -n omdet python=3.9
+  ```
+  Activate the environment:
+  ```bash
+  conda activate omdet
+  ```
+
+* Pytorch>=2.1.0, Torchvision>=0.16.0
+
+  If your CUDA version is 11.8, you can install PyTorch as follows:
+  ```bash
+  conda install pytorch==2.1.0 torchvision==0.16.0 pytorch-cuda=11.8 -c pytorch -c nvidia
+  ```
+
+* detectron2>=0.6.0:
+
+  Install detectron2:
+  ```bash
+  python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
+  ```
+
+* Other requirements
+    ```bash
+    pip install -r requirements.txt
+    ```
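After following these steps, it can help to confirm what actually got installed. The sketch below is not part of the repository. It uses only the Python standard library, and the function names `python_ok` and `installed_versions` are hypothetical. It reports versions rather than enforcing the minimums listed above.

```python
import sys
from importlib import metadata

REQUIRED_PYTHON = (3, 9)
PACKAGES = ["torch", "torchvision", "detectron2"]


def python_ok(minimum=REQUIRED_PYTHON):
    """True if the running interpreter is at least the required version."""
    return sys.version_info[:2] >= minimum


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions()
print("Python >= 3.9:", python_ok())
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```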
diff --git a/omdet/__init__.py b/omdet/__init__.py
diff --git a/omdet/infernece/__init__.py b/omdet/infernece/__init__.py
diff --git a/omdet/infernece/base_engine.py b/omdet/infernece/base_engine.py
@@ -0,0 +1,57 @@
+import torch
+from PIL import Image
+import requests
+import io
+import base64
+from detectron2.data.detection_utils import _apply_exif_orientation, convert_PIL_to_numpy
+import numpy as np
+
+
+def get_output_shape(oldh: int, oldw: int, short_edge_length: int, max_size: int):
+    """
+    Compute the output size given input size and target short edge length.
+    """
+    h, w = oldh, oldw
+    size = short_edge_length * 1.0
+    scale = size / min(h, w)
+    if h < w:
+        newh, neww = size, scale * w
+    else:
+        newh, neww = scale * h, size
+    if max(newh, neww) > max_size:
+        scale = max_size * 1.0 / max(newh, neww)
+        newh = newh * scale
+        neww = neww * scale
+    neww = int(neww + 0.5)
+    newh = int(newh + 0.5)
+    return (newh, neww)
+
+
+class BaseEngine(object):
+    def _load_data(self, src_type, cfg, data, return_transform=False):
+        if src_type == 'local':
+            image_data = [Image.open(x) for x in data]
+
+        elif src_type == 'url':
+            image_data = []
+            for x in data:
+                temp = Image.open(io.BytesIO(requests.get(x).content))
+                image_data.append(temp)
+
+        else:
+            raise ValueError("Unknown src_type {}.".format(src_type))
+
+        input_data = []
+        transforms = []  # NOTE: never populated; returned empty when return_transform=True
+        for x in image_data:
+            width, height = x.size
+            pil_image = x.resize((cfg.INPUT.MIN_SIZE_TEST, cfg.INPUT.MIN_SIZE_TEST), Image.BILINEAR)
+            image = convert_PIL_to_numpy(pil_image, cfg.INPUT.FORMAT)
+
+            image = torch.as_tensor(np.ascontiguousarray(image.transpose(2, 0, 1)))
+            input_data.append({"image": image, "height": height, "width": width})
+
+        if return_transform:
+            return input_data, transforms
+        else:
+            return input_data
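`get_output_shape` is defined in this file, but `_load_data` does not call it: images are resized straight to a `MIN_SIZE_TEST` × `MIN_SIZE_TEST` square, which changes the aspect ratio of non-square inputs. The sketch below copies the function with its arithmetic unchanged and compares the two behaviours. `square_resize_shape` is a hypothetical helper added only for the comparison.

```python
def get_output_shape(oldh: int, oldw: int, short_edge_length: int, max_size: int):
    """Scale so the short edge equals short_edge_length, then cap the long edge at max_size."""
    h, w = oldh, oldw
    size = short_edge_length * 1.0
    scale = size / min(h, w)
    if h < w:
        newh, neww = size, scale * w
    else:
        newh, neww = scale * h, size
    if max(newh, neww) > max_size:
        # Long edge is too large: shrink both edges by the same factor.
        scale = max_size * 1.0 / max(newh, neww)
        newh = newh * scale
        neww = neww * scale
    return (int(newh + 0.5), int(neww + 0.5))


def square_resize_shape(size: int):
    """Shape produced by the resize in _load_data: always size x size."""
    return (size, size)


for h, w in [(480, 640), (1080, 1920), (500, 500)]:
    print((h, w), "->", get_output_shape(h, w, 640, 640), "vs", square_resize_shape(640))
```

For a 480×640 image the aspect-preserving shape is (480, 640), whereas the square resize used in `_load_data` gives (640, 640). The original `height` and `width` are still passed along with each image, as the dictionary built in `_load_data` shows.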
diff --git a/omdet/infernece/det_engine.py b/omdet/infernece/det_engine.py
@@ -0,0 +1,95 @@
+import os
+import torch
+from typing import List, Union, Dict
+from omdet.utils.tools import chunks
+from detectron2.checkpoint import DetectionCheckpointer
+from detectron2.config import get_cfg
+from detectron2.engine import DefaultTrainer as Trainer
+from omdet.utils.cache import LRUCache
+from omdet.infernece.base_engine import BaseEngine
+from detectron2.utils.logger import setup_logger
+from omdet.omdet_v2_turbo.config import add_omdet_v2_turbo_config
+
+
+class DetEngine(BaseEngine):
+    def __init__(self, model_dir='resources/', device='cpu', batch_size=10):
+        self.model_dir = model_dir
+        self._models = LRUCache(10)
+        self.device = device
+        self.batch_size = batch_size
+        self.logger = setup_logger(name=__name__)
+
+    def _init_cfg(self, cfg, model_id):
+        cfg.MODEL.WEIGHTS = os.path.join(self.model_dir, model_id+'.pth')
+        cfg.MODEL.DEVICE = self.device
+        cfg.INPUT.MAX_SIZE_TEST = 640
+        cfg.INPUT.MIN_SIZE_TEST = 640
+        cfg.MODEL.DEPLOY_MODE = True
+        cfg.freeze()
+        return cfg
+
+    def count_parameters(self, model):
+        return sum(p.numel() for p in model.parameters())
+
+    def _load_model(self, model_id):
+        if not self._models.has(model_id):
+            cfg = get_cfg()
+            add_omdet_v2_turbo_config(cfg)
+            cfg.merge_from_file(os.path.join('configs', model_id+'.yaml'))
+            cfg = self._init_cfg(cfg, model_id)
+            model = Trainer.build_model(cfg)
+            self.logger.info("Model:\n{}".format(model))
+            DetectionCheckpointer(model).load(cfg.MODEL.WEIGHTS)
+            print("Loading an OmDet model {}".format(cfg.MODEL.WEIGHTS))
+            model.eval()
+            model.to(cfg.MODEL.DEVICE)
+            print("Total parameters: {}".format(self.count_parameters(model)))
+            self._models.put(model_id, (model, cfg))
+
+        return self._models.get(model_id)
+
+    def inf_predict(self, model_id,
+                    data: List,
+                    task: Union[str, List],
+                    labels: List[str],
+                    src_type: str = 'local',
+                    conf_threshold: float = 0.5,
+                    nms_threshold: float = 0.5
+                    ):
+
+        if len(task) == 0:
+            raise ValueError("Task cannot be empty.")
+
+        model, cfg = self._load_model(model_id)
+
+        resp = []
+        flat_labels = labels
+
+        with torch.no_grad():
+            for batch in chunks(data, self.batch_size):
+                batch_image = self._load_data(src_type, cfg, batch)
+                for img in batch_image:
+                    img['label_set'] = labels
+                    img['tasks'] = task
+
+                batch_y = model(batch_image, score_thresh=conf_threshold, nms_thresh=nms_threshold)
+
+                for z in batch_y:
+                    temp = []
+                    instances = z['instances'].to('cpu')
+                    instances = instances[instances.scores > conf_threshold]
+
+                    for pred in zip(instances.pred_boxes, instances.scores, instances.pred_classes):
+                        (x, y, xx, yy), conf, cls = pred
+                        conf = float(conf)
+                        cls = flat_labels[int(cls)]
+
+                        temp.append({'xmin': int(x),
+                                     'ymin': int(y),
+                                     'xmax': int(xx),
+                                     'ymax': int(yy),
+                                     'conf': conf,
+                                     'label': cls})
+                    resp.append(temp)
+
+        return resp
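`DetEngine` depends on two helpers whose source is not shown in this part of the diff: `chunks` from `omdet.utils.tools` and `LRUCache` from `omdet.utils.cache`. The sketch below shows one way such helpers could behave, inferred only from how they are called here (`chunks(data, batch_size)`, and `has`/`put`/`get` on the cache). It is an assumption for illustration, not the repository's implementation.

```python
from collections import OrderedDict


def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


class LRUCache:
    """Keep at most `capacity` entries, evicting the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def has(self, key):
        return key in self._data

    def get(self, key):
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # drop the oldest entry


batches = list(chunks(["a.jpg", "b.jpg", "c.jpg"], 2))
cache = LRUCache(2)
cache.put("m1", "model-1")
cache.put("m2", "model-2")
cache.get("m1")             # m1 becomes most recently used
cache.put("m3", "model-3")  # evicts m2
print(batches, cache.has("m1"), cache.has("m2"), cache.has("m3"))
```

With `LRUCache(10)` as in `__init__`, up to ten loaded models would stay in memory under this sketch before the least recently used one is dropped.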
diff --git a/omdet/modeling/__init__.py b/omdet/modeling/__init__.py
diff --git a/omdet/modeling/backbone/__init__.py b/omdet/modeling/backbone/__init__.py
@@ -0,0 +1 @@
+from omdet.modeling.backbone import (convnext,  swint)