Demo video: `CMT_nuScenes_testset.mp4` (detection results on the nuScenes test set).
This repository is an official implementation of CMT.
CMT is a robust end-to-end 3D detector for multi-modal detection. A single DETR-like framework covers both multi-modal detection (CMT) and LiDAR-only detection (CMT-L), which reach 73.5% and 70.1% NDS respectively on the nuScenes benchmark. Without explicit view transformation, CMT takes image and point-cloud tokens as inputs and directly outputs accurate 3D bounding boxes. CMT can serve as a strong baseline for further research.
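To make the token-level fusion idea concrete, below is a minimal, illustrative PyTorch sketch: image and point-cloud tokens, each carrying a coordinate-based position encoding, are concatenated and decoded by a DETR-style transformer decoder with learned object queries. This is a sketch only, not the repository's model; every module name, dimension, and head layout here is an assumption for illustration.

```python
import torch
import torch.nn as nn

class TokenFusionSketch(nn.Module):
    """Illustrative DETR-style multi-modal decoder (NOT the actual CMT code)."""

    def __init__(self, dim=256, num_queries=900, num_classes=10):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)  # learned object queries
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.cls_head = nn.Linear(dim, num_classes)
        self.box_head = nn.Linear(dim, 10)  # e.g. center, size, yaw, velocity

    def forward(self, img_tokens, pts_tokens, img_pos, pts_pos):
        # Coordinate-based position encodings place both modalities in a
        # shared 3D space, so no explicit view transformation is needed.
        memory = torch.cat([img_tokens + img_pos, pts_tokens + pts_pos], dim=1)
        queries = self.queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        feats = self.decoder(queries, memory)  # queries attend to both modalities
        return self.cls_head(feats), self.box_head(feats)

# Shapes are illustrative: (batch, num_tokens, dim).
model = TokenFusionSketch()
img, pts = torch.randn(1, 1000, 256), torch.randn(1, 2000, 256)
cls_logits, boxes = model(img, pts, torch.randn_like(img), torch.randn_like(pts))
```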
## Environments

- Python == 3.8
- CUDA == 11.1
- PyTorch == 1.9.0
- mmdet3d == 1.0.0rc5
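A quick sanity check that the installed packages match the versions above (assumes the environment is already set up):

```python
import torch
import mmdet3d

# Expected per the list above: 1.9.0 / 11.1 / 1.0.0rc5.
print(torch.__version__)
print(torch.version.cuda)
print(mmdet3d.__version__)
```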
## Data

Follow the mmdet3d data preparation guide to process the nuScenes dataset: https://github.com/open-mmlab/mmdetection3d/blob/master/docs/en/data_preparation.md
We provide some results on the nuScenes val set. The default batch size is 2 per GPU.
| config | mAP | NDS | GPUs | schedule | training time |
|---|---|---|---|---|---|
| CMT-pillar0200-r50-704x256 | 53.8% | 58.5% | 8 × RTX 2080 Ti | 20 epochs | 13 hours |
| CMT-voxel0100-r50-800x320 | 60.1% | 63.4% | 8 × RTX 2080 Ti | 20 epochs | 14 hours |
| CMT-voxel0075-vov-1600x640 | 69.4% | 71.9% | 8 × A100 | 15 + 5 epochs (with CBGS) | 45 hours |
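For evaluation, a minimal sketch of building one of the models above with the mmcv/mmdet3d 1.0.0rc5 APIs is shown below; the config path is hypothetical, so substitute the config file corresponding to the table row you want.

```python
from mmcv import Config
from mmdet3d.models import build_model

# Hypothetical path; point this at the config matching a row of the table above.
cfg = Config.fromfile('projects/configs/cmt_voxel0100_r50_800x320.py')

# build_model instantiates the detector class registered in the config.
model = build_model(cfg.model, test_cfg=cfg.get('test_cfg'))
model.eval()
```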
If you find CMT helpful in your research, please consider citing:
@article{yan2023cross,
  title={Cross Modal Transformer via Coordinates Encoding for 3D Object Detection},
  author={Yan, Junjie and Liu, Yingfei and Sun, Jianjian and Jia, Fan and Li, Shuailin and Wang, Tiancai and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2301.01283},
  year={2023}
}
If you have any questions, feel free to open an issue or contact us at [email protected], [email protected], [email protected] or [email protected].