A Multilingual, Open World Video Text Dataset and End-to-end Video Text Spotter with Transformer
Link to our MOVText: A Large-Scale, Multilingual Open World Dataset for Video Text Spotting
- (11/05/2022) TransDETR, a better transformer-based video text spotting method, has been launched.
- (08/04/2021) Refactoring the code.
- (10/20/2021) The complete code has been released.
Methods | MOTA (%) | MOTP (%) | IDF1 (%) | Mostly Matched | Partially Matched | Mostly Lost |
---|---|---|---|---|---|---|
TransVTSpotter | 45.75 | 73.58 | 57.56 | 658 | 611 | 647 |
- Training is performed on 8 NVIDIA V100 GPUs with a total batch size of 16 (batch size 2 per GPU, as in the commands below).
- We use the models pre-trained on COCOTextV2.
- We do not release the recognition code due to company policy.
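For reference, MOTA in the table above is the standard CLEAR-MOT accuracy. A minimal helper (not code from this repo) showing how it is computed from error counts; the example numbers are purely illustrative:

```python
def mota(false_negatives: int, false_positives: int, id_switches: int,
         num_gt_boxes: int) -> float:
    """CLEAR-MOT accuracy: 1 - (FN + FP + ID switches) / total GT boxes."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt_boxes

# Illustrative example: 1,000 GT boxes with 300 misses, 200 false positives,
# and 43 identity switches gives MOTA = 1 - 543/1000 = 45.70%.
print(f"{mota(300, 200, 43, 1000):.2%}")
```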
The codebase is built on top of Deformable DETR and TransTrack.
- Linux, CUDA>=9.2, GCC>=5.4
- Python>=3.7
- PyTorch>=1.5 and a torchvision version that matches the PyTorch installation. You can install them together at pytorch.org to make sure of this.
- OpenCV is optional; it is needed only for the demo and visualization.
- Install and build libs

```
git clone git@github.com:weijiawu/TransVTSpotter.git
cd TransVTSpotter
cd models/ops
python setup.py build install
cd ../..
pip install -r requirements.txt
```
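After building, a quick sanity check can confirm the environment. This is a minimal sketch: the extension name `MultiScaleDeformableAttention` is what Deformable DETR's `setup.py` builds, and since `models/ops` comes from that codebase, the same name is assumed here.

```python
# Minimal post-install sanity check (extension name assumed from Deformable DETR).
import torch
import torchvision

print("PyTorch:", torch.__version__)            # expected >= 1.5
print("torchvision:", torchvision.__version__)  # must match the PyTorch build
print("CUDA available:", torch.cuda.is_available())

try:
    import MultiScaleDeformableAttention  # built by `python setup.py build install`
    print("Deformable attention op: OK")
except ImportError as exc:
    print("Deformable attention op failed to import:", exc)
```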
- Prepare datasets and annotations

The COCOTextV2 dataset is available at COCOTextV2.

```
python3 track_tools/convert_COCOText_to_coco.py
```

The ICDAR2015 video dataset is available at icdar2015.

```
python3 track_tools/convert_ICDAR15video_to_coco.py
```
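Both scripts write COCO-style JSON annotations. A minimal sketch of the structure such conversions produce, using the standard COCO fields; the exact keys and file layout used by this repo's scripts are assumptions:

```python
import json

# Skeleton of a COCO-style annotation file (standard COCO fields; any
# video/frame-specific extras written by the real scripts are assumed).
coco = {
    "images": [
        # one entry per frame
        {"id": 1, "file_name": "Video_1/frame_0001.jpg", "width": 1280, "height": 720},
    ],
    "annotations": [
        # one entry per text instance; bbox is [x, y, w, h]
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [100, 200, 50, 20], "area": 1000, "iscrowd": 0},
    ],
    "categories": [{"id": 1, "name": "text"}],
}

with open("annotations.json", "w") as f:
    json.dump(coco, f)
```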
- Pre-train on COCOTextV2

```
python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main_track.py --output_dir ./output/Pretrain_COCOTextV2 --dataset_file pretrain --coco_path ./Data/COCOTextV2 --batch_size 2 --with_box_refine --num_queries 500 --epochs 300 --lr_drop 100 --resume ./output/Pretrain_COCOTextV2/checkpoint.pth
python3 track_tools/Pretrain_model_to_mot.py
```
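Pretrain_model_to_mot.py adapts the pre-trained detection checkpoint into the pretrain_coco.pth initialization used by the training command below. A minimal sketch of what such a conversion typically does; the paths and key names are assumptions, and the real script may additionally remap or duplicate head weights:

```python
import torch

# Hypothetical sketch: keep only the model weights from the full training
# checkpoint so they can be loaded as initialization for the tracking model.
ckpt = torch.load("output/Pretrain_COCOTextV2/checkpoint.pth", map_location="cpu")
torch.save({"model": ckpt["model"]}, "output/Pretrain_COCOTextV2/pretrain_coco.pth")
```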
The pre-trained model is available on Baidu Netdisk (password: 59w8) and Google Netdisk.
The trained model achieving 44% MOTA can be found here (password: xnlw) and on Google Netdisk.
- Train TransVTSpotter

```
python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main_track.py --output_dir ./output/ICDAR15 --dataset_file text --coco_path ./Data/ICDAR2015_video --batch_size 2 --with_box_refine --num_queries 300 --epochs 80 --lr_drop 40 --resume ./output/Pretrain_COCOTextV2/pretrain_coco.pth
```
- Inference and Visualize TransVTSpotter

```
# Inference
python3 main_track.py --output_dir ./output/ICDAR15 --dataset_file text --coco_path ./Data/ICDAR2015_video --batch_size 1 --resume ./output/ICDAR15/checkpoint.pth --eval --with_box_refine --num_queries 300 --track_thresh 0.3

# Visualize
python3 track_tools/Evaluation_ICDAR15_video/vis_tracking.py
```
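vis_tracking.py renders the tracking output. A minimal sketch of such a visualization with OpenCV; the (frame_id, track_id, x, y, w, h) result format and the frame paths are assumptions, not the repo's actual output format:

```python
import os
from collections import defaultdict

import cv2

# Hypothetical tracking results: one (frame_id, track_id, x, y, w, h) row per box.
results = [
    (1, 7, 100, 200, 50, 20),
    (1, 9, 400, 80, 120, 30),
    (2, 7, 104, 201, 50, 20),
]

# Group boxes by frame so each frame image is read and written once.
boxes_per_frame = defaultdict(list)
for frame_id, track_id, x, y, w, h in results:
    boxes_per_frame[frame_id].append((track_id, x, y, w, h))

os.makedirs("vis", exist_ok=True)
for frame_id, boxes in sorted(boxes_per_frame.items()):
    frame = cv2.imread(f"Video_1/frame_{frame_id:04d}.jpg")
    if frame is None:
        continue  # frame image missing; skip
    for track_id, x, y, w, h in boxes:
        # Stable pseudo-random color per track id.
        color = ((37 * track_id) % 256, (97 * track_id) % 256, (17 * track_id) % 256)
        cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
        cv2.putText(frame, f"ID {track_id}", (x, max(y - 5, 10)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 1)
    cv2.imwrite(f"vis/frame_{frame_id:04d}.jpg", frame)
```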
TransVTSpotter is released under the MIT License.
If you use TransVTSpotter in your research or wish to refer to the baseline results published here, please use the following BibTeX entry:
```
@article{wu2021opentext,
  title={A Bilingual, OpenWorld Video Text Dataset and End-to-end Video Text Spotter with Transformer},
  author={Weijia Wu and Debing Zhang and Yuanqiang Cai and Sibo Wang and Jiahong Li and Zhuang Li and Yejun Tang and Hong Zhou},
  journal={35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks},
  year={2021}
}
```