Ali Athar, Alexander Hermans, Jonathon Luiten, Deva Ramanan, Bastian Leibe
Qualitative results (demo videos): Video Instance Segmentation | Video Panoptic Segmentation | Video Object Segmentation | Point Exemplar-guided Tracking
30.04.2023: Complete code and pre-trained checkpoints uploaded.
The general domain of video segmentation is currently fragmented into different tasks spanning multiple benchmarks. Despite rapid progress in the state-of-the-art, current methods are overwhelmingly task-specific and cannot conceptually generalize to other tasks. Inspired by recent approaches with multi-task capability, we propose TarViS: a novel, unified network architecture that can be applied to any task that requires segmenting a set of arbitrarily defined ‘targets’ in video. Our approach is flexible with respect to how tasks define these targets, since it models the latter as abstract ‘queries’ which are then used to predict pixel-precise target masks. A single TarViS model can be trained jointly on a collection of datasets spanning different tasks, and can hot-swap between tasks during inference without any task-specific retraining. To demonstrate its effectiveness, we apply TarViS to four different tasks, namely Video Instance Segmentation (VIS), Video Panoptic Segmentation (VPS), Video Object Segmentation (VOS) and Point Exemplar-guided Tracking (PET). Our unified, jointly trained model achieves state-of-the-art performance on 5/7 benchmarks spanning these four tasks, and competitive performance on the remaining two.
Build the deformable attention CUDA kernels as follows:
cd tarvis/modelling/backbone/temporal_neck/ops
bash make.sh
If you already have the kernels installed from Mask2Former then you can skip this step since they're identical.
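As an optional sanity check (our suggestion, not part of the official instructions), you can verify that the compiled extension imports correctly. The module name below is an assumption based on the Mask2Former ops package; adjust it if your build names it differently:
# hypothetical check; assumes the ops build a module named MultiScaleDeformableAttention
python -c "import MultiScaleDeformableAttention; print('deformable attention kernels OK')"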
For managing datasets, checkpoints and pretrained backbones, we use a single environment variable $TARVIS_WORKSPACE_DIR, which points to a directory that is organized as follows:
├── $TARVIS_WORKSPACE_DIR
│ ├── checkpoints <- model weights saved here during training
│ ├── pretrained_backbones <- ImageNet pretrained Swin backbone weights
│ ├── dataset_images <- Images/videos for all datasets go here
| | ├── training
| | | ├── ade20k
| | | | ├── ADE_train_00000001.jpg
| | | | ├── ...
| | | ├── cityscapes
| | | | ├── aachen
| | | | ├── ...
| | | ├── coco
| | | | ├── 000000000009.jpg
| | | | ├── ...
| | | ├── mapillary
| | | | ├── 0035fkbjWljhaftpVM37-g.jpg
| | | | ├── ...
| | | ├── cityscapes_vps
| | | | ├── 0001_0001_frankfurt_000000_000279_newImg8bit.png
| | | | ├── ...
| | | ├── kitti_step
| | | | ├── 0000
| | | | ├── ...
| | | ├── vipseg
| | | | ├── 0_wHveSGjXyDY
| | | | ├── ...
| | | ├── youtube_vis_2021
| | | | ├── 3245e049fb
| | | | ├── ...
| | | ├── ovis
| | | | ├── 001ca3cb
| | | | ├── ...
| | | ├── davis
| | | | ├── bear
| | | | ├── ...
| | | ├── burst
| | | | ├── YFCC100M
| | | | ├── ...
| | ├── inference
| | | ├── cityscapes_vps_val <- same directory structure as training
| | | ├── kitti_step_val
| | | ├── vipseg
| | | ├── youtube_vis_2021
| | | ├── ovis
| | | ├── davis
| | | ├── burst
| | | | ├── val
| | | | | ├── YFCC100M
| | | | | ├── ...
| | | | ├── test
| | | | | ├── YFCC100M
| | | | | ├── ...
| ├── dataset_annotations <- Annotations for all datasets go here
| | ├── training
| | | ├── ade20k_panoptic
| | | | ├── pan_maps
| | | | | ├── ADE_train_00000001.png
| | | | ├── segments.json
| | | ├── cityscapes_panoptic
| | | | ├── pan_maps
| | | | | ├── aachen_000000_000019_gtFine_panoptic.png
| | | | | ├── ...
| | | | ├── segments.json
| | | ├── coco_panoptic
| | | | ├── pan_maps
| | | | | ├── 000000000009.png
| | | | | ├── ...
| | | | ├── segments.json
| | | ├── mapillary_panoptic
| | | | ├── pan_maps
| | | | | ├── 0035fkbjWljhaftpVM37-g.png
| | | | | ├── ...
| | | | ├── segments.json
| | | ├── cityscapes_vps.json
| | | ├── kitti_step.json
| | | ├── vipseg
| | | | ├── panoptic_masks
| | | | | ├── 0_wHveSGjXyDY
| | | | | ├── ...
| | | | ├── video_info.json
| | | ├── youtube_vis_2021.json
| | | ├── ovis.json
| | | ├── davis_semisupervised.json
| | | ├── burst.json
| | ├── inference
| | | ├── cityscapes_vps
| | | | ├── im_all_info_val_city_vps.json
| | | ├── vipseg
| | | | ├── val.json
| | | ├── youtube_vis
| | | | ├── valid_2021.json
| | | ├── ovis
| | | | ├── valid.json
| | | ├── davis
| | | | ├── Annotations
| | | | ├── ImageSet_val.txt
| | | | ├── ImageSet_testdev.txt
| | | ├── burst
| | | | ├── first_frame_annotations_val.json
| | | | ├── first_frame_annotations_test.json
You do not need to set up all the datasets if you only want to train/infer on a subset of them. For training the full model, however, you need the complete directory tree above. All in-code paths are provided by tarvis/utils/paths.py, so any changes can easily be made there.
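As a minimal setup sketch (the shell commands below are just one way to create this layout; only the environment variable and the directory names from the tree above are prescribed):
# point TARVIS_WORKSPACE_DIR at your workspace and create the top-level directories
export TARVIS_WORKSPACE_DIR=/path/to/tarvis_workspace
mkdir -p $TARVIS_WORKSPACE_DIR/checkpoints $TARVIS_WORKSPACE_DIR/pretrained_backbones
mkdir -p $TARVIS_WORKSPACE_DIR/dataset_images $TARVIS_WORKSPACE_DIR/dataset_annotations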
To make the setup easier, you can download a partially complete workspace directory from HERE. It does not include any image files, but it does contain all the JSON format annotation files. For some datasets, e.g. DAVIS, we converted the dataset annotations into a custom format to make it easier to re-use data loading code.
Important: The license agreement for this repository does not apply to the annotations. Please refer to the license agreements for the respective datasets if you wish to use these annotations.
Download links to trained checkpoints are available below. Each zipped directory contains a checkpoint file ending with .pth and a config.yaml file. We also provide checkpoints for the models after the pre-training step on augmented image sequences.
Backbone | Pretrain (augmented images) | Finetune (video) |
---|---|---|
ResNet-50 | URL | URL |
Swin-T | URL | URL |
Swin-L | URL | URL |
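As a sketch of how a downloaded checkpoint could be placed in the workspace (the archive and directory names below are hypothetical; the only requirement is that the .pth file and config.yaml end up in the same directory):
# hypothetical archive name; extract so the checkpoint and its config.yaml sit together
unzip tarvis_swin-large_finetune.zip -d $TARVIS_WORKSPACE_DIR/checkpoints/tarvis_swin-large_finetune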
Run the following command with the desired dataset. The checkpoint path can be any of the finetuned models from the download links above, or models that you train yourself. It is important that config.yaml is also present in the same directory as the checkpoint .pth file.
python tarvis/inference/main.py /path/to/checkpoint.pth --dataset {YOUTUBE_VIS,OVIS,CITYSCAPES_VPS,KITTI_STEP,DAVIS,BURST} --amp --output_dir /path/to/output/directory
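For example, to run inference on OVIS with mixed precision (the checkpoint and output paths below are hypothetical):
python tarvis/inference/main.py $TARVIS_WORKSPACE_DIR/checkpoints/tarvis_swin-large_finetune/model.pth --dataset OVIS --amp --output_dir $TARVIS_WORKSPACE_DIR/inference_output/ovis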
Run python tarvis/inference/main.py --help to see additional options.
First run the pretraining on augmented image datasets (COCO, ADE20k, Cityscapes, Mapillary). The following command does this assuming 8 nodes, each containing 4 GPUs:
torchrun --nnodes=8 --nproc_per_node=4 --rdzv_id=22021994 --rdzv_backend=c10d --rdzv_endpoint ${DDP_HOST_ADDRESS} tarvis/training/main.py --model_dir my_first_tarvis_pretraining --cfg pretrain_{resnet50_swin-tiny,swin-large}.yaml --amp --cfg.MODEL.BACKBONE {ResNet50,SwinTiny,SwinLarge}
The --model_dir argument is interpreted relative to ${TARVIS_WORKSPACE_DIR}/checkpoints. You can also give an absolute path here.
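For concreteness, the braces select one of two config files together with the matching backbone name; e.g. for the Swin-L backbone the command becomes (assuming pretrain_swin-large.yaml is the config that pairs with SwinLarge):
torchrun --nnodes=8 --nproc_per_node=4 --rdzv_id=22021994 --rdzv_backend=c10d --rdzv_endpoint ${DDP_HOST_ADDRESS} tarvis/training/main.py --model_dir my_first_tarvis_pretraining --cfg pretrain_swin-large.yaml --amp --cfg.MODEL.BACKBONE SwinLarge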
Then run the finetuning on video data (YouTube-VIS, OVIS, KITTI-STEP, Cityscapes-VPS, DAVIS, BURST):
torchrun --nnodes=8 --nproc_per_node=4 --rdzv_id=22021994 --rdzv_backend=c10d --rdzv_endpoint ${DDP_HOST_ADDRESS} tarvis/training/main.py --model_dir my_first_tarvis_finetuning --cfg finetune_{resnet50_swin-tiny,swin-large}.yaml --amp --finetune_from /path/to/my_first_tarvis_pretraining
For this, the --finetune_from argument should point to the pretraining directory (not the checkpoint file). It will automatically load the latest checkpoint from this directory.
- To run on a single GPU, omit torchrun and its arguments and just run the script normally: python tarvis/training/main.py ...
- The training code applies gradient accumulation, so the batch size specified in the config (32 by default) is maintained regardless of how many GPUs are used, e.g. if you use 8 GPUs, the gradients for 4 consecutive iterations are accumulated before running optimizer.step(). The only constraint is that the batch size must be exactly divisible by the number of GPUs.
- Training can be resumed from a checkpoint by running the training script as follows:
torchrun --nnodes=8 --nproc_per_node=4 --rdzv_id=22021994 --rdzv_backend=c10d --rdzv_endpoint ${DDP_HOST_ADDRESS} tarvis/training/main.py --restore_session /path/to/latest/checkpoint.pth
- The data loading code uses OpenCV, which sometimes behaves strangely on compute clusters by spawning too many worker threads. If you experience slow data loading times, try setting OMP_NUM_THREADS=1 and running the training script with --cv2_num_threads=2 (see the example after this list).
- Training metrics are logged with TensorBoard by default, but logging with Weights & Biases is also supported by providing the --wandb_session option.
- With the default settings, each GPU should have at least 24 GB of VRAM to prevent OOM errors. If you want to train on smaller GPUs, consider reducing the input image resolution.
- Run python tarvis/training/main.py --help to see additional options.
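A sketch of the OpenCV workaround mentioned above, using the finetuning command from before (the config name is expanded from the braces above; only the environment variable and the --cv2_num_threads flag are added):
OMP_NUM_THREADS=1 torchrun --nnodes=8 --nproc_per_node=4 --rdzv_id=22021994 --rdzv_backend=c10d --rdzv_endpoint ${DDP_HOST_ADDRESS} tarvis/training/main.py --model_dir my_first_tarvis_finetuning --cfg finetune_swin-large.yaml --amp --finetune_from /path/to/my_first_tarvis_pretraining --cv2_num_threads=2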
@inproceedings{athar2023tarvis,
title={TarViS: A Unified Architecture for Target-based Video Segmentation},
author={Athar, Ali and Hermans, Alexander and Luiten, Jonathon and Ramanan, Deva and Leibe, Bastian},
booktitle={CVPR},
year={2023}
}