Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video

This repo contains the main baselines of VID-sentence dataset introduced in WSSTG. Please refer to our paper and the repo for the information of VID-sentence dataset.

Task

Description: "A brown and white dog is lying on the grass and then it stands up."

The proposed WSSTG task aims to localize a spatio-temporal tube (i.e., the sequence of green bounding boxes) in the video which semantically corresponds to the given sentence, with no reliance on any spatio-temporal annotations during training.

Architecture

The architecture of the proposed method.

Requirements: software

Pytorch (version=0.4.0)
python 2.7
numpy
scipy
magic
easydict
dill
matplotlib
tensorboardX

Installation

Clone the WSSTG repository and VID-sentence reposity

    git clone https://github.com/JeffCHEN2017/WSSTG.git
    git clone https://github.com/JeffCHEN2017/VID-Sentence.git
    ln -s VID-sentence_ROOT/data/ILSVRC WSSTG_ROOT/data

Download tube proposals, RGB feature and I3D feature from Google Drive.
Extract *.tar files and make symlinks between the download data and the desired data folder

tar xvf tubePrp.tar
ln -s tubePrp $WSSTG_ROOT/data/tubePrp

tar xvf vid_i3d.tar vid_i3d
ln -s vid_i3d $WSSTG_ROOT/data/vid_i3d
ln -s $WSSTG_ROOT/data/vid_i3d/val test

tar xvf vid_rgb.tar vid_rgb
ln -s vid_rgb $WSSTG_ROOT/data/vid_rgb
ln -s $WSSTG_ROOT/data/vid_rgb/vidTubeCacheFtr/val test

Note: We extract the tube proposals using the method proposed by Gkioxari and Malik .A python implementation here is provided by Yamaguchi etal.. We extract singel-frame propsoals and RGB feature for each frame using a faster-RCNN model pretrained on COCO dataset, which is provided by Jianwei Yang. We extract I3D-RGB and I3D-flow features using the model provided by Carreira and Zisserman.

Training

cd $WSSTG_ROOT
sh scripts/train_video_emb_att.sh

Notice: Because the changes of batch sizes and the random seed, the performance may be slightly different from our submission. We provide a checkpoint here which achieves similar performance (38.1 VS 38.2 on the [email protected] ) to the model we reported in the paper.

Testing

Download the checkpoint from Google Drive, put it in WSSTG_ROOT/data/models and run

cd $WSSTG_ROOT
sh scripts/test_video_emb_att.sh

License

WSSTG is released under the CC-BY-NC 4.0 LICENSE (refer to the LICENSE file for details).

Citing WSSTG

If you find this repo useful in your research, please consider citing:

@inproceedings{chen2019weakly,
    Title={Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video},
    Author={Chen, Zhenfang and Ma, Lin and Luo, Wenhan and Wong, Kwan-Yee~K},
    Booktitle={ACL},
    year={2019}
}

Contact

You can contact Zhenfang Chen by sending email to [email protected]

Name		Name	Last commit message	Last commit date
Latest commit History 81 Commits
annotations		annotations
data		data
fun		fun
images		images
scripts		scripts
util		util
LICENSE		LICENSE
README.md		README.md
readme.md		readme.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video

Task

Architecture

Contents

Requirements: software

Installation

Training

Testing

License

Citing WSSTG

Contact

About

Releases

Packages

Languages

License

zfchenUnique/WSSTG

Folders and files

Latest commit

History

Repository files navigation

Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video

Task

Architecture

Contents

Requirements: software

Installation

Training

Testing

License

Citing WSSTG

Contact

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages