[CVPR 2022 Oral] TubeDETR: Spatio-Temporal Video Grounding with Transformers

Overview

Website | STVG Demo | Paper


This repository provides the code for our paper. This includes:

  • Software setup, data downloading and preprocessing instructions for the VidSTG, HC-STVG1 and HC-STVG2.0 datasets
  • Training scripts and pretrained checkpoints
  • Evaluation scripts and demo

Setup

Download FFmpeg and add it to the PATH environment variable. The code was tested with the ffmpeg-4.2.2-amd64-static build. Then create a conda environment and install the requirements with the following commands:

conda create -n tubedetr_env python=3.8
conda activate tubedetr_env
pip install -r requirements.txt
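
For reference, a minimal way to expose a static FFmpeg build on the PATH (the extraction path below is illustrative and depends on where you unpacked the archive):

# assuming the archive was unpacked to ~/ffmpeg-4.2.2-amd64-static
export PATH=$HOME/ffmpeg-4.2.2-amd64-static:$PATH
ffmpeg -version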

Data Downloading

Set up the paths where you are going to download videos and annotations in the config JSON files.
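
For instance, the relevant entries in config/vidstg.json would look along these lines (a sketch: only the path keys referenced below are shown, and the values are illustrative):

{
  "vidstg_vid_path": "/path/to/vidstg/videos",
  "vidstg_ann_path": "/path/to/vidstg/annotations"
}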

VidSTG: Download VidOR videos and annotations from the VidOR dataset providers. Then download the VidSTG annotations from the VidSTG dataset providers. The vidstg_vid_path folder should contain a video subfolder holding the unzipped video folders. The vidstg_ann_path folder should contain both VidOR and VidSTG annotations.

HC-STVG: Download HC-STVG1 and HC-STVG2.0 videos and annotations from the HC-STVG dataset providers. The hcstvg_vid_path folder should contain a video subfolder holding the unzipped video folders. The hcstvg_ann_path folder should contain both HC-STVG1 and HC-STVG2.0 annotations.
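
Based on the instructions above, the expected layout is roughly:

vidstg_vid_path/
  video/           # unzipped VidOR video folders
vidstg_ann_path/   # VidOR and VidSTG annotation files
hcstvg_vid_path/
  video/           # unzipped HC-STVG video folders
hcstvg_ann_path/   # HC-STVG1 and HC-STVG2.0 annotation files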

Data Preprocessing

To preprocess annotation files, run:

python preproc/preproc_vidstg.py
python preproc/preproc_hcstvg.py
python preproc/preproc_hcstvgv2.py

Training

Download pretrained RoBERTa tokenizer and model weights in the TRANSFORMERS_CACHE folder. Download pretrained ResNet-101 model weights in the TORCH_HOME folder. Download MDETR pretrained model weights with ResNet-101 backbone in the current folder.
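
As a sketch, the RoBERTa and ResNet-101 weights can be pre-downloaded to the expected cache locations with standard transformers and torchvision calls (roberta-base is an assumption for the exact RoBERTa variant, and the cache paths are illustrative):

export TRANSFORMERS_CACHE=/path/to/transformers_cache
export TORCH_HOME=/path/to/torch_home
# caches the RoBERTa tokenizer and model weights into TRANSFORMERS_CACHE
python -c "from transformers import RobertaModel, RobertaTokenizerFast; RobertaModel.from_pretrained('roberta-base'); RobertaTokenizerFast.from_pretrained('roberta-base')"
# caches the ResNet-101 weights into TORCH_HOME
python -c "import torchvision; torchvision.models.resnet101(pretrained=True)"

The MDETR checkpoint with ResNet-101 backbone (pretrained_resnet101_checkpoint.pth) is distributed via the MDETR repository.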

VidSTG: To train on VidSTG, run:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=vidstg --combine_datasets_val=vidstg \
--dataset_config config/vidstg.json --output-dir=OUTPUT_DIR

HC-STVG2.0: To train on HC-STVG2.0, run:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=hcstvg --combine_datasets_val=hcstvg \
--v2 --dataset_config config/hcstvg.json --epochs=20 --output-dir=OUTPUT_DIR

HC-STVG1: To train on HC-STVG1, run:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=hcstvg --combine_datasets_val=hcstvg \
--dataset_config config/hcstvg.json --epochs=40 --eval_skip=40 --output-dir=OUTPUT_DIR

Baselines

  • To remove time encoding, add --no_time_embed.
  • To remove the temporal self-attention in the space-time decoder, add --no_tsa.
  • To train from ImageNet initialization, pass an empty string to the --load argument and add --sted_loss_coef=5 --lr=2e-5 --text_encoder_lr=2e-5 --epochs=20 --lr_drop=20 for VidSTG, or --epochs=60 --lr_drop=60 for HC-STVG1 (see the example command after this list).
  • To train with a randomly initialized temporal self-attention, add --rd_init_tsa.
  • To train with a different spatial resolution (e.g. res=224) or temporal stride (e.g. k=5), add --resolution=224 or --stride=5.
  • To train with the slow-only variant, add --no_fast.
  • To train with alternative designs for the fast branch, add --fast=VARIANT.
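
For example, training from ImageNet initialization on VidSTG combines the flags above as follows (a sketch; NUM_GPUS and OUTPUT_DIR are placeholders as in the commands above):

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load="" --combine_datasets=vidstg --combine_datasets_val=vidstg \
--dataset_config config/vidstg.json --sted_loss_coef=5 --lr=2e-5 --text_encoder_lr=2e-5 \
--epochs=20 --lr_drop=20 --output-dir=OUTPUT_DIR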

Available Checkpoints

Training data             | Parameters   | URL   | Size
MDETR init + VidSTG       | k=4, res=352 | Drive | 3.0GB
MDETR init + VidSTG       | k=2, res=224 | Drive | 3.0GB
ImageNet init + VidSTG    | k=4, res=352 | Drive | 3.0GB
MDETR init + HC-STVG2.0   | k=4, res=352 | Drive | 3.0GB
MDETR init + HC-STVG2.0   | k=2, res=224 | Drive | 3.0GB
MDETR init + HC-STVG1     | k=4, res=352 | Drive | 3.0GB
ImageNet init + HC-STVG1  | k=4, res=352 | Drive | 3.0GB

Evaluation

For evaluation only, simply run the same commands as for training with --resume=CHECKPOINT --eval. To evaluate on the test set, additionally add --test (in this case, predictions and attention weights are also saved).
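
For example, to evaluate a VidSTG checkpoint on the test set (a sketch derived from the training command above; CHECKPOINT and OUTPUT_DIR are placeholders):

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--combine_datasets=vidstg --combine_datasets_val=vidstg --dataset_config config/vidstg.json \
--resume=CHECKPOINT --eval --test --output-dir=OUTPUT_DIR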

Spatio-Temporal Video Grounding Demo

You can also use a pretrained model to infer a spatio-temporal tube on a video of your choice (VIDEO_PATH, optionally restricted to START and END timestamps) given a natural language query of your choice (CAPTION), with the following command:

python demo_stvg.py --load=CHECKPOINT --caption_example CAPTION --video_example VIDEO_PATH --start_example=START --end_example=END --output-dir OUTPUT_PATH
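
For instance, an illustrative invocation (all example values below are hypothetical):

python demo_stvg.py --load=CHECKPOINT --caption_example "a man walking a dog" --video_example my_video.mp4 --start_example=2 --end_example=8 --output-dir demo_output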

Note that we also host an online demo at this link; its code is available in server_stvg.py and server_stvg.html.

Acknowledgements

This codebase is built on the MDETR codebase. The code for video spatial data augmentation is inspired by torch_videovision.

Citation

If you found this work useful, consider giving this repository a star and citing our paper as follows:

@inproceedings{yang2022tubedetr,
  title={TubeDETR: Spatio-Temporal Video Grounding with Transformers},
  author={Yang, Antoine and Miech, Antoine and Sivic, Josef and Laptev, Ivan and Schmid, Cordelia},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}