ReferFormer - Official Implementation of ReferFormer

Overview

License Framework

PWC PWC

The official implementation of the paper:

Language as Queries for Referring
Video Object Segmentation

Language as Queries for Referring Video Object Segmentation

Jiannan Wu, Yi Jiang, Peize Sun, Zehuan Yuan, Ping Luo

Abstract

In this work, we propose a simple and unified framework built upon Transformer, termed ReferFormer. It views the language as queries and directly attends to the most relevant regions in the video frames. Concretely, we introduce a small set of object queries conditioned on the language as the input to the Transformer. In this manner, all the queries are obligated to find the referred objects only. They are eventually transformed into dynamic kernels which capture the crucial object-level information, and play the role of convolution filters to generate the segmentation masks from feature maps. The object tracking is achieved naturally by linking the corresponding queries across frames. This mechanism greatly simplifies the pipeline and the end-to-end framework is significantly different from the previous methods. Extensive experiments on Ref-Youtube-VOS, Ref-DAVIS17, A2D-Sentences and JHMDB-Sentences show the effectiveness of ReferFormer.

Requirements

We test the codes in the following environments, other versions may also be compatible:

  • CUDA 11.1
  • Python 3.7
  • Pytorch 1.8.1

Installation

Please refer to install.md for installation.

Data Preparation

Please refer to data.md for data preparation.

We provide the pretrained model for different visual backbones. You may download them here and put them in the directory pretrained_weights.

After the organization, we expect the directory struture to be the following:

ReferFormer/
├── data/
│   ├── ref-youtube-vos/
│   ├── ref-davis/
│   ├── a2d_sentences/
│   ├── jhmdb_sentences/
├── davis2017/
├── datasets/
├── models/
├── scipts/
├── tools/
├── util/
├── pretrained_weights/
├── eval_davis.py
├── main.py
├── engine.py
├── inference_ytvos.py
├── inference_davis.py
├── opts.py
...

Model Zoo

All the models are trained using 8 NVIDIA Tesla V100 GPU. You may change the --backbone parameter to use different backbones (see here).

Note: If you encounter the OOM error, please add the command --use_checkpoint (we add this command for Swin-L, Video-Swin-S and Video-Swin-B models).

Ref-Youtube-VOS

To evaluate the results, please upload the zip file to the competition server.

Backbone J&F CFBI J&F Pretrain Model Submission CFBI Submission
ResNet-50 55.6 59.4 weight model link link
ResNet-101 57.3 60.3 weight model link link
Swin-T 58.7 61.2 weight model link link
Swin-L 62.4 63.3 weight model link link
Video-Swin-T* 55.8 - - model link -
Video-Swin-T 59.4 - weight model link -
Video-Swin-S 60.1 - weight model link -
Video-Swin-B 62.9 - weight model link -

* indicates the model is trained from scratch.

Ref-DAVIS17

As described in the paper, we report the results using the model trained on Ref-Youtube-VOS without finetune.

Backbone J&F J F Model
ResNet-50 58.5 55.8 61.3 model
Swin-L 60.5 57.6 63.4 model
Video-Swin-B 61.1 58.1 64.1 model

A2D-Sentences

The pretrained models are the same as those provided for Ref-Youtube-VOS.

Backbone Overall IoU Mean IoU mAP Pretrain Model
Video-Swin-T 77.6 69.6 52.8 weight model | log
Video-Swin-S 77.7 69.8 53.9 weight model | log
Video-Swin-B 78.6 70.3 55.0 weight model | log

JHMDB-Sentences

As described in the paper, we report the results using the model trained on A2D-Sentences without finetune.

Backbone Overall IoU Mean IoU mAP Model
Video-Swin-T 71.9 71.0 42.2 model
Video-Swin-S 72.8 71.5 42.4 model
Video-Swin-B 73.0 71.8 43.7 model

Get Started

Please see Ref-Youtube-VOS, Ref-DAVIS17, A2D-Sentences and JHMDB-Sentences for details.

Acknowledgement

This repo is based on Deformable DETR and VisTR. We also refer to the repositories MDETR and MTTR. Thanks for their wonderful works.

Citation

@article{wu2022referformer,
      title={Language as Queries for Referring Video Object Segmentation}, 
      author={Jiannan Wu and Yi Jiang and Peize Sun and Zehuan Yuan and Ping Luo},
      journal={arXiv preprint arXiv:2201.00487},
      year={2022},
}
Owner
Jonas Wu
The University of Hong Kong. PhD Candidate. Computer Vision.
Jonas Wu
A list of multi-task learning papers and projects.

This page contains a list of papers on multi-task learning for computer vision. Please create a pull request if you wish to add anything. If you are interested, consider reading our recent survey pap

svandenh 297 Dec 17, 2022
[CVPR2021 Oral] End-to-End Video Instance Segmentation with Transformers

VisTR: End-to-End Video Instance Segmentation with Transformers This is the official implementation of the VisTR paper: Installation We provide instru

Yuqing Wang 687 Jan 07, 2023
Companion repository to the paper accepted at the 4th ACM SIGSPATIAL International Workshop on Advances in Resilient and Intelligent Cities

Transfer learning approach to bicycle sharing systems station location planning using OpenStreetMap Companion repository to the paper accepted at the

Politechnika Wrocławska - repozytorium dla informatyków 4 Oct 24, 2022
Multiband spectro-radiometric satellite image analysis with K-means cluster algorithm

Multi-band Spectro Radiomertric Image Analysis with K-means Cluster Algorithm Overview Multi-band Spectro Radiomertric images are images comprising of

Chibueze Henry 6 Mar 16, 2022
Generalized Data Weighting via Class-level Gradient Manipulation

Generalized Data Weighting via Class-level Gradient Manipulation This repository is the official implementation of Generalized Data Weighting via Clas

18 Nov 12, 2022
Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

Text-AutoAugment (TAA) This repository contains the code for our paper Text AutoAugment: Learning Compositional Augmentation Policy for Text Classific

LancoPKU 105 Jan 03, 2023
Vertex AI: Serverless framework for MLOPs (ESP / ENG)

Vertex AI: Serverless framework for MLOPs (ESP / ENG) Español Qué es esto? Este repo contiene un pipeline end to end diseñado usando el SDK de Kubeflo

Hernán Escudero 2 Apr 28, 2022
Graph Regularized Residual Subspace Clustering Network for hyperspectral image clustering

Graph Regularized Residual Subspace Clustering Network for hyperspectral image clustering

Yaoming Cai 5 Jul 18, 2022
Code for EMNLP2021 paper "Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training"

VoCapXLM Code for EMNLP2021 paper Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training Environment DockerFile: dancingso

Bo Zheng 15 Jul 28, 2022
Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data

Real-ESRGAN Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data Ported from https://github.com/xinntao/Real-ESRGAN Depend

Holy Wu 44 Dec 27, 2022
Lyapunov-guided Deep Reinforcement Learning for Stable Online Computation Offloading in Mobile-Edge Computing Networks

PyTorch code to reproduce LyDROO algorithm [1], which is an online computation offloading algorithm to maximize the network data processing capability subject to the long-term data queue stability an

Liang HUANG 87 Dec 28, 2022
Python Environment for Bayesian Learning

Pebl is a python library and command line application for learning the structure of a Bayesian network given prior knowledge and observations. Pebl in

Abhik Shah 103 Jul 14, 2022
source code of “Visual Saliency Transformer” (ICCV2021)

Visual Saliency Transformer (VST) source code for our ICCV 2021 paper “Visual Saliency Transformer” by Nian Liu, Ni Zhang, Kaiyuan Wan, Junwei Han, an

89 Dec 21, 2022
The devkit of the nuScenes dataset.

nuScenes devkit Welcome to the devkit of the nuScenes and nuImages datasets. Overview Changelog Devkit setup nuImages nuImages setup Getting started w

Motional 1.6k Jan 05, 2023
Flask101 - FullStack Web Development with Python & JS - From TAQWA

Task: Create a CLI Calculator Step 0: Creating Virtual Environment $ python -m

Hossain Foysal 1 May 31, 2022
[CVPR 2022] Official PyTorch Implementation for "Reference-based Video Super-Resolution Using Multi-Camera Video Triplets"

Reference-based Video Super-Resolution (RefVSR) Official PyTorch Implementation of the CVPR 2022 Paper Project | arXiv | RealMCVSR Dataset This repo c

Junyong Lee 151 Dec 30, 2022
DropNAS: Grouped Operation Dropout for Differentiable Architecture Search

DropNAS: Grouped Operation Dropout for Differentiable Architecture Search DropNAS, a grouped operation dropout method for one-level DARTS, with better

weijunhong 4 Aug 15, 2022
Audio Source Separation is the process of separating a mixture into isolated sounds from individual sources

Audio Source Separation is the process of separating a mixture into isolated sounds from individual sources (e.g. just the lead vocals).

Victor Basu 14 Nov 07, 2022
The trained model and denoising example for paper : Cardiopulmonary Auscultation Enhancement with a Two-Stage Noise Cancellation Approach

The trained model and denoising example for paper : Cardiopulmonary Auscultation Enhancement with a Two-Stage Noise Cancellation Approach

ycj_project 1 Jan 18, 2022
CAMoE + Dual SoftMax Loss (DSL): Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss

CAMoE + Dual SoftMax Loss (DSL): Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss This is official implement of "

程星 87 Dec 24, 2022