[CVPR2022] Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos

Last update: Dec 23, 2022

Created by Muheng Li, Lei Chen, Yueqi Duan, Zhilan Hu, Jianjiang Feng, Jie Zhou, Jiwen Lu

This repository contains the PyTorch implementation of Bridge-Prompt (CVPR 2022).

We propose a prompt-based framework, Bridge-Prompt (Br-Prompt), to model the semantics across multiple adjacent correlated actions, so that it simultaneously exploits both out-of-context and contextual information from a series of ordinal actions in instructional videos. More specifically, we reformulate the individual action labels as integrated text prompts for supervision, which bridge the gap between individual action semantics. The generated text prompts are paired with corresponding video clips, and together co-train the text encoder and the video encoder via a contrastive approach. The learned vision encoder has a stronger capability for ordinal-action-related downstream tasks, e.g. action segmentation and human activity recognition.
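As a rough illustration of how individual action labels might be reformulated into integrated text prompts, the sketch below builds one prompt per action plus one prompt for the whole clip. The function name `build_prompts` and the sentence templates are assumptions made for this example, not the repository's exact wording.

```python
# Illustrative only: these templates are assumptions, not the repository's exact prompts.
ORDINALS = ["first", "second", "third", "fourth", "fifth"]
ADVERBS = ["Firstly", "Secondly", "Thirdly", "Fourthly", "Fifthly"]


def build_prompts(actions):
    """Turn an ordered list of action labels into per-action and integrated prompts."""
    if not 1 <= len(actions) <= len(ORDINALS):
        raise ValueError(f"expected 1 to {len(ORDINALS)} actions, got {len(actions)}")
    # One prompt per action, stating its position in the sequence.
    per_action = [
        f"This is the {ORDINALS[i]} action in the video. {ADVERBS[i]}, the person is {label}."
        for i, label in enumerate(actions)
    ]
    # One prompt for the whole clip, joining the actions in order.
    steps = " ".join(
        f"{ADVERBS[i]}, the person is {label}." for i, label in enumerate(actions)
    )
    integrated = f"This video contains {len(actions)} actions in total. {steps}"
    return per_action, integrated


per_action, integrated = build_prompts(["taking a cup", "pouring coffee", "pouring milk"])
print(integrated)
```

In a contrastive setup, each such prompt would be paired with the video clip it describes, so that matching clip and text embeddings are pulled together.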

Our code is based on CLIP and ActionCLIP.

Prerequisites

Requirements

PyTorch >= 1.8
wandb
dotmap
yaml
pprint
tqdm
RandAugment

You may need ffmpeg for video data pre-processing.

The environment is also recorded in requirements.txt and can be reproduced with:

pip install -r requirements.txt
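After installation, a quick check like the one below may help confirm that the listed packages and ffmpeg are visible. This is a minimal sketch: the import names in `MODULES` are assumptions (a pip package name can differ from its import name), and the helper names are hypothetical.

```python
import importlib.util
import shutil

# Import names are assumptions; adjust them if your installed packages differ.
MODULES = ["torch", "wandb", "dotmap", "yaml", "pprint", "tqdm", "RandAugment"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def has_ffmpeg():
    """Return True if an ffmpeg executable is on PATH."""
    return shutil.which("ffmpeg") is not None


print("missing modules:", missing_modules(MODULES) or "none")
print("ffmpeg found:", has_ffmpeg())
```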

Pretrained models

We use the base model (ViT-B/16 for the image encoder and text encoder) pre-trained by ActionCLIP on Kinetics-400. The model can be downloaded from link (pwd: ilgw). Save the pre-trained model in ./models/.

Datasets

Raw video files are needed to train our framework. Please download the datasets with RGB videos from the official websites (Breakfast / GTEA / 50Salads) and save them under the folder ./data/(name_dataset). For convenience, we use frames extracted from the raw RGB videos as inputs. You can extract the frames from the raw RGB videos by running:

python preprocess/get_frames.py --dataset (name_dataset) --vpath (folder_to_your_videos) --fpath ./data/(name_dataset)/frames/

Note that ffmpeg is required for frame extraction.
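For readers unfamiliar with ffmpeg, the sketch below shows the kind of command a frame extraction step typically issues for one video. It is not the code in get_frames.py: the helper name, the one-folder-per-video layout, the `img_%05d.jpg` naming pattern, and the example video filename are all assumptions.

```python
from pathlib import Path


def ffmpeg_frame_command(video_path, frame_root, pattern="img_%05d.jpg"):
    """Build an ffmpeg command that writes the frames of one video as numbered JPEGs."""
    video_path = Path(video_path)
    # Assumed layout: one output folder per video, named after the file without extension.
    out_dir = Path(frame_root) / video_path.stem
    return ["ffmpeg", "-i", str(video_path), str(out_dir / pattern)]


# Hypothetical video filename, used only to show the resulting command.
cmd = ffmpeg_frame_command("videos/P03_cam01_cereals.avi", "./data/breakfast/frames/")
print(" ".join(cmd))
```

To actually run it, create the output folder first and pass the list to `subprocess.run(cmd, check=True)`.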

Please also extract the .zip files for each dataset into the corresponding ./data/(name_dataset) folder.

Training

To train Bridge-Prompt on Breakfast from Kinetics400 pretrained models, you can run:

bash scripts/run_train.sh  ./configs/breakfast/breakfast_ft.yaml

To train Bridge-Prompt on GTEA from Kinetics400 pretrained models, you can run:

bash scripts/run_train.sh  ./configs/gtea/gtea_ft.yaml

To train Bridge-Prompt on 50Salads from Kinetics400 pretrained models, you can run:

bash scripts/run_train.sh  ./configs/salads/salads_ft.yaml
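If you launch training from a script, the three commands above can be collected in a small lookup. This is only a convenience sketch; the dictionary keys and the helper name `train_command` are hypothetical, while the config paths are copied from the commands above.

```python
# Config paths come from the commands above; note that 50Salads uses the "salads" folder.
TRAIN_CONFIGS = {
    "breakfast": "./configs/breakfast/breakfast_ft.yaml",
    "gtea": "./configs/gtea/gtea_ft.yaml",
    "50salads": "./configs/salads/salads_ft.yaml",
}


def train_command(dataset):
    """Return the training command for a dataset as an argument list."""
    key = dataset.lower()
    if key not in TRAIN_CONFIGS:
        raise ValueError(f"unknown dataset {dataset!r}; expected one of {sorted(TRAIN_CONFIGS)}")
    return ["bash", "scripts/run_train.sh", TRAIN_CONFIGS[key]]


print(" ".join(train_command("50Salads")))
```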

Extracting frame features

We use the Bridge-Prompt pre-trained image encoders to extract frame-wise features for further downstream tasks (e.g. action segmentation). You can run the following command for each dataset respectively:

python extract_frame_features.py --config ./configs/(dataset_name)/(dataset_name)_exfm.yaml --dataset (dataset_name)

Since 50Salads and Breakfast are large-scale datasets, we extract their frame features in window splits. To combine the splits, please run the following command:

python preprocess/combine_features.py

Please modify the variables dataset and feat_name in combine_features.py for each dataset.
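Conceptually, combining window splits means concatenating per-window features along the time axis in frame order. The sketch below illustrates that idea under stated assumptions: windows are contiguous and non-overlapping, and each is given as a hypothetical `(start_frame, features)` pair. The real features would normally be NumPy arrays loaded from disk; plain lists are used here to keep the example dependency-free. It is not the logic of combine_features.py.

```python
def combine_windows(windows):
    """Concatenate per-window frame features into one sequence ordered by start frame.

    `windows` is a list of (start_frame, features) pairs, where `features`
    holds one feature vector per frame in that window.
    """
    combined = []
    for start, feats in sorted(windows, key=lambda w: w[0]):
        # Assumption: windows are contiguous and do not overlap.
        if start != len(combined):
            raise ValueError(f"window starting at {start} does not follow frame {len(combined) - 1}")
        combined.extend(feats)
    return combined


# Two windows given out of order; each frame has a 2-dimensional toy feature.
video = combine_windows([(2, [[0.3, 0.4], [0.5, 0.6]]), (0, [[0.0, 0.1], [0.1, 0.2]])])
print(len(video))
```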

Action segmentation

You can reproduce the action segmentation results using ASFormer with the previously extracted frame features.

Activity recognition

You can reproduce the activity recognition results on Breakfast from the previously extracted frame features by running:

python ft_acti.py

Ordinal action recognition

Run ordinal action inference with:

bash scripts/run_test.sh ./configs/(dataset_name)/(dataset_name)_test.yaml

Then check the accuracies with:

python preprocess/checknpy.py

Please modify the variable dataset in checknpy.py for each dataset.
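The exact metric computed by checknpy.py is not described here. As a hypothetical illustration of how ordinal predictions could be scored, the sketch below reports accuracy per action position and per whole sequence; the function name and the input format (lists of action label sequences) are assumptions.

```python
def ordinal_accuracy(predictions, targets):
    """Return (per-position accuracy, whole-sequence accuracy) for action sequences."""
    if not targets or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and of equal length")
    correct_positions = total_positions = correct_sequences = 0
    for pred, gt in zip(predictions, targets):
        if not gt or len(pred) != len(gt):
            raise ValueError("each predicted sequence must match its non-empty target in length")
        hits = sum(p == g for p, g in zip(pred, gt))
        correct_positions += hits
        total_positions += len(gt)
        # A sequence counts as correct only if every position matches.
        correct_sequences += int(hits == len(gt))
    return correct_positions / total_positions, correct_sequences / len(targets)


truth = ["take cup", "pour coffee", "pour milk"]
preds = [["take cup", "pour coffee", "pour milk"], ["take cup", "pour milk", "pour coffee"]]
position_acc, sequence_acc = ordinal_accuracy(preds, [truth, truth])
print(position_acc, sequence_acc)
```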

Notes

Please modify pretrain in all config files according to your own working directories.
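If you prefer to update the configs programmatically, a text substitution along the following lines may be enough. This is a sketch under assumptions: it treats `pretrain` as a key written on a single line as `pretrain: <path>`, and the helper name, the example config text, and the checkpoint filename are hypothetical.

```python
import re

# Matches a whole `pretrain: ...` line while keeping its indentation and key intact.
PRETRAIN_LINE = re.compile(r"^([ \t]*pretrain:[ \t]*).*$", re.MULTILINE)


def set_pretrain(config_text, checkpoint_path):
    """Return the config text with the value of every `pretrain:` line replaced."""
    return PRETRAIN_LINE.sub(lambda m: m.group(1) + checkpoint_path, config_text)


# Hypothetical config content and checkpoint name, for illustration only.
example = "seed: 1024\npretrain: /old/path/model.pt\n"
print(set_pretrain(example, "./models/your_checkpoint.pt"))
```

To apply it, read each YAML file under ./configs/ as text, pass it through `set_pretrain`, and write the result back. Note that any trailing comment on the `pretrain` line would be overwritten.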

License

MIT License.
