Moment-DETR code and QVHighlights dataset

Overview

Moment-DETR

QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries

Jie Lei, Tamara L. Berg, Mohit Bansal

For dataset details, please check data/README.md

Getting Started

Prerequisites

  1. Clone this repo
git clone https://github.com/jayleicn/moment_detr.git
cd moment_detr
  1. Prepare feature files

Download moment_detr_features.tar.gz (8GB), extract it under project root directory:

tar -xf path/to/moment_detr_features.tar.gz
  1. Install dependencies.

This code requires Python 3.7, PyTorch, and a few other Python libraries. We recommend creating conda environment and installing all the dependencies as follows:

# create conda env
conda create --name moment_detr python=3.7
# activate env
conda actiavte moment_detr
# install pytorch with CUDA 11.0
conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch
# install other python packages
pip install tqdm ipython easydict tensorboard tabulate scikit-learn pandas

Training

Training can be launched by running the following command:

bash moment_detr/scripts/train.sh 

This will train Moment-DETR for 200 epochs on the QVHighlights train split, with SlowFast and Open AI CLIP features. The training is very fast, it can be done within 4 hours using a single RTX 2080Ti GPU. The checkpoints and other experiment log files will be written into results. For training under different settings, you can append additional command line flags to the command above. For example, if you want to train the model without the saliency loss (by setting the corresponding loss weight to 0):

bash moment_detr/scripts/train.sh --lw_saliency 0

For more configurable options, please checkout our config file moment_detr/config.py.

Inference

Once the model is trained, you can use the following command for inference:

bash moment_detr/scripts/inference.sh CHECKPOINT_PATH SPLIT_NAME  

where CHECKPOINT_PATH is the path to the saved checkpoint, SPLIT_NAME is the split name for inference, can be one of val and test.

Pretraining and Finetuning

Moment-DETR utilizes ASR captions for weakly supervised pretraining. To launch pretraining, run:

bash moment_detr/scripts/pretrain.sh 

This will pretrain the Moment-DETR model on the ASR captions for 100 epochs, the pretrained checkpoints and other experiment log files will be written into results. With the pretrained checkpoint, we can launch finetuning from a pretrained checkpoint PRETRAIN_CHECKPOINT_PATH as:

bash moment_detr/scripts/train.sh  --resume ${PRETRAIN_CHECKPOINT_PATH}

Note that this finetuning process is the same as standard training except that it initializes weights from a pretrained checkpoint.

Evaluation and Codalab Submission

Please check standalone_eval/README.md for details.

Acknowledgement

We thank Linjie Li for the helpful discussions. This code is based on detr and TVRetrieval XML. We used resources from mdetr, MMAction2, CLIP, SlowFast and HERO_Video_Feature_Extractor. We thank the authors for their awesome open-source contributions.

LICENSE

The annotation files are under CC BY-NC-SA 4.0 license, see ./data/LICENSE. All the code are under MIT license, see LICENSE.

Comments
  • About experiments on CharadesSTA dataset

    About experiments on CharadesSTA dataset

    Hi, I noticed that you also conduct experiments on CharadesSTA dataset. I'm wondering how you prepare the video feature in CharadesSTA dataset? Could you share the feature files you prepared?

    opened by xljh0520 8
  • About the annotations

    About the annotations

    Hi @jayleicn, thanks for your great work! I notice that in the annotation files, as shown below, the duration of a video (126s) does not match the actual duration (810s - 660s = 150s). May I ask that should I crop the original video to 126s before processing in this case?

    {
        "qid": 8737, 
        "query": "A family is playing basketball together on a green court outside.", 
        "duration": 126, 
        "vid": "bP5KfdFJzC4_660.0_810.0", 
        "relevant_windows": [[0, 16]],
        "relevant_clip_ids": [0, 1, 2, 3, 4, 5, 6, 7], 
        "saliency_scores": [[4, 1, 1], [4, 1, 1], [4, 2, 1], [4, 3, 2], [4, 3, 2], [4, 3, 3], [4, 3, 3], [4, 3, 2]]
    }
    
    opened by yeliudev 4
  • CodaLab Submission Error

    CodaLab Submission Error

    Hi, I recently generate the test results and validation results on CodaLab as the following structure.

    --Submit.zip
    ----hl_val_submission.jsonl
    ----hl_test_submission.jsonl
    

    The CodaLab gave me the error IOError: [Errno 2] No such file or directory: '/tmp/codalab/tmphfqu8Q/run/input/res/hl_test_submission.jsonl'

    How can I solve this problem?

    opened by vateye 3
  • Video feature extraction

    Video feature extraction

    Hi, thanks for your excellent work! I found that the provided video features include both clip_features and slow_fast features. When it comes to the run_on_video/run.py, the codes only extract the clip features. Is there a mistake here? Besides, could you please provide the run.py extracting both clip and slowfast features? Thank you.

    opened by fxqzb 2
  • About paper

    About paper

    hi, We think that mdetr has great potential, but we look at table 6 in the paper and find that the metics of moment retrieval on the charades-sta dataset is not much higher than that of ivg-dcl (in particular, ivg-dcl adopts C3d feature for video extractor and glove for text embedding), and your work uses clip feature + slowfast). Have you ever tested on other video grounding dataset, like activitynets?

    opened by BMEI1314 2
  • About dataset?

    About dataset?

    Good job. I have read the paper and the github repository, but I still don’t understand how the features such as clip_features, clip_sub_features, clip_text_features, slowfast_features, etc. under the features folder are extracted and the details of the features extracted? Can you describe it in detail if it is convenient?

    opened by dourcer 2
  • [Request for the approval in competition] Hello. can you approve the request?

    [Request for the approval in competition] Hello. can you approve the request?

    Hello.

    Thanks for the great work. Motivated by the work and the interesting topic, we sincerely hope to get approved to be in the competition.

    Thank you!!! Btw, Sorry for bothering you.

    Regards.

    opened by wjun0830 1
  • Meaning of GT saliency scores

    Meaning of GT saliency scores

    Thank you for your great work and open-source code.

    I have an issue with the GT saliency scores (only localized 2-sec clips), can you please explain briefly? besides, how Predicted saliency scores (for all 2-sec clip) corresponds to the previous term?

    Thanks!

    Best, Kevin

    Build models...
    Loading feature extractors...
    Loading CLIP models
    Loading trained Moment-DETR model...
    Run prediction...
    ------------------------------idx0
    >> query: Chef makes pizza and cuts it up.
    >> video_path: run_on_video/example/RoripwjYFp8_60.0_210.0.mp4
    >> GT moments: [[106, 122]]
    >> Predicted moments ([start_in_seconds, end_in_seconds, score]): [
        [49.967, 64.9129, 0.9421], 
        [66.4396, 81.0731, 0.9271], 
        [105.9434, 122.0372, 0.9234], 
        [93.2057, 103.3713, 0.2222], 
        ..., 
        [45.3834, 52.2183, 0.0005]
       ]
    >> GT saliency scores (only localized 2-sec clips):  # what it means?
        [[2, 3, 3], [2, 3, 3], ...]
    >> Predicted saliency scores (for all 2-sec clip):  # how this correspond to the GT saliency scores?
        [-0.9258, -0.8115, -0.7598, ..., 0.0739, 0.1068]  
    
    opened by QinghongLin 1
  • How do I make my dataset ?

    How do I make my dataset ?

    Hi, Congrats on the amazing work. I want to make a data set similar to QVHighlights in my research direction, I have a lot of questions? 1、What annotation tools do you use? And details in the annotation process. 2、How to use CLIP to extract QVHIGHLIGHTS text features ? Can you provide the specific code?

    opened by Yangaiei 1
  • About File missing in run_on_video

    About File missing in run_on_video

    Thank you for your wonderful work! However, when I tried to run your demo in folder run_on_video, the file bpe_simple_vocab_16e6.txt.gz for the tokenizer is missing. Can you provide this file?

    FileNotFoundError: [Errno 2] No such file or directory: 'moment_detr/run_on_video/clip/bpe_simple_vocab_16e6.txt.gz'

    opened by lmfethan 1
  • The meaning of

    The meaning of "tef"

    Hi, I have a question about the "tef" in vision feature:

    if self.use_tef:
        tef_st = torch.arange(0, ctx_l, 1.0) / ctx_l
        tef_ed = tef_st + 1.0 / ctx_l
        tef = torch.stack([tef_st, tef_ed], dim=1)  # (Lv, 2)
        if self.use_video:
            model_inputs["video_feat"] = torch.cat(
                [model_inputs["video_feat"], tef], dim=1)  # (Lv, Dv+2)
        else:
            model_inputs["video_feat"] = tef
    

    What does "tef" mean in the visual feature? Thanks in advance.

    opened by vateye 1
  • Slowfast config setting

    Slowfast config setting

    Hi, thanks for your good work and released code!

    I have a question regarding the feature extractor: which setting did you adopt for the QVHighlight slowfast feature? e.g., SLOWFAST_8x8_R50.

    Thanks!

    Kevin

    opened by QinghongLin 0
  • predicted saliency scores

    predicted saliency scores

    1. How is the predicted saliency scores (for all 2-sec clip) calculated?
    >> Predicted saliency scores (for all 2-sec clip): 
        [-0.9258, -0.8115, -0.7598, ..., 0.0739, 0.1068]  
    
    1. Is it the average of the scores of three people? And why the predicted saliency scores (for all 2-sec clip) is negative.
    opened by Yangaiei 0
Releases(checkpoints)
Owner
Jie Lei 雷杰
UNC CS PhD student, vision+language.
Jie Lei 雷杰
ICON: Implicit Clothed humans Obtained from Normals

ICON: Implicit Clothed humans Obtained from Normals arXiv, December 2021. Yuliang Xiu · Jinlong Yang · Dimitrios Tzionas · Michael J. Black Table of C

Yuliang Xiu 1.1k Dec 30, 2022
Llvlir - Low Level Variable Length Intermediate Representation

Low Level Variable Length Intermediate Representation Low Level Variable Length

Michael Clark 2 Jan 24, 2022
FLSim a flexible, standalone library written in PyTorch that simulates FL settings with a minimal, easy-to-use API

Federated Learning Simulator (FLSim) is a flexible, standalone core library that simulates FL settings with a minimal, easy-to-use API. FLSim is domain-agnostic and accommodates many use cases such a

Meta Research 162 Jan 02, 2023
ElasticFace: Elastic Margin Loss for Deep Face Recognition

This is the official repository of the paper: ElasticFace: Elastic Margin Loss for Deep Face Recognition Paper on arxiv: arxiv Model Log file Pretrain

Fadi Boutros 113 Dec 14, 2022
基于PaddleClas实现垃圾分类,并转换为inference格式用PaddleHub服务端部署

百度网盘链接及提取码: 链接:https://pan.baidu.com/s/1HKpgakNx1hNlOuZJuW6T1w 提取码:wylx 一个垃圾分类项目带你玩转飞桨多个产品(1) 基于PaddleClas实现垃圾分类,导出inference模型并利用PaddleHub Serving进行服务

thomas-yanxin 22 Jul 12, 2022
PyTorch implementation of MICCAI 2018 paper "Liver Lesion Detection from Weakly-labeled Multi-phase CT Volumes with a Grouped Single Shot MultiBox Detector"

Grouped SSD (GSSD) for liver lesion detection from multi-phase CT Note: the MICCAI 2018 paper only covers the multi-phase lesion detection part of thi

Sang-gil Lee 36 Oct 12, 2022
A framework for GPU based high-performance medical image processing and visualization

FAST is an open-source cross-platform framework with the main goal of making it easier to do high-performance processing and visualization of medical images on heterogeneous systems utilizing both mu

Erik Smistad 315 Dec 30, 2022
Collects many various multi-modal transformer architectures, including image transformer, video transformer, image-language transformer, video-language transformer and related datasets

The repository collects many various multi-modal transformer architectures, including image transformer, video transformer, image-language transformer, video-language transformer and related datasets

Jun Chen 139 Dec 21, 2022
Implementation of the Remixer Block from the Remixer paper, in Pytorch

Remixer - Pytorch Implementation of the Remixer Block from the Remixer paper, in Pytorch. It claims that substituting the feedforwards in transformers

Phil Wang 35 Aug 23, 2022
Code To Tune or Not To Tune? Zero-shot Models for Legal Case Entailment.

COLIEE 2021 - task 2: Legal Case Entailment This repository contains the code to reproduce NeuralMind's submissions to COLIEE 2021 presented in the pa

NeuralMind 13 Dec 16, 2022
PyGAD, a Python 3 library for building the genetic algorithm and training machine learning algorithms (Keras & PyTorch).

PyGAD: Genetic Algorithm in Python PyGAD is an open-source easy-to-use Python 3 library for building the genetic algorithm and optimizing machine lear

Ahmed Gad 1.1k Dec 26, 2022
This program can detect your face and add an Christams hat on the top of your head

Auto_Christmas This program can detect your face and add a Christmas hat to the top of your head. just run the Auto_Christmas.py, then you can see the

3 Dec 22, 2021
Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.

This is the Vowpal Wabbit fast online learning code. Why Vowpal Wabbit? Vowpal Wabbit is a machine learning system which pushes the frontier of machin

Vowpal Wabbit 8.1k Jan 06, 2023
Speed-Test - You can check your intenet speed using this tool

Speed-Test Tool By Hez_X AVAILABLE ON : Termux & Kali linux & Ubuntu (Linux E

Hez-X 3 Feb 17, 2022
Computer-Vision-Paper-Reviews - Computer Vision Paper Reviews with Key Summary along Papers & Codes

Computer-Vision-Paper-Reviews Computer Vision Paper Reviews with Key Summary along Papers & Codes. Jonathan Choi 2021 50+ Papers across Computer Visio

Jonathan Choi 2 Mar 17, 2022
DCT-Mask: Discrete Cosine Transform Mask Representation for Instance Segmentation

DCT-Mask: Discrete Cosine Transform Mask Representation for Instance Segmentation This project hosts the code for implementing the DCT-MASK algorithms

Alibaba Cloud 57 Nov 27, 2022
Torch implementation of SegNet and deconvolutional network

Torch implementation of SegNet and deconvolutional network

Fedor Chervinskii 5 Jul 17, 2020
Convolutional neural network web app trained to track our infant’s sleep schedule using our Google Nest camera.

Machine Learning Sleep Schedule Tracker What is it? Convolutional neural network web app trained to track our infant’s sleep schedule using our Google

g-parki 7 Jul 15, 2022
A program that can analyze videos according to the weights you select

MaskMonitor A program that can analyze videos according to the weights you select 下載 訓練完的 weight檔案 執行 MaskDetection.py 內部可更改 輸入來源(鏡頭, 影片, 圖片) 以及輸出條件(人

Patrick_star 1 Nov 07, 2021
LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation (NeurIPS2021 Benchmark and Dataset Track)

LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation by Junjue Wang, Zhuo Zheng, Ailong Ma, Xiaoyan Lu, and Yanfei Zh

Kingdrone 174 Dec 22, 2022