YouRefIt: Embodied Reference Understanding with Language and Gesture

Last update: Jul 11, 2022

by Yixin Chen, Qing Li, Deqian Kong, Yik Lun Kei, Tao Gao, Yixin Zhu, Song-Chun Zhu and Siyuan Huang

The IEEE International Conference on Computer Vision (ICCV), 2021

Introduction

We study the machine's understanding of embodied reference: One agent uses both language and gesture to refer to an object to another agent in a shared physical environment. To tackle this problem, we introduce YouRefIt, a new crowd-sourced, real-world dataset of embodied reference.

For more details, please refer to our paper.

Checklist

Image ERU
Video ERU

Installation

The code was tested with the following environment: Ubuntu 18.04/20.04, Python 3.7/3.8, PyTorch 1.9.1. Run:

    git clone https://github.com/yixchen/YouRefIt_ERU
    cd YouRefIt_ERU
    pip install -r requirements.txt

Dataset

Download the YouRefIt dataset from the Dataset Request Page and place it under ./ln_data.

Model weights

YOLOv3: download the pretrained model and place the file in ./saved_models by running:
```
sh saved_models/yolov3_weights.sh
```
More pretrained models are available on Google Drive and should also be placed in ./saved_models.

Make sure to put the files in the following structure:

|-- ROOT
|   |-- ln_data
|       |-- yourefit
|           |-- images
|           |-- paf
|           |-- saliency
|   |-- saved_models
|       |-- final_model_full.tar
|       |-- final_resc.tar
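
Before training or evaluating, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper name `find_missing` and the `EXPECTED_PATHS` list are hypothetical and simply mirror the folders and files named in this README.

```python
from pathlib import Path

# Relative paths copied from the directory layout in this README.
# This list is an illustration, not something the repository provides.
EXPECTED_PATHS = [
    "ln_data/yourefit/images",
    "ln_data/yourefit/paf",
    "ln_data/yourefit/saliency",
    "saved_models/final_model_full.tar",
    "saved_models/final_resc.tar",
]


def find_missing(root="."):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = find_missing(".")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected files and folders are in place.")
```

Run it from the repository root; any path it reports still needs to be downloaded or moved.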

Training

To train the model, run the following from the main folder:

python train.py --data_root ./ln_data/ --dataset yourefit --gpu gpu_id

Evaluation

To evaluate the model, run the following from the main folder. Use the --test flag to enable test mode.

python train.py --data_root ./ln_data/ --dataset yourefit --gpu gpu_id \
 --resume saved_models/model.pth.tar \
 --test

Evaluate Image ERU on our released model

To evaluate our full model with PAF and saliency features, run:

python train.py --data_root ./ln_data/ --dataset yourefit  --gpu gpu_id \
 --resume saved_models/final_model_full.tar --use_paf --use_sal --large --test

To evaluate the baseline model that only takes images as input, run:

python train.py --data_root ./ln_data/ --dataset yourefit  --gpu gpu_id \
 --resume saved_models/final_resc.tar --large --test
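
The two commands above differ only in the checkpoint and in the --use_paf and --use_sal flags. As a rough sketch, a small wrapper can assemble either one; the function `build_eval_command` is hypothetical and only uses the flags shown in this README.

```python
def build_eval_command(checkpoint, gpu_id, use_paf=False, use_sal=False):
    """Assemble the train.py evaluation command as an argument list.

    Only flags that appear in this README are used; the function itself
    is an illustration and is not shipped with the repository.
    """
    cmd = [
        "python", "train.py",
        "--data_root", "./ln_data/",
        "--dataset", "yourefit",
        "--gpu", str(gpu_id),
        "--resume", checkpoint,
    ]
    if use_paf:
        cmd.append("--use_paf")
    if use_sal:
        cmd.append("--use_sal")
    cmd += ["--large", "--test"]
    return cmd


if __name__ == "__main__":
    # Full model with PAF and saliency features, on GPU 0.
    full = build_eval_command(
        "saved_models/final_model_full.tar", 0, use_paf=True, use_sal=True
    )
    # Image-only baseline, on GPU 0.
    baseline = build_eval_command("saved_models/final_resc.tar", 0)
    print(" ".join(full))
    print(" ".join(baseline))
```

The returned list can be passed directly to `subprocess.run` from the main folder.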

To evaluate the inference results on the test set at different IoU levels, change the path accordingly and run:

 python evaluate_results.py
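
For readers unfamiliar with the metric, the sketch below illustrates accuracy at several IoU levels. It is not the code in evaluate_results.py: the function names, the (x1, y1, x2, y2) box format, and the threshold values 0.25, 0.5, and 0.75 are all assumptions made for illustration.

```python
def box_iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(predictions, ground_truths, thresholds=(0.25, 0.5, 0.75)):
    """Fraction of predictions whose IoU with the ground truth reaches each threshold."""
    ious = [box_iou(p, g) for p, g in zip(predictions, ground_truths)]
    if not ious:
        return {t: 0.0 for t in thresholds}
    return {t: sum(iou >= t for iou in ious) / len(ious) for t in thresholds}


if __name__ == "__main__":
    preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
    gts = [(0, 0, 10, 10), (5, 0, 15, 10)]  # second pair has IoU = 50 / 150
    print(accuracy_at_iou(preds, gts))  # {0.25: 1.0, 0.5: 0.5, 0.75: 0.5}
```

A prediction counts as correct at a given level when its IoU with the ground-truth box meets that threshold, so accuracy can only stay the same or drop as the threshold rises.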

Citation

@inProceedings{chen2021yourefit,
 title={YouRefIt: Embodied Reference Understanding with Language and Gesture},
 author = {Chen, Yixin and Li, Qing and Kong, Deqian and Kei, Yik Lun and Zhu, Song-Chun and Gao, Tao and Zhu, Yixin and Huang, Siyuan},
 booktitle={The IEEE International Conference on Computer Vision (ICCV)},
 year={2021}
 }

Acknowledgement

Our code is built on ReSC and we thank the authors for their hard work.
