Official pytorch implementation for Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion (CVPR 2022)

Overview

Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion

This repository contains a pytorch implementation of "Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion"

report

This codebase provides:

  • train code
  • test code
  • dataset
  • pretrained motion models

The main sections are:

  • Overview
  • Instalation
  • Download Data and Models
  • Training from Scratch
  • Testing with Pretrained Models

Please note, we will not be providing visualization code for the photorealistic rendering.

Overview:

We provide models and code to train and test our listener motion models.

See below for sections:

  • Installation: environment setup and installation for visualization
  • Download data and models: download annotations and pre-trained models
  • Training from scratch: scripts to get the training pipeline running from scratch
  • Testing with pretrianed models: scripts to test pretrained models and save output motion parameters

Installation:

Tested with cuda/9.0, cudnn/v7.0-cuda.9.0, and python 3.6.11

git clone [email protected]:evonneng/learning2listen.git

cd learning2listen/src/
conda create -n venv_l2l python=3.6
conda activate venv_l2l
pip install -r requirements.txt

export L2L_PATH=`pwd`

IMPORTANT: After installing torch, please make sure to modify the site-packages/torch/nn/modules/conv.py file by commenting out the self.padding_mode != 'zeros' line to allow for replicated padding for ConvTranspose1d as shown here.

Download Data and Models:

Download Data:

Please first download the dataset for the corresponding individual with google drive.

Make sure all downloaded .tar files are moved to the directory $L2L_PATH/data/ (e.g. $L2L_PATH/data/conan_data.tar)

Then run the following script.

./scripts/unpack_data.sh

The downloaded data will unpack into the following directory structure as viewed from $L2L_PATH:

|-- data/
    |-- conan/
        |-- test/
            |-- p0_list_faces_clean_deca.npy
            |-- p0_speak_audio_clean_deca.npy
            |-- p0_speak_faces_clean_deca.npy
            |-- p0_speak_files_clean_deca.npy
            |-- p1_list_faces_clean_deca.npy
            |-- p1_speak_audio_clean_deca.npy
            |-- p1_speak_faces_clean_deca.npy
            |-- p1_speak_files_clean_deca.npy
        |-- train/
    |-- devi2/
    |-- fallon/
    |-- kimmel/
    |-- stephen/
    |-- trevor/

Our dataset consists of 6 different youtube channels named accordingly. Please see comments in $L2L_PATH/scripts/download_models.sh for more details.

Data Format:

The data format is as described below:

We denote p0 as the person on the left side of the video, and p1 as the right side.

  • p0_list_faces_clean_deca.npy - face features (N x 64 x 184) for when p0 is listener
    • N sequences of length 64. Features of size 184, which includes the deca parameter set of expression (50D), pose (6D), and details (128D).
  • p0_speak_audio_clean_deca.npy - audio features (N x 256 x 128) for when p0 is speaking
    • N sequences of length 256. Features of size 128 mel features
  • p0_speak_faces_clean_deca.npy - face features (N x 64 x 184) for when p0 is speaking
  • p0_speak_files_clean_deca.npy - file names of the format (N x 64 x 3) for when p0 is speaking

Using Your Own Data:

To train and test on your own videos, please follow this process to convert your data into a compatible format:

(Optional) In our paper, we ran preprocessing to figure out when a each person is speaking or listening. We used this information to segment/chunk up our data. We then extracted speaker-only audio by removing listener back-channels.

  1. Run SyncNet on the video to determine who is speaking when.
  2. Then run Multi Sensory to obtain speaker's audio with all the listener backchannels removed.

For the main processing, we assuming there are 2 people in the video - one speaker and one listener...

  1. Run DECA to extract the facial expression and pose details of the two faces for each frame in the video. For each person combine the extracted features across the video into a (1 x T x (50+6)) matrix and save to p0_list_faces_clean_deca.npy or p0_speak_faces_clean_deca.npy files respectively. Note, in concatenating the features, expression comes first.

  2. Use librosa.feature.melspectrogram(...) to process the speaker's audio into a (1 x 4T x 128) feature. Save to p0_speak_audio_clean_deca.npy.

Download Model:

Please first download the models for the corresponding individual with google drive.

Make sure all downloaded .tar files are moved to the directory $L2L_PATH/models/ (e.g. $L2L_PATH/models/conan_models.tar)

Once downloaded, you can run the follow script to unpack all of the models.

cd $L2L_PATH
./scripts/unpack_models.sh

We provide person-specific models trained for Conan, Fallon, Stephen, and Trevor. Each person-specific model consists of 2 models: 1) VQ-VAE pre-trained codebook of motion in $L2L_PATH/vqgan/models/ and 2) predictor model for listener motion prediction in $L2L_PATH/models/. It is important that the models are paired correctly during test time.

In addition to the models, we also provide the corresponding config files that were used to define the models/listener training setup.

Please see comments in $L2L_PATH/scripts/unpack_models.sh for more details.

Training from Scratch:

Training a model from scratch follows a 2-step process.

  1. Train the VQ-VAE codebook of listener motion:
# --config: the config file associated with training the codebook
# Includes network setup information and listener information
# See provided config: configs/l2_32_smoothSS.json

cd $L2L_PATH/vqgan/
python train_vq_transformer.py --config <path_to_config_file>

Please note, during training of the codebook, it is normal for the loss to increase before decreasing. Typical training was ~2 days on 4 GPUs.

  1. After training of the VQ-VAE has converged, we can begin training the predictor model that uses this codebook.
# --config: the config file associated with training the predictor
# Includes network setup information and codebook information
# Note, you will have to update this config to point to the correct codebook.
# See provided config: configs/vq/delta_v6.json

cd $L2L_PATH
python -u train_vq_decoder.py --config <path_to_config_file>

Training the predictor model should have a much faster convergance. Typical training was ~half a day on 4 GPUs.

Testing with Pretrained Models:

# --config: the config file associated with training the predictor 
# --checkpoint: the path to the pretrained model
# --speaker: can specify which speaker you want to test on (conan, trevor, stephen, fallon, kimmel)

cd $L2L_PATH
python test_vq_decoder.py --config <path_to_config> --checkpoint <path_to_pretrained_model> --speaker <optional>

For our provided models and configs you can run:

python test_vq_decoder.py --config configs/vq/delta_v6.json --checkpoint models/delta_v6_er2er_best.pth --speaker 'conan'

Visualization

As part of responsible practices, we will not be releasing code for the photorealistic visualization pipeline. However, the raw 3D meshes can be rendered using the DECA renderer.

Potentially Coming Soon

  • Visualization of 3D meshes code from saved output
A curated list of awesome resources combining Transformers with Neural Architecture Search

A curated list of awesome resources combining Transformers with Neural Architecture Search

Yash Mehta 173 Jan 03, 2023
GAT - Graph Attention Network (PyTorch) 💻 + graphs + 📣 = ❤️

GAT - Graph Attention Network (PyTorch) 💻 + graphs + 📣 = ❤️ This repo contains a PyTorch implementation of the original GAT paper ( 🔗 Veličković et

Aleksa Gordić 1.9k Jan 09, 2023
Bio-Computing Platform Featuring Large-Scale Representation Learning and Multi-Task Deep Learning “螺旋桨”生物计算工具集

English | 简体中文 Latest News 2021.10.25 Paper "Docking-based Virtual Screening with Multi-Task Learning" is accepted by BIBM 2021. 2021.07.29 PaddleHeli

633 Jan 04, 2023
Additional functionality for use with fastai’s medical imaging module

fmi Adding additional functionality to fastai's medical imaging module To learn more about medical imaging using Fastai you can view my blog Install g

14 Oct 31, 2022
PyTorch implementation of SimSiam: Exploring Simple Siamese Representation Learning

SimSiam: Exploring Simple Siamese Representation Learning This is a PyTorch implementation of the SimSiam paper: @Article{chen2020simsiam, author =

Facebook Research 834 Dec 30, 2022
toroidal - a lightweight transformer library for PyTorch

toroidal - a lightweight transformer library for PyTorch Toroidal transformers are of smaller size and lower weight than the more common E-I types. Th

MathInf GmbH 64 Jan 07, 2023
Bootstrapped Unsupervised Sentence Representation Learning (ACL 2021)

Install first pip3 install -e . Training python3 training/unsupervised_tuning.py python3 training/supervised_tuning.py python3 training/multilingual_

yanzhang_nlp 26 Jul 22, 2022
Scalable Optical Flow-based Image Montaging and Alignment

SOFIMA SOFIMA (Scalable Optical Flow-based Image Montaging and Alignment) is a tool for stitching, aligning and warping large 2d, 3d and 4d microscopy

Google Research 16 Dec 21, 2022
Denoising Diffusion Implicit Models

Denoising Diffusion Implicit Models (DDIM) Jiaming Song, Chenlin Meng and Stefano Ermon, Stanford Implements sampling from an implicit model that is t

465 Jan 05, 2023
A simple python stock Predictor

Python Stock Predictor A simple python stock Predictor Demo Run Locally Clone the project git clone https://github.com/yashraj-n/stock-price-predict

Yashraj narke 5 Nov 29, 2021
Tensorflow port of a full NetVLAD network

netvlad_tf The main intention of this repo is deployment of a full NetVLAD network, which was originally implemented in Matlab, in Python. We provide

Robotics and Perception Group 225 Nov 08, 2022
Moiré Attack (MA): A New Potential Risk of Screen Photos [NeurIPS 2021]

Moiré Attack (MA): A New Potential Risk of Screen Photos [NeurIPS 2021] This repository is the official implementation of Moiré Attack (MA): A New Pot

Dantong Niu 22 Dec 24, 2022
Learning 3D Part Assembly from a Single Image

Learning 3D Part Assembly from a Single Image This repository contains a PyTorch implementation of the paper: Learning 3D Part Assembly from A Single

18 Dec 21, 2022
FedJAX is a library for developing custom Federated Learning (FL) algorithms in JAX.

FedJAX: Federated learning with JAX What is FedJAX? FedJAX is a library for developing custom Federated Learning (FL) algorithms in JAX. FedJAX priori

Google 208 Dec 14, 2022
This repository contains the implementation of the paper Contrastive Instance Association for 4D Panoptic Segmentation using Sequences of 3D LiDAR Scans

Contrastive Instance Association for 4D Panoptic Segmentation using Sequences of 3D LiDAR Scans This repository contains the implementation of the pap

Photogrammetry & Robotics Bonn 40 Dec 01, 2022
Pytorch library for fast transformer implementations

Transformers are very successful models that achieve state of the art performance in many natural language tasks

Idiap Research Institute 1.3k Dec 30, 2022
Deep Inertial Prediction (DIPr)

Deep Inertial Prediction For more information and context related to this repo, please refer to our website. Getting Started (non Docker) Note: you wi

Arcturus Industries 12 Nov 11, 2022
PyTorch implementation for ACL 2021 paper "Maria: A Visual Experience Powered Conversational Agent".

Maria: A Visual Experience Powered Conversational Agent This repository is the Pytorch implementation of our paper "Maria: A Visual Experience Powered

Jokie 22 Dec 12, 2022
Generalized Decision Transformer for Offline Hindsight Information Matching

Generalized Decision Transformer for Offline Hindsight Information Matching [arxiv] If you use this codebase for your research, please cite the paper:

Hiroki Furuta 35 Dec 12, 2022
Causal Imitative Model for Autonomous Driving

Causal Imitative Model for Autonomous Driving Mohammad Reza Samsami, Mohammadhossein Bahari, Saber Salehkaleybar, Alexandre Alahi. arXiv 2021. [Projec

VITA lab at EPFL 8 Oct 04, 2022