Official code for the paper "Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech"

Last update: Aug 23, 2022

ResDAVEnet-VQ

Official PyTorch implementation of "Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech".

What is in this repo?

- Multi-GPU training of ResDAVEnet-VQ
- Quantitative evaluation
  - Image-to-speech and speech-to-image retrieval
  - ZeroSpeech 2019 ABX phone-discriminability test
  - Word detection
- Qualitative evaluation
  - Visualize time-aligned word/phone/code transcripts
  - F1/recall/precision scatter plots for model/layer comparison

If you find the code useful, please cite:

@inproceedings{Harwath2020Learning,
  title={Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech},
  author={David Harwath and Wei-Ning Hsu and James Glass},
  booktitle={International Conference on Learning Representations},
  year={2020},
  url={https://openreview.net/forum?id=B1elCp4KwH}
}

Pre-trained models

Model	R@10	Link	MD5 sum
{}	0.735	gDrive	e3f94990c72ce9742c252b2e04f134e4
{}->{2}	0.760	gDrive	d8ebaabaf882632f49f6aea0a69516eb
{}->{3}	0.794	gDrive	2c3a269c70005cbbaaa15fc545da93fa
{}->{2,3}	0.787	gDrive	d0764d8e97187c8201f205e32b5f7fee
{2}	0.753	gDrive	d68c942069fcdfc3944e556f6af79c60
{2}->{2,3}	0.764	gDrive	09e704f8fcd9f85be8c4d5bdf779bd3b
{2}->{2,3}->{2,3,4}	0.793	gDrive	6e403e7f771aad0c95f087318bf8447e
{3}	0.734	gDrive	a0a3d5adbbd069a2739219346c8a8f70
{3}->{2,3}	0.760	gDrive	6c92bcc4445895876a7840bc6e88892b
{2,3}	0.667	gDrive	7a98a661302939817a1450d033bc2fcc
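
The MD5 sums in the table can be used to check that a downloaded checkpoint is not corrupted. The following is a minimal sketch using Python's standard hashlib module; the helper names md5_of_file and matches_md5 are hypothetical and not part of this repo.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large checkpoints need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_md5):
    """Compare a file's MD5 digest with the expected value, ignoring case."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, matches_md5(downloaded_file, "e3f94990c72ce9742c252b2e04f134e4") would check a download against the first row of the table, where downloaded_file is whatever local path the file was saved to.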

Data preparation

Download the MIT Places Image/Audio Data

To train ResDAVEnet-VQ, we use the MIT Places scene recognition database (Places Image) paired with the MIT Places Audio Caption Corpus (Places Audio) as visually-grounded speech data. Together they contain roughly 400K image/spoken caption pairs.

Places Image can be downloaded here
Places Audio can be downloaded here

Optional data preprocessing

Data specification files can be found at metadata/{train,val}.json inside the Places Audio directory; however, they do not include the time-aligned word transcripts needed for analysis. Versions with alignments can be downloaded here:

train
valid

Open the *.json files and update the values of image_base_path and audio_base_path to reflect the path where the image and the audio datasets are stored.
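
Instead of editing the files by hand, the two values can be updated with a few lines of Python. This is a minimal sketch that assumes image_base_path and audio_base_path are top-level keys of the JSON object; the function name update_base_paths is hypothetical and not part of this repo.

```python
import json


def update_base_paths(json_path, image_base_path, audio_base_path):
    """Rewrite the two base-path entries of a data specification file in place."""
    with open(json_path, "r") as f:
        spec = json.load(f)
    # Assumption: both keys are top-level entries of the JSON object.
    spec["image_base_path"] = image_base_path
    spec["audio_base_path"] = audio_base_path
    with open(json_path, "w") as f:
        json.dump(spec, f)
    return spec
```

Calling the function once per file (train and valid) with the local dataset directories leaves all other entries of the file unchanged.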

To speed up data loading, we save the image and audio data in HDF5 binary files and use the h5py Python interface to access them. The corresponding PyTorch Dataset class is ImageCaptionDatasetHDF5 in ./dataloaders/image_caption_dataset_hdf5.py. To prepare the HDF5 datasets, run

./scripts/preprocess.sh

On-the-fly feature processing is also supported through the ImageCaptionDataset class in ./dataloaders/image_caption_dataset.py, which takes a data specification file (e.g., metadata/train.json) as input. However, this can be very slow.

ImageCaptionDataset and ImageCaptionDatasetHDF5 are interchangeable, but most scripts in this repo assume the preprocessed HDF5 dataset is available. To use ImageCaptionDataset instead, users have to modify the code accordingly.

Interactive Qualitative Evaluation

See run_evaluations.ipynb

Quantitative Evaluation

ZeroSpeech 2019 ABX Phone Discriminability Test

Users need to download the dataset and the Docker image by following the instructions here.

To extract ResDAVEnet-VQ features, see ./scripts/dump_zs19_abx.sh.

Word detection

See ./run_unit_analysis.py. It needs both the HDF5 dataset and the original JSON dataset to get the time-aligned word transcripts.

Example:

python run_unit_analysis.py --hdf5_path=$hdf5_path --json_path=$json_path \
  --exp_dir=$exp_dir --layer=$layer --output_dir=$out_dir

Cross-modal retrieval

See ./run_ResDavenetVQ.py. Set --mode=eval for retrieval evaluation.

Example:

python run_ResDavenetVQ.py --resume=True --mode=eval \
  --data-train=$data_tr --data-val=$data_dt \
  --exp-dir="./exps/pretrained/RDVQ_01000_01100_01110"
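
As a rough illustration of what a retrieval metric such as recall at k measures, the sketch below scores a toy similarity matrix in both directions. It is not the repo's evaluation code: the function names and the toy scores are hypothetical, and the actual script may compute similarities and recall differently.

```python
def recall_at_k(similarity, k=10):
    """Fraction of queries whose true match is among the top-k candidates.

    similarity[i][j] is the score between query i and candidate j;
    the true match of query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidate indices by descending score for this query.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def transpose(matrix):
    """Swap query and candidate roles to evaluate the opposite direction."""
    return [list(col) for col in zip(*matrix)]


# Toy 3x3 example: rows are spoken captions, columns are images.
toy = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.15, 0.3],
]
speech_to_image = recall_at_k(toy, k=1)
image_to_speech = recall_at_k(transpose(toy), k=1)
```

The same function serves both retrieval directions because transposing the matrix swaps which modality acts as the query.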

Training

See ./scripts/train.sh.

To train a model from scratch with the 2nd and 3rd layers quantized, run

./scripts/train.sh 01100 RDVQ_01100 ""

To train a model with the 2nd and 3rd layers quantized, and initialize weights from a pre-trained model (e.g., ./exps/RDVQ_00000), run

./scripts/train.sh 01100 RDVQ_01100 "--seed-dir ./exps/RDVQ_00000"
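
The first argument in the commands above is a string of 0/1 flags, where a 1 in position i means the i-th layer is quantized (01100 selects the 2nd and 3rd layers). The sketch below decodes such strings. Reading an experiment name like RDVQ_01000_01100_01110 as a sequence of training stages in the table's {…}->{…} notation is an assumption made here for illustration, and both function names are hypothetical.

```python
def quantized_layers(flags):
    """Return the 1-indexed layers marked with '1' in a flag string such as '01100'."""
    if not flags or any(bit not in "01" for bit in flags):
        raise ValueError(f"expected a non-empty string of 0s and 1s, got {flags!r}")
    return [i for i, bit in enumerate(flags, start=1) if bit == "1"]


def describe_experiment(name):
    """Render e.g. 'RDVQ_01000_01100' as '{2}->{2,3}'.

    Assumption: every '_'-separated field after the prefix is one training stage.
    """
    stages = name.split("_")[1:]
    rendered = []
    for flags in stages:
        layers = ",".join(str(layer) for layer in quantized_layers(flags))
        rendered.append("{" + layers + "}")
    return "->".join(rendered)


print(quantized_layers("01100"))
print(describe_experiment("RDVQ_01000_01100_01110"))
```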

Owner

Wei-Ning Hsu
