Official code for the paper "Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech"

Overview

ResDAVEnet-VQ

Official PyTorch implementation of Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech

What is in this repo?

  • Multi-GPU training of ResDAVEnet-VQ
  • Quantitative evaluation
    • Image-to-speech and speech-to-image retrieval
    • ZeroSpeech 2019 ABX phone-discriminability test
    • Word detection
  • Qualitative evaluation
    • Visualize time-aligned word/phone/code transcripts
    • F1/recall/precision scatter plots for model/layer comparison


If you find the code useful, please cite

@inproceedings{Harwath2020Learning,
  title={Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech},
  author={David Harwath and Wei-Ning Hsu and James Glass},
  booktitle={International Conference on Learning Representations},
  year={2020},
  url={https://openreview.net/forum?id=B1elCp4KwH}
}

Pre-trained models

Model R@10 Link MD5 sum
{} 0.735 gDrive e3f94990c72ce9742c252b2e04f134e4
{}->{2} 0.760 gDrive d8ebaabaf882632f49f6aea0a69516eb
{}->{3} 0.794 gDrive 2c3a269c70005cbbaaa15fc545da93fa
{}->{2,3} 0.787 gDrive d0764d8e97187c8201f205e32b5f7fee
{2} 0.753 gDrive d68c942069fcdfc3944e556f6af79c60
{2}->{2,3} 0.764 gDrive 09e704f8fcd9f85be8c4d5bdf779bd3b
{2}->{2,3}->{2,3,4} 0.793 gDrive 6e403e7f771aad0c95f087318bf8447e
{3} 0.734 gDrive a0a3d5adbbd069a2739219346c8a8f70
{3}->{2,3} 0.760 gDrive 6c92bcc4445895876a7840bc6e88892b
{2,3} 0.667 gDrive 7a98a661302939817a1450d033bc2fcc
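
After downloading a checkpoint, the MD5 sums in the table can be used to check that the file is intact. The following is a minimal sketch using only the Python standard library; the function names and the file name in the usage note are hypothetical and not part of this repo.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's MD5 matches the value listed in the table."""
    return md5sum(path) == expected_md5.strip().lower()
```

For example, `verify_download("model.tar.gz", "e3f94990c72ce9742c252b2e04f134e4")` would check the first row of the table, assuming the download was saved under that (hypothetical) name.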

Data preparation

Download the MIT Places Image/Audio Data

We train ResDAVEnet-VQ on visually-grounded speech: the MIT Places scene recognition database (Places Image) paired with the MIT Places Audio Caption Corpus (Places Audio), which together contain roughly 400K image/spoken caption pairs.

  • Places Image can be downloaded here
  • Places Audio can be downloaded here

Optional data preprocessing

Data specification files can be found at metadata/{train,val}.json inside the Places Audio directory; however, they do not include the time-aligned word transcripts needed for analysis. Versions with alignments can be downloaded here:

Open the *.json files and update the values of image_base_path and audio_base_path to reflect the path where the image and the audio datasets are stored.
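
If you prefer to script this edit, something like the sketch below may help. It assumes that image_base_path and audio_base_path are top-level keys of the JSON object; check your files first, since that layout is an assumption here. The function name and the paths in the usage note are hypothetical.

```python
import json


def set_base_paths(json_path, image_base_path, audio_base_path):
    """Rewrite the two base-path entries of a data specification file in place.

    Assumes both keys sit at the top level of the JSON object.
    """
    with open(json_path) as f:
        spec = json.load(f)

    spec["image_base_path"] = image_base_path
    spec["audio_base_path"] = audio_base_path

    with open(json_path, "w") as f:
        json.dump(spec, f)
    return spec
```

Usage would look like `set_base_paths("metadata/train.json", "/data/places_image", "/data/places_audio")`, with the two directories replaced by wherever the datasets are stored on your machine.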

To speed up data loading, we save the image and audio data into HDF5 binary files and use the h5py Python interface to access them. The corresponding PyTorch Dataset class is ImageCaptionDatasetHDF5 in ./dataloaders/image_caption_dataset_hdf5.py. To prepare the HDF5 datasets, run

./scripts/preprocess.sh

(On-the-fly feature processing is also supported through the ImageCaptionDataset class in ./dataloaders/image_caption_dataset.py, which takes a data specification file as input (e.g., metadata/train.json). However, this can be very slow.)

ImageCaptionDataset and ImageCaptionDatasetHDF5 are interchangeable, but most scripts in this repo assume that the preprocessed HDF5 dataset is available. To use ImageCaptionDataset instead, users have to modify the code accordingly.

Interactive Qualitative Evaluation

See run_evaluations.ipynb

Quantitative Evaluation

ZeroSpeech 2019 ABX Phone Discriminability Test

Users need to download the dataset and the Docker image by following the instructions here.

To extract ResDAVEnet-VQ features, see ./scripts/dump_zs19_abx.sh.

Word detection

See ./run_unit_analysis.py. It requires both the HDF5 dataset and the original JSON dataset in order to obtain the time-aligned word transcripts.

Example:

python run_unit_analysis.py --hdf5_path=$hdf5_path --json_path=$json_path \
  --exp_dir=$exp_dir --layer=$layer --output_dir=$out_dir

Cross-modal retrieval

See ./run_ResDavenetVQ.py. Set --mode=eval for retrieval evaluation.

Example:

python run_ResDavenetVQ.py --resume=True --mode=eval \
  --data-train=$data_tr --data-val=$data_dt \
  --exp-dir="./exps/pretrained/RDVQ_01000_01100_01110"
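
For readers unfamiliar with the R@10 metric reported in the pre-trained models table, the sketch below shows the usual definition of recall@K for paired retrieval. It is an illustration of the general metric in pure Python, not the code in run_ResDavenetVQ.py, and it assumes that query i's true match is candidate i.

```python
def recall_at_k(similarity, k=10):
    """Fraction of queries whose true match ranks within the top k.

    similarity[i][j] is the score between query i and candidate j;
    the true match for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # 0-based rank of the true match: how many candidates score strictly higher
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)
```

With rows as spoken captions and columns as images this gives speech-to-image recall; passing the transpose, e.g. `recall_at_k(list(zip(*similarity)))`, gives the image-to-speech direction.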

Training

See ./scripts/train.sh.

To train a model from scratch with the 2nd and 3rd layers quantized, run

./scripts/train.sh 01100 RDVQ_01100 ""

To train a model with the 2nd and 3rd layers quantized, and initialize weights from a pre-trained model (e.g., ./exps/RDVQ_00000), run

./scripts/train.sh 01100 RDVQ_01100 "--seed-dir ./exps/RDVQ_00000"
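
Judging from the examples in this README, the first argument appears to be a string with one digit per layer, where a 1 in position i means the i-th layer is quantized (01100 for the 2nd and 3rd layers, 00000 for none). Experiment names such as RDVQ_01000_01100_01110 then seem to chain the successive training stages. The sketch below decodes such strings into the {..}->{..} notation used in the pre-trained models table; this mapping is inferred from the examples rather than documented, and both helper names are hypothetical.

```python
def quantized_layers(bits):
    """Decode a flag string such as '01100' into 1-indexed layer numbers."""
    if not bits or set(bits) - {"0", "1"}:
        raise ValueError(f"expected a non-empty string of 0s and 1s, got {bits!r}")
    return [i for i, b in enumerate(bits, start=1) if b == "1"]


def describe_schedule(exp_name, prefix="RDVQ_"):
    """Turn 'RDVQ_01000_01100_01110' into '{2}->{2,3}->{2,3,4}'.

    Assumes the name is the prefix followed by one flag string per
    training stage, separated by underscores.
    """
    if exp_name.startswith(prefix):
        exp_name = exp_name[len(prefix):]
    stages = exp_name.split("_")
    return "->".join(
        "{" + ",".join(str(n) for n in quantized_layers(s)) + "}" for s in stages
    )
```

Under these assumptions, `describe_schedule("RDVQ_01100")` yields `{2,3}` and the seed model `RDVQ_00000` yields `{}`.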