Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition

Overview

Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition

Official implementation of the Efficient Conformer, progressively downsampled Conformer with grouped attention for Automatic Speech Recognition.

Efficient Conformer Encoder

Inspired from previous works done in Automatic Speech Recognition and Computer Vision, the Efficient Conformer encoder is composed of three encoder stages where each stage comprises a number of Conformer blocks using grouped attention. The encoded sequence is progressively downsampled and projected to wider feature dimensions, lowering the amount of computation while achieving better performance. Grouped multi-head attention reduce attention complexity by grouping neighbouring time elements along the feature dimension before applying scaled dot-product attention.

Installation

Clone GitHub repository and set up environment

git clone https://github.com/burchim/EfficientConformer.git
cd EfficientConformer
pip install -r requirements.txt

Install ctcdecode

Download LibriSpeech

Librispeech is a corpus of approximately 1000 hours of 16kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project, and has been carefully segmented and aligned.

cd datasets
./download_LibriSpeech.sh

Running an experiment

You can run an experiment by providing a config file using the '--config_file' flag. Training checkpoints and logs will be saved in the callback folder specified in the config file. Note that '--prepare_dataset' and '--create_tokenizer' flags may be needed for your first experiment.

python main.py --config_file configs/config_file.json

Evaluation

Models can be evaluated by selecting a subset validation/test mode and by providing the epoch/name of the checkpoint to load for evaluation with the '--initial_epoch' flag. The '--gready' flag designates whether to use gready search or beam search decoding for evaluation.

python main.py --config_file configs/config_file.json --initial_epoch epoch/name --mode validation/test --gready

Options

-c / --config_file		type=str   default="configs/EfficientConformerCTCSmall.json"	help="Json configuration file containing model hyperparameters"
-m / --mode                	type=str   default="training"                               	help="Mode : training, validation-clean, test-clean, eval_time-dev-clean, ..."
-d / --distributed         	action="store_true"                                            	help="Distributed data parallelization"
-i / --initial_epoch  		type=str   default=None                                       	help="Load model from checkpoint"
--initial_epoch_lm         	type=str   default=None                                       	help="Load language model from checkpoint"
--initial_epoch_encoder    	type=str   default=None                                       	help="Load model encoder from encoder checkpoint"
-p / --prepare_dataset		action="store_true"                                            	help="Prepare dataset before training"
-j / --num_workers        	type=int   default=8                                          	help="Number of data loading workers"
--create_tokenizer         	action="store_true"                                            	help="Create model tokenizer"
--batch_size_eval      		type=int   default=8                                          	help="Evaluation batch size"
--verbose_val              	action="store_true"                                            	help="Evaluation verbose"
--val_steps                	type=int   default=None                                       	help="Number of validation steps"
--steps_per_epoch      		type=int   default=None                                       	help="Number of steps per epoch"
--world_size               	type=int   default=torch.cuda.device_count()                  	help="Number of available GPUs"
--cpu                      	action="store_true"                                            	help="Load model on cpu"
--show_dict            		action="store_true"                                            	help="Show model dict summary"
--swa                      	action="store_true"                                            	help="Stochastic weight averaging"
--swa_epochs               	nargs="+"   default=None                                       	help="Start epoch / end epoch for swa"
--swa_epochs_list      		nargs="+"   default=None                                       	help="List of checkpoints epochs for swa"
--swa_type                   	type=str   default="equal"                                    	help="Stochastic weight averaging type (equal/exp)"
--parallel                   	action="store_true"                                            	help="Parallelize model using data parallelization"
--rnnt_max_consec_dec_steps  	type=int   default=None                                       	help="Number of maximum consecutive transducer decoder steps during inference"
--eval_loss                  	action="store_true"                                            	help="Compute evaluation loss during evaluation"
--gready                     	action="store_true"                                            	help="Proceed to a gready search evaluation"
--saving_period              	type=int   default=1                                          	help="Model saving every 'n' epochs"
--val_period                 	type=int   default=1                                          	help="Model validation every 'n' epochs"
--profiler                   	action="store_true"                                            	help="Enable eval time profiler"

Monitor training

tensorboard --logdir callback_path

LibriSpeech Performance

Model Size Type Params (M) test-clean/test-other gready WER (%) test-clean/test-other n-gram WER (%) GPUs
Efficient Conformer Small CTC 13.2 3.6 / 9.0 2.7 / 6.7 4 x RTX 2080 Ti
Efficient Conformer Medium CTC 31.5 3.0 / 7.6 2.4 / 5.8 4 x RTX 2080 Ti
Efficient Conformer Large CTC 125.6 2.5 / 5.8 2.1 / 4.7 4 x RTX 3090

Reference

Maxime Burchi, Valentin Vielzeuf. Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition.

Author

Owner
Maxime Burchi
Master of Engineering in Computer Science, ESIEE Paris
Maxime Burchi
Collision risk estimation using stochastic motion models

collision_risk_estimation Collision risk estimation using stochastic motion models. This is a new approach, based on stochastic models, to predict the

Unmesh 7 Jun 26, 2022
Deep Learning for Natural Language Processing SS 2021 (TU Darmstadt)

Deep Learning for Natural Language Processing SS 2021 (TU Darmstadt) Task Training huge unsupervised deep neural networks yields to strong progress in

2 Aug 05, 2022
Ansible Automation Example: JSNAPY PRE/POST Upgrade Validation

Ansible Automation Example: JSNAPY PRE/POST Upgrade Validation Overview This example will show how to validate the status of our firewall before and a

Calvin Remsburg 1 Jan 07, 2022
Official implementation of "Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision" ECCV2020

XDVioDet Official implementation of "Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision" ECCV2020. The proj

peng 64 Dec 12, 2022
Protect against subdomain takeover

domain-protect scans Amazon Route53 across an AWS Organization for domain records vulnerable to takeover deploy to security audit account scan your en

OVO Technology 0 Nov 17, 2022
Recommendation algorithms for large graphs

Fast recommendation algorithms for large graphs based on link analysis. License: Apache Software License Author: Emmanouil (Manios) Krasanakis Depende

Multimedia Knowledge and Social Analytics Lab 27 Jan 07, 2023
BERT model training impelmentation using 1024 A100 GPUs for MLPerf Training v1.1

Pre-trained checkpoint and bert config json file Location of checkpoint and bert config json file This MLCommons members Google Drive location contain

SAIT (Samsung Advanced Institute of Technology) 12 Apr 27, 2022
CCP dataset from Clothing Co-Parsing by Joint Image Segmentation and Labeling

Clothing Co-Parsing (CCP) Dataset Clothing Co-Parsing (CCP) dataset is a new clothing database including elaborately annotated clothing items. 2, 098

Wei Yang 434 Dec 24, 2022
CryptoFrog - My First Strategy for freqtrade

cryptofrog-strategies CryptoFrog - My First Strategy for freqtrade NB: (2021-04-20) You'll need the latest freqtrade develop branch otherwise you migh

Robert Davey 137 Jan 01, 2023
A benchmark dataset for emulating atmospheric radiative transfer in weather and climate models with machine learning (NeurIPS 2021 Datasets and Benchmarks Track)

ClimART - A Benchmark Dataset for Emulating Atmospheric Radiative Transfer in Weather and Climate Models Official PyTorch Implementation Using deep le

21 Dec 31, 2022
PanopticBEV - Bird's-Eye-View Panoptic Segmentation Using Monocular Frontal View Images

Bird's-Eye-View Panoptic Segmentation Using Monocular Frontal View Images This r

63 Dec 16, 2022
AI drive app that can help user become beautiful.

爱美丽 Beauty 简体中文 Features Beauty is an AI drive app that can help user become beautiful. it contain those functions: face score cheek face beauty repor

Starved Midnight 1 Jan 30, 2022
A Comparative Framework for Multimodal Recommender Systems

Cornac Cornac is a comparative framework for multimodal recommender systems. It focuses on making it convenient to work with models leveraging auxilia

Preferred.AI 671 Jan 03, 2023
Minimal deep learning library written from scratch in Python, using NumPy/CuPy.

SmallPebble Project status: experimental, unstable. SmallPebble is a minimal/toy automatic differentiation/deep learning library written from scratch

Sidney Radcliffe 92 Dec 30, 2022
This is a Pytorch implementation of the paper: Self-Supervised Graph Transformer on Large-Scale Molecular Data.

This is a Pytorch implementation of the paper: Self-Supervised Graph Transformer on Large-Scale Molecular Data.

212 Dec 25, 2022
Global-Local Attention for Emotion Recognition

Global-Local Attention for Emotion Recognition Requirements Python 3 Install tensorflow (or tensorflow-gpu) = 2.0.0 Install some other packages pip i

Minh Nhat Le 15 Apr 21, 2022
Dcf-game-infrastructure-public - Contains all the components necessary to run a DC finals (attack-defense CTF) game from OOO

dcf-game-infrastructure All the components necessary to run a game of the OOO DC

Order of the Overflow 46 Sep 13, 2022
Any-to-any voice conversion using synthetic specific-speaker speeches as intermedium features

MediumVC MediumVC is an utterance-level method towards any-to-any VC. Before that, we propose SingleVC to perform A2O tasks(Xi → Ŷi) , Xi means utter

谷下雨 47 Dec 25, 2022
Leveraging OpenAI's Codex to solve cornerstone problems in Music

Music-Codex Leveraging OpenAI's Codex to solve cornerstone problems in Music Please NOTE: Presented generated samples were created by OpenAI's Codex P

Alex 2 Mar 11, 2022
Kaggleship: Kaggle Notebooks

Kaggleship: Kaggle Notebooks This repository contains my Kaggle notebooks. They are generally about data science, machine learning, and deep learning.

Erfan Sobhaei 1 Jan 25, 2022