Multistream Convolutional Neural Network (CNN)

A multistream CNN is a novel neural network architecture for robust acoustic modeling in speech recognition tasks. It processes input speech at diverse temporal resolutions by applying a different dilation rate to the convolutional layers in each of several parallel streams, which is the source of its robustness. The dilation rates are selected from multiples of the sub-sampling rate of 3 frames. Each stream stacks TDNN-F layers (a variant of 1D CNN), and the output embedding vectors from the streams are concatenated and then projected to the final layer.
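The stream-and-concatenate idea can be sketched in a few lines of plain Python. This is only an illustration, not the Kaldi implementation: the function names, the dilation rates (3, 6, 9), the averaging kernel, and the two-layer depth are all assumptions chosen for readability, a plain dilated convolution stands in for a TDNN-F layer, and the final projection is omitted.

```python
from typing import List, Sequence


def dilated_conv1d(frames: Sequence[float], weights: Sequence[float],
                   dilation: int) -> List[float]:
    """1-D convolution over time with the given dilation, zero-padded at the edges."""
    half = len(weights) // 2
    out = []
    for t in range(len(frames)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t + (k - half) * dilation  # taps are `dilation` frames apart
            if 0 <= idx < len(frames):
                acc += w * frames[idx]
        out.append(acc)
    return out


def multistream(frames: Sequence[float], dilations: Sequence[int],
                num_layers: int = 2,
                weights: Sequence[float] = (1 / 3, 1 / 3, 1 / 3)) -> List[List[float]]:
    """Run one stack of dilated convolutions per stream, then concatenate per frame."""
    streams = []
    for d in dilations:  # one stream per dilation rate
        x = list(frames)
        for _ in range(num_layers):
            x = dilated_conv1d(x, weights, d)
        streams.append(x)
    # Concatenate the stream outputs frame by frame: one vector per frame.
    return [list(values) for values in zip(*streams)]


def receptive_field(kernel_size: int, dilation: int, num_layers: int) -> int:
    """Number of input frames seen by one output frame of a stream."""
    return 1 + num_layers * (kernel_size - 1) * dilation


if __name__ == "__main__":
    impulse = [0.0] * 41
    impulse[20] = 1.0
    embeddings = multistream(impulse, dilations=(3, 6, 9))  # assumed rates
    print(len(embeddings), len(embeddings[0]))
    for d in (3, 6, 9):
        print(d, receptive_field(3, d, 2))
```

With the same kernel size and depth, a stream with a larger dilation rate covers a wider span of input frames, so each stream views the signal at a different temporal resolution before the outputs are concatenated.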

References

Multistream CNN for Robust Acoustic Modeling [paper]

@inproceedings{han2021multistream-cnn,
  title={Multistream CNN for Robust Acoustic Modeling},
  author={Kyu J. Han and Jing Pan and Venkata Krishna Naveen Tadala and Tao Ma and Dan Povey},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021}
}

ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech Recognition [paper]

@inproceedings{pan2020asapp-asr,
  title={ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech Recognition},
  author={Jing Pan and Joshua Shapiro and Jeremy Wohlwend and Kyu J. Han and Tao Lei and Tao Ma},
  booktitle={Interspeech},
  year={2020}
}

Installation

Please follow the original Kaldi build sequence, as below.

>> cd tools; make; cd ../src; ./configure; make clean; make -j clean depend; make -j all

Recipes and Results

LibriSpeech

>> egs/librispeech/s5/local/chain/run_multistream_cnn_1a.sh

WER (%)	dev-clean	dev-other	test-clean	test-other
tdnn_1d	3.29	8.71	3.80	8.76
multistream_cnn_1a	3.20	7.68	3.54	7.87
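To read the table as relative improvements, the following sketch computes the relative reduction of multistream_cnn_1a over tdnn_1d on each set. It assumes the figures are word error rates in percent; the helper name `relative_reduction` is hypothetical and not part of the recipe.

```python
def relative_reduction(baseline: float, system: float) -> float:
    """Relative error reduction of `system` over `baseline`, in percent."""
    return 100.0 * (baseline - system) / baseline


# (tdnn_1d, multistream_cnn_1a) pairs copied from the LibriSpeech table above.
librispeech = {
    "dev-clean": (3.29, 3.20),
    "dev-other": (8.71, 7.68),
    "test-clean": (3.80, 3.54),
    "test-other": (8.76, 7.87),
}

if __name__ == "__main__":
    for name, (baseline, system) in librispeech.items():
        print(f"{name}: {relative_reduction(baseline, system):.1f}% relative")
```

The gains are larger on the "other" sets than on the "clean" sets.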

Fisher-SWBD

>> egs/fisher_swbd/s5/local/chain/run_multistream_cnn_1a.sh

WER (%)	eval2000	swbd	callhm
tdnn_7d	12.6	8.8	16.3
multistream_cnn_1a	12.6	9.2	15.7

Owner

ASAPP Research
