DeLighT: Very Deep and Light-Weight Transformers

Last update: Dec 18, 2022

This repository contains the source code of our work on building efficient sequence models: DeFINE (ICLR'20) and DeLighT (preprint).

Table of contents

Overview
Requirements and installation
Training, evaluation, and results
Multiplication-addition operations
Citation
Acknowledgements
Issues

Overview

In this repository, we share the source code of our paper DeLighT, which delivers similar or better performance than transformer-based models with significantly fewer parameters. DeLighT allocates parameters more efficiently both (1) within each Transformer block, using DExTra, a deep and light-weight transformation, and (2) across blocks, using block-wise scaling, which allows for shallower and narrower DeLighT blocks near the input and wider and deeper DeLighT blocks near the output. Overall, DeLighT networks are 2.5 to 4 times deeper than standard transformer models and yet have fewer parameters and operations. For details, see our papers: DeFINE and DeLighT.
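To make the idea of block-wise scaling concrete, here is a minimal sketch that grows a per-block depth and a per-block width multiplier from the input side to the output side. The linear interpolation, the function name `linear_schedule`, and the example values are assumptions chosen for illustration; they are not taken from the repository code, and the papers give the exact formulation.

```python
def linear_schedule(num_blocks, start, end):
    """Interpolate linearly from `start` (block nearest the input)
    to `end` (block nearest the output). Hypothetical helper."""
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    if num_blocks == 1:
        return [float(start)]
    step = (end - start) / (num_blocks - 1)
    return [start + step * b for b in range(num_blocks)]


# Illustrative values only: 5 blocks, depth growing from 4 to 8 layers,
# width multiplier growing from 2.0 to 4.0.
depths = [round(d) for d in linear_schedule(5, 4, 8)]
widths = linear_schedule(5, 2.0, 4.0)

for block, (depth, width) in enumerate(zip(depths, widths)):
    print(f"block {block}: depth={depth}, width multiplier={width}")
```

Blocks near the input end up shallower and narrower, and blocks near the output deeper and wider, which is the allocation pattern described above.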

Requirements and Installation

PyTorch version >= 1.4.0
Python version >= 3.6
For training new models, you'll also need an NVIDIA GPU and NCCL
To use DeLighT, you need to install fairseq and develop locally:

git clone https://github.com/sacmehta/delight
cd delight
pip install --editable ./
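Before training, it may help to confirm that the environment meets the minimum versions listed above. The following is a small sketch; the helpers `parse_version` and `meets_minimum` are hypothetical names introduced here and are not part of fairseq or this repository.

```python
import re
import sys


def parse_version(text):
    """Turn '1.4.0' or '1.13.1+cu117' into a 3-tuple of ints,
    ignoring local suffixes such as '+cu117'."""
    numbers = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    # Pad so that '1.4' compares equal to '1.4.0'.
    return tuple((numbers + [0, 0, 0])[:3])


def meets_minimum(installed, minimum):
    return parse_version(installed) >= parse_version(minimum)


python_ok = sys.version_info[:2] >= (3, 6)
print("Python >= 3.6:", python_ok)

try:
    import torch
    print("PyTorch >= 1.4.0:", meets_minimum(torch.__version__, "1.4.0"))
except ImportError:
    print("PyTorch is not installed")
```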

For faster training, install NVIDIA's apex library:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./

Training, Evaluation, and Results

For training, evaluation, and results, see the links below. To ease reproduction of our results, we also provide links to training logs.

Neural machine translation

Language Modeling

WikiText-103

Multiplication-Addition Operations

We have added module profiling for both Transformer and DeLighT networks. It can be enabled with the --print-stats argument, which prints a model summary (by default for 20 tokens), similar to the screenshot below. To profile with longer source and target sequences, use the --src-len-ps and --tgt-len-ps flags.
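As a rough illustration of why the profiled sequence length matters, the sketch below counts multiplication-addition operations for dense layers applied to every token. It is not the repository's profiler: the function names and layer sizes are assumptions, and bias terms, attention, and normalization are ignored.

```python
def linear_macs(in_features, out_features, seq_len=20):
    """Multiply-adds for one dense layer applied to each of `seq_len` tokens."""
    return seq_len * in_features * out_features


def ffn_macs(model_dim, hidden_dim, seq_len=20):
    """Multiply-adds for a two-layer feed-forward network (expand, then reduce)."""
    return (linear_macs(model_dim, hidden_dim, seq_len)
            + linear_macs(hidden_dim, model_dim, seq_len))


# Illustrative sizes: 512-dimensional model, 2048-dimensional hidden layer.
print(ffn_macs(512, 2048))              # default of 20 tokens
print(ffn_macs(512, 2048, seq_len=40))  # doubling the length doubles the count
```

For these position-wise layers the count scales linearly with the number of tokens, so statistics gathered at different sequence lengths are not directly comparable.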

Citation

If you find our work useful, please consider citing the following works:

@misc{mehta2020delight,
    title={DeLighT: Very Deep and Light-weight Transformer},
    author={Sachin Mehta and Marjan Ghazvininejad and Srinivasan Iyer and Luke Zettlemoyer and Hannaneh Hajishirzi},
    year={2020},
    eprint={2008.00623},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

@inproceedings{mehta2019define,
  title={DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling},
  author={Mehta, Sachin and Koncel-Kedziorski, Rik and Rastegari, Mohammad and Hajishirzi, Hannaneh},
  booktitle={International Conference on Learning Representations},
  year={2019}
}

Acknowledgements

We would like to thank the Fairseq team for building an easy-to-use sequence library.

Issues

Thanks for your interest in our work. For any issues, please raise a request.

Owner

Sachin Mehta