SmallInitEmb - LayerNorm(SmallInit(Embedding)) in a Transformer to improve convergence

Last update: Dec 25, 2022

SmallInitEmb

LayerNorm(SmallInit(Embedding)) in a Transformer

I find that when training a transformer, the embedding matrix changes slowly, so it is hard for the model to escape its initial noisy embedding.

(initial embedding)
[[-0.0073  0.0062 -0.0261 ...  0.0086  0.0107 -0.008 ] ... ]
(after 1 step, the directions of the embedding vectors have barely moved, because each entry changes by only ~LR = ~4e-4)
[[-0.0069  0.0066 -0.0265 ...  0.009   0.0111 -0.0084] ... ]

So I propose initializing the embedding matrix to tiny values and putting another LayerNorm after it (before all the SA & FFN layers):

# weight initialization
if isinstance(module, nn.Embedding):
    nn.init.uniform_(module.weight, a=-1e-4, b=1e-4) # SmallInit(Emb)
...
# forward pass of a block (PreLN)
if self.config.USE_SMALL_EMB and self.layer_id == 0:
    x = self.lnPre(x) # LN(SmallInit(Emb)), applied only in the first block
x = x + self.att(self.ln1(x))
x = x + self.ffn(self.ln2(x))
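
To show where the two fragments above sit in a full module, here is a minimal PyTorch sketch. The class names `Block` and `TinyModel`, the sizes, and the plain `nn.Linear` standing in for self-attention are assumptions made for illustration; they are not taken from the original training code.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """PreLN block. The attention layer is a stand-in nn.Linear (assumption)."""
    def __init__(self, dim, layer_id, use_small_emb=True):
        super().__init__()
        self.layer_id = layer_id
        self.use_small_emb = use_small_emb
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)
        if use_small_emb and layer_id == 0:
            self.lnPre = nn.LayerNorm(dim)  # extra LN right after the embedding
        self.att = nn.Linear(dim, dim)  # placeholder for self-attention
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        if self.use_small_emb and self.layer_id == 0:
            x = self.lnPre(x)  # LN(SmallInit(Emb))
        x = x + self.att(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x

class TinyModel(nn.Module):
    """Hypothetical wrapper: embedding followed by a stack of blocks."""
    def __init__(self, vocab_size=100, dim=32, n_layer=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.blocks = nn.ModuleList([Block(dim, i) for i in range(n_layer)])
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Embedding):
            nn.init.uniform_(module.weight, a=-1e-4, b=1e-4)  # SmallInit(Emb)

    def forward(self, idx):
        x = self.emb(idx)
        for block in self.blocks:
            x = block(x)
        return x

torch.manual_seed(0)
model = TinyModel()
idx = torch.randint(0, 100, (2, 8))  # (batch, sequence) of token ids
out = model(idx)
print(out.shape)
```

Only the first block owns `lnPre`, so the extra LayerNorm is applied exactly once, directly on the embedding output.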

You then get improved convergence (especially for BPE models), because the model can quickly escape the tiny initial embedding: a small change after 1 step -> a significant change of direction -> a significant change after LayerNorm.
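
The chain of reasoning above can be checked numerically with a small pure-Python sketch. It rests on assumptions that are not from the original experiments: the "regular" init is taken to be uniform in ±0.02, one optimizer step is modeled as moving every entry by +LR or -LR with a random sign, and LayerNorm is applied without learnable scale or bias.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def layer_norm(v, eps=1e-5):
    """LayerNorm over one vector, without learnable scale or bias."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def one_step(v, lr, rng):
    # Assumption: each entry moves by +lr or -lr (random sign) in one step.
    return [a + lr * rng.choice((-1.0, 1.0)) for a in v]

rng = random.Random(0)
dim, lr = 512, 4e-4
results = {}
for name, scale in (("regular", 0.02), ("small", 1e-4)):
    before = [rng.uniform(-scale, scale) for _ in range(dim)]
    after = one_step(before, lr, rng)
    results[name] = (
        cosine(before, after),                          # direction change of the raw vector
        cosine(layer_norm(before), layer_norm(after)),  # direction change after LayerNorm
    )
    print(f"{name:8s} raw cos = {results[name][0]:.4f}   after-LN cos = {results[name][1]:.4f}")
```

Under these assumptions the regular-init vector keeps a cosine similarity close to 1 with itself after one step, while the small-init vector ends up pointing in a largely different direction, and LayerNorm carries that change through to the rest of the network.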

Loss curve comparison: https://wandb.ai/blinkdl/SmallEmbTest

(the gap between LayerNorm(SmallEmb) and the baseline persists after more training)

Moreover, with SmallInit(Emb) you can train PostLN models directly, without warmup:

if isinstance(module, nn.Embedding):
    nn.init.uniform_(module.weight, a=-1e-4, b=1e-4) # SmallInit(Emb)
...
x = self.ln1(x) # this plays the same role as lnPre in the PreLN code above
x = x + self.att(x)
x = self.ln2(x)
x = x + self.ffn(x)
(note: you should add another LN after the final FFN)
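
A minimal PyTorch sketch of this PostLN variant, including the extra LN after the final FFN, might look as follows. The names `PostLNBlock`, `TinyPostLNModel`, and `ln_out`, the sizes, and the `nn.Linear` standing in for self-attention are hypothetical and chosen only to keep the example short.

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Block following the PostLN ordering shown above (LN before each residual branch input)."""
    def __init__(self, dim):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)
        self.att = nn.Linear(dim, dim)  # placeholder for self-attention (assumption)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        x = self.ln1(x)  # in the first block this acts directly on the tiny embedding
        x = x + self.att(x)
        x = self.ln2(x)
        x = x + self.ffn(x)
        return x

class TinyPostLNModel(nn.Module):
    def __init__(self, vocab_size=100, dim=32, n_layer=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        nn.init.uniform_(self.emb.weight, a=-1e-4, b=1e-4)  # SmallInit(Emb)
        self.blocks = nn.ModuleList([PostLNBlock(dim) for _ in range(n_layer)])
        self.ln_out = nn.LayerNorm(dim)  # the extra LN after the final FFN

    def forward(self, idx):
        x = self.emb(idx)
        for block in self.blocks:
            x = block(x)
        return self.ln_out(x)

torch.manual_seed(0)
model = TinyPostLNModel()
out = model(torch.randint(0, 100, (2, 8)))
print(out.shape)
```

Because each block's output is un-normalized until the next block's `ln1`, the last block needs `ln_out` to play that role.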

Owner

PENG Bo
