PyTorch implementation of Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation.

Last update: Jul 27, 2022

Overview

ALiBi

Quickstart

Clone this repository.

git clone https://github.com/jaketae/alibi.git

Navigate to the cloned directory. The bare-bones ALiBi decoder can then be used as follows:

>>> import torch
>>> from alibi import ALiBiConfig, ALiBiTransformer
>>> config = ALiBiConfig()
>>> model = ALiBiTransformer(config)
>>> x = torch.randn(8, 100, 256)  # (batch, sequence length, d_model)
>>> model(x).shape
torch.Size([8, 100, 256])

By default, the model comes with the following parameters:

ALiBiConfig(
    num_layers=6, 
    d_model=256, 
    num_heads=8, 
    max_len=256, 
    dropout=0.1, 
    causal=True, 
    expansion_factor=1
)

To use an encoder instead of a decoder, set causal=False in the config.
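To make the decoder/encoder distinction concrete, here is a minimal pure-Python sketch of the additive bias matrix in both modes. It is not this repository's code: distance_bias is a hypothetical helper, and the symmetric bias used for the non-causal case is an assumption, since the paper defines ALiBi for causal language models and this README does not say how causal=False is handled internally.

```python
import math


def distance_bias(seq_len, slope, causal=True):
    """Return a seq_len x seq_len list of additive attention biases.

    Entry [i][j] is -slope * |i - j|. With causal=True, keys that come
    after the query (j > i) are set to -inf so that softmax assigns
    them zero weight. The symmetric non-causal form is an assumption.
    """
    bias = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            if causal and j > i:
                row.append(-math.inf)  # future position: masked out
            else:
                row.append(-slope * abs(i - j))  # penalty grows with distance
        bias.append(row)
    return bias


decoder_bias = distance_bias(4, slope=0.5, causal=True)
encoder_bias = distance_bias(4, slope=0.5, causal=False)

# Row 1 is the bias seen by the query at position 1.
print(decoder_bias[1])
print(encoder_bias[1])
```

In the causal case the query at position 1 can only attend to positions 0 and 1; in the non-causal case it also sees positions 2 and 3, penalized by their distance.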

Abstract

Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question remains open: how to achieve extrapolation at inference time to longer sequences than seen during training? We first show that extrapolation can be improved by changing the position representation method, though we find that existing proposals do not allow efficient extrapolation. We introduce a simple and efficient method, Attention with Linear Biases (ALiBi), that allows for extrapolation. ALiBi does not add positional embeddings to the word embeddings; instead, it biases the query-key attention scores with a term that is proportional to their distance. We show that this method allows training a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048, 11% faster and using 11% less memory. ALiBi's inductive bias towards recency allows it to outperform multiple strong position methods on the WikiText-103 benchmark. Finally, we provide analysis of ALiBi to understand why it leads to better performance.
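The mechanism described in the abstract can be sketched in a few lines of plain Python. This is an illustration rather than the repository's implementation: alibi_slopes, softmax, and biased_attention_row are hypothetical helper names, and the slope schedule (a geometric sequence starting at 2^(-8/n) for n heads) follows the paper's description for head counts that are powers of two.

```python
import math


def alibi_slopes(num_heads):
    """Per-head slopes: geometric sequence starting at 2 ** (-8 / num_heads).

    Assumes num_heads is a power of two.
    """
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention_row(scores, slope):
    """Attention weights of the last query over all earlier keys (causal).

    Each raw score is reduced by slope * distance before the softmax.
    """
    q = len(scores) - 1
    return softmax([s - slope * (q - j) for j, s in enumerate(scores)])


slopes = alibi_slopes(8)

# With identical raw scores, the bias alone decides the weights:
# keys closer to the query receive more attention.
weights = biased_attention_row([0.0] * 6, slopes[0])
print(slopes)
print(weights)
```

Because the bias depends only on the distance between query and key, the same function works for any sequence length. No learned position table has to be extended, which is what makes extrapolation to longer inputs possible.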

Citation

@misc{press2021train,
	title        = {Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation},
	author       = {Ofir Press and Noah A. Smith and Mike Lewis},
	year         = 2021,
	eprint       = {2108.12409},
	archiveprefix = {arXiv},
	primaryclass = {cs.CL}
}

Owner

Jake Tae
