Implementation of a Transformer, but completely in Triton

Overview

Transformer in Triton (wip)

Implementation of a Transformer, but completely in Triton. I'm completely new to lower-level neural net code, so this repository will mostly be a learning experience, with the end goal being a vanilla transformer that is faster and more efficient to train.

Install

$ pip install triton-transformer

Usage

import torch
from triton_transformer import Transformer

model = Transformer(
    num_tokens = 256,
    max_seq_len = 1024,
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64
)

x = torch.randint(0, 256, (1, 1024))
mask = torch.ones(1, 1024).bool()

logits = model(x, mask = mask) # (1, 1024, 256)
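For a rough training step, here is a sketch; the use_triton keyword on the forward pass is an assumption, mirroring the layernorm function exercised in the comments below, and the cross-entropy wiring is standard next-token prediction rather than anything this repo prescribes:

import torch
import torch.nn.functional as F
from triton_transformer import Transformer

model = Transformer(
    num_tokens = 256,
    max_seq_len = 1024,
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64
).cuda() # Triton kernels require a GPU

x = torch.randint(0, 256, (1, 1024)).cuda()
mask = torch.ones(1, 1024).bool().cuda()

logits = model(x, mask = mask, use_triton = True) # use_triton is assumed here

# standard next-token objective: predict token t+1 from tokens <= t
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, 256),
    x[:, 1:].reshape(-1)
)
loss.backward()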

Citations

@article{Tillet2019TritonAI,
    title   = {Triton: an intermediate language and compiler for tiled neural network computations},
    author  = {Philippe Tillet and H. Kung and D. Cox},
    journal = {Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages},
    year    = {2019}
}

@misc{vaswani2017attention,
    title   = {Attention Is All You Need}, 
    author  = {Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin},
    year    = {2017},
    eprint  = {1706.03762},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
Comments
  • Question concerning PyTorch build

    Hello. I find your project very interesting and I have seen your comparison between PyTorch and Triton implementations.

    However, I am curious whether your PyTorch environment is a source build optimized for your machine or a pip/conda install.

    A source build typically runs faster, so if a conda install is being used for the comparison, the difference in speed may simply be due to Triton optimizing CUDA kernels for the run environment.

    Thank you again for your interesting project.
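    For anyone reproducing the comparison, a quick way to record which PyTorch build is in play (a sketch using standard PyTorch introspection; nothing here is specific to this repo):

    import torch

    print(torch.__version__)              # pip wheels carry a +cuXXX suffix
    print(torch.version.cuda)             # CUDA toolkit the binary was built against
    print(torch.backends.cudnn.version()) # cuDNN version in use
    print(torch.__config__.show())        # full build flags (compiler, BLAS, etc.)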

    opened by veritas9872 13
  • _layernorm implementation forward result not equal F.layer_norm

    I tried out your triton-transformer and tested the layernorm module on its own. Strangely, the forward results differ while the backward results match.

    code:

    from triton_transformer.layernorm import layernorm
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    x = torch.randn(2, 5).cuda()
    x.requires_grad_(True)
    dy = .1 * torch.randn_like(x).cuda()
    dim = 5
    norm = nn.LayerNorm(dim).cuda()

    y1 = layernorm(x, norm.weight, norm.bias, use_triton = True)
    y2 = layernorm(x, norm.weight, norm.bias, use_triton = False)
    print(y1, y2)
    print(torch.allclose(y1, y2))

    y1.backward(dy, retain_graph = True)
    dx_y1 = x.grad.clone()

    x.grad = None

    y2.backward(dy, retain_graph = True)
    dx_y2 = x.grad.clone()
    print(dx_y1, dx_y2)
    print(torch.allclose(dx_y1, dx_y2))

    result:

    tensor([[ 0.9492, -0.0021, -0.9797,  0.4449, -0.4123],
            [-0.7624,  0.4399,  0.7299, -0.3091, -0.0983]], device='cuda:0', grad_fn=<_layernormBackward>)
    tensor([[ 1.4217, -0.0031, -1.4674,  0.6663, -0.6175],
            [-1.4342,  0.8276,  1.3732, -0.5815, -0.1850]], device='cuda:0', grad_fn=<...>)
    False

    tensor([[-0.0706,  0.0288, -0.0813,  0.0446,  0.0785],
            [ 0.0218, -0.0152,  0.0141, -0.0522,  0.0315]], device='cuda:0')
    tensor([[-0.0706,  0.0288, -0.0813,  0.0446,  0.0785],
            [ 0.0218, -0.0152,  0.0141, -0.0522,  0.0315]], device='cuda:0')
    True

    opened by Tengxu-Sun 1
  • Current state of benchmarking & contributing?

    Hey @lucidrains - hope you're doing well! I have some time to hack over the next couple of weeks, and just wanted to get a sense of:

    • The current state of benchmarking (which Triton kernels provide how much lift, and the aggregate lift over a vanilla Transformer implementation); a minimal timing harness is sketched below
    • Whether there's anything I could help with, especially as I learn Triton!
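    A minimal CUDA-event timing harness of the sort such comparisons call for (a sketch; the use_triton flag on the model's forward is an assumption carried over from the layernorm API above):

    import torch

    def bench(fn, warmup = 10, iters = 100):
        # CUDA events measure device-side time, excluding host overhead
        start = torch.cuda.Event(enable_timing = True)
        end = torch.cuda.Event(enable_timing = True)
        for _ in range(warmup):
            fn()
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters # milliseconds per call

    # e.g. with the model, x, mask from the Usage section (moved to GPU):
    # ms_torch  = bench(lambda: model(x, mask = mask))
    # ms_triton = bench(lambda: model(x, mask = mask, use_triton = True)) # assumed flag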
    opened by siddk 0
  • Official layer norm added

    Hi @lucidrains, layer norm was just added to the Triton examples: https://github.com/openai/triton/commit/d4baad426db72b83c5222e1c83c929c1860cae54. I tested it; it's twice as fast as Torch, and often faster than Apex.

    I'm looking forward to your implementation of attention. So far the Torch implementation is the fastest on my data, at 12.3 / 14.5 (forward / backward), versus 17.3 / 23.0 for the other Triton implementation, in DeepSpeed.

    opened by olegklimov 2
Releases: 0.1.1
Owner
Phil Wang
Working with Attention. It's all we need