DeepViT

This repo is the official implementation of "DeepViT: Towards Deeper Vision Transformer". It is based on the timm library (https://github.com/rwightman/pytorch-image-models) by Ross Wightman.

Introduction

DeepViT (Deep Vision Transformer) was first described in an arXiv preprint (arXiv:2103.11886), which observes the attention collapse phenomenon when training deep vision transformers. From the abstract:

In this paper, we show that, unlike convolutional neural networks (CNNs) that can be improved by stacking more convolutional layers, the performance of ViTs saturates fast when scaled to be deeper. More specifically, we empirically observe that such scaling difficulty is caused by the attention collapse issue: as the transformer goes deeper, the attention maps gradually become similar and even much the same after certain layers. In other words, the feature maps tend to be identical in the top layers of deep ViT models. This fact demonstrates that in deeper layers of ViTs, the self-attention mechanism fails to learn effective concepts for representation learning and hinders the model from getting the expected performance gain.

Based on the above observation, we propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity at different layers with negligible computation and memory cost. The proposed method makes it feasible to train deeper ViT models with consistent performance improvements via minor modifications to existing ViT models. Notably, when training a deep ViT model with 32 transformer blocks, the Top-1 classification accuracy can be improved by 1.6% on ImageNet.
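To make the idea of re-generating attention maps more concrete, here is a minimal NumPy sketch of one way it can be written: the per-head softmax attention maps are mixed across heads by a learnable matrix, normalised, and then applied to the values. This is an illustration under stated assumptions, not the code in this repo. The names `re_attention` and `theta`, the `normalize` flag, and the per-head standardisation used as the normalisation step are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def re_attention(q, k, v, theta, normalize=True, eps=1e-5):
    """Sketch of attention with cross-head mixing of the attention maps.

    q, k, v: arrays of shape (heads, tokens, head_dim)
    theta:   array of shape (heads, heads); a learnable matrix in a real
             model, passed in explicitly here (assumption for illustration)
    """
    head_dim = q.shape[-1]
    # Standard scaled dot-product attention maps, shape (heads, tokens, tokens).
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim))
    # Re-generate each head's map as a weighted sum of all heads' maps:
    # mixed[g] = sum_h theta[h, g] * attn[h]
    mixed = np.einsum("hg,hnm->gnm", theta, attn)
    if normalize:
        # Stand-in normalisation (assumption): standardise each mixed map.
        mean = mixed.mean(axis=(1, 2), keepdims=True)
        std = mixed.std(axis=(1, 2), keepdims=True)
        mixed = (mixed - mean) / (std + eps)
    # Apply the re-generated maps to the values, shape (heads, tokens, head_dim).
    return mixed @ v
```

With `theta` set to the identity matrix and `normalize=False`, the function reduces to ordinary multi-head self-attention, which is a convenient sanity check. The extra cost over standard attention is one heads-by-heads mixing step per block.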

DeepViT Models

| Model | Re-attention | Top-1 Acc (%) | #Params | #Similar Blocks | Checkpoint |
| --- | --- | --- | --- | --- | --- |
| ViT-16 | NA | 78.88 | 24.5M | 5 | coming soon |
| DeepViT-16 | FC | 79.10 | 24.5M | 0 | coming soon |
| ViT-24 | NA | 79.35 | 36.3M | 11 | coming soon |
| DeepViT-24 | FC | 79.99 | 36.3M | 0 | coming soon |
| ViT-32 | NA | 79.27 | 48.1M | 15 | coming soon |
| DeepViT_t-32 | FC | 80.90 | 48.1M | 0 | coming soon |
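The #Similar Blocks column presumably counts blocks whose attention maps are close to those of a neighbouring block. A rough sketch of how such a count could be computed from collected attention maps follows. The cosine-similarity measure over each token's attention vector, the comparison with the immediately preceding block, and the `threshold` value are assumptions for illustration and may differ from the criterion used to produce the table.

```python
import numpy as np


def attention_similarity(a, b, eps=1e-12):
    """Mean cosine similarity between two sets of attention maps.

    a, b: arrays of shape (heads, tokens, tokens). Each token's attention
    vector in `a` is compared with the matching vector in `b`.
    """
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return float((num / (den + eps)).mean())


def count_similar_blocks(attn_maps, threshold=0.8):
    """Count blocks whose maps are similar to the previous block's maps.

    attn_maps: list with one (heads, tokens, tokens) array per block.
    threshold: hypothetical cut-off above which two blocks count as similar.
    """
    return sum(
        attention_similarity(prev, cur) > threshold
        for prev, cur in zip(attn_maps, attn_maps[1:])
    )
```

For example, given a list `maps` holding one attention array per transformer block for a single image, `count_similar_blocks(maps)` returns how many blocks nearly repeat the attention pattern of the block before them.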

Citing DeepViT

@article{zhou2021deepvit,
  title={DeepViT: Towards Deeper Vision Transformer},
  author={Zhou, Daquan and Kang, Bingyi and Jin, Xiaojie and Yang, Linjie and Lian, Xiaochen and Hou, Qibin and Feng, Jiashi},
  journal={arXiv preprint arXiv:2103.11886},
  year={2021}
}
