Vision transformers (ViTs) have found only limited practical use in processing images

Last update: Sep 10, 2022

Overview

CXV

Convolutional Xformers for Vision

Vision transformers (ViTs) have found only limited practical use in processing images, in spite of their state-of-the-art accuracy on certain benchmarks. The reasons for their limited use include their need for larger training datasets and more computational resources than convolutional neural networks (CNNs), owing to the quadratic complexity of their self-attention mechanism. We propose a linear attention-convolution hybrid architecture, Convolutional X-formers for Vision (CXV), to overcome these limitations. We replace the quadratic attention with linear attention mechanisms, such as Performer, Nyströmformer, and Linear Transformer, to reduce GPU usage. An inductive prior for image data is provided by convolutional sub-layers, thereby eliminating the need for the class token and positional embeddings used by ViTs. CXV outperforms other architectures, including token mixers (e.g., ConvMixer, FNet, and MLP Mixer), transformer models (e.g., ViT, CCT, CvT, and hybrid Xformers), and ResNets, for image classification in scenarios with limited data and GPU resources.

Models:

CNV - Convolutional Nyströmformer for Vision
CPV - Convolutional Performer for Vision
CLTV - Convolutional Linear Transformer for Vision
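
The overview attributes the cost of ViTs to quadratic self-attention and replaces it with linear attention. The following is a minimal NumPy sketch of one such mechanism, kernelized attention in the style of the Linear Transformer with the feature map elu(x) + 1. It is not the CXV implementation: the function names, the single-head layout, and the tensor shapes are assumptions chosen for illustration. The point it shows is that reordering the matrix products avoids ever building the n-by-n attention matrix.

```python
import numpy as np


def elu_feature_map(x):
    # phi(x) = elu(x) + 1: strictly positive features (assumed feature map).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def softmax_attention(q, k, v):
    # Standard attention: materializes an (n, n) score matrix,
    # so memory and time grow quadratically with the number of tokens n.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def linear_attention(q, k, v, eps=1e-6):
    # Kernelized attention computed as phi(Q) @ (phi(K)^T @ V).
    # Only (d, d_v) and (d,) summaries are stored, so cost is linear in n.
    q_f, k_f = elu_feature_map(q), elu_feature_map(k)
    kv = k_f.T @ v              # (d, d_v) summary of keys and values
    z = k_f.sum(axis=0)         # (d,) normalizer summary
    numer = q_f @ kv            # (n, d_v)
    denom = q_f @ z + eps       # (n,)
    return numer / denom[:, None]


def linear_attention_quadratic_reference(q, k, v, eps=1e-6):
    # Same quantity computed the slow way, via the explicit (n, n) matrix.
    # Included only to check that the reordering does not change the result.
    q_f, k_f = elu_feature_map(q), elu_feature_map(k)
    sim = q_f @ k_f.T
    return (sim @ v) / (sim.sum(axis=-1, keepdims=True) + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, d_v = 64, 16, 16      # hypothetical token count and head sizes
    q = rng.normal(size=(n, d))
    k = rng.normal(size=(n, d))
    v = rng.normal(size=(n, d_v))
    fast = linear_attention(q, k, v)
    slow = linear_attention_quadratic_reference(q, k, v)
    print("max abs difference:", np.abs(fast - slow).max())
```

Note that linear attention is a different similarity function from softmax attention, not an exact replacement for it, so `linear_attention` and `softmax_attention` generally produce different outputs. Performer and Nyströmformer approximate softmax attention by other means that are not shown here.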

Owner

Cloudwalker

Axel - 3D-printed robotic hands controlled with a Raspberry Pi and Arduino combo

Implementation of GeoDiff: a Geometric Diffusion Model for Molecular Conformation Generation (ICLR 2022).

Demonstrates iterative FGSM on Apple's NeuralHash model.

In this project we combine techniques from neural voice cloning and musical instrument synthesis to achieve good results from as little as 16 seconds of target data.

Neural Message Passing for Computer Vision

TEA: A Sequential Recommendation Framework via Temporally Evolving Aggregations

Official source code of the paper UniFuse: Unidirectional Fusion for 360$^\circ$ Panorama Depth Estimation

When BERT Plays the Lottery, All Tickets Are Winning

AI-based, context-driven network device ranking

Apply AnimeGAN-v2 across frames of a video clip

STEAL - Learning Semantic Boundaries from Noisy Annotations (CVPR 2019)

Dataset and codebase for NeurIPS 2021 paper: Exploring Forensic Dental Identification with Deep Learning

HSC4D: Human-centered 4D Scene Capture in Large-scale Indoor-outdoor Space Using Wearable IMUs and LiDAR. CVPR 2022

DimReductionClustering - Dimensionality Reduction + Clustering + Unsupervised Score Metrics

JFB: Jacobian-Free Backpropagation for Implicit Models

Source code for the paper "PLOME: Pre-training with Misspelled Knowledge for Chinese Spelling Correction" in ACL2021

A 10000+ hours dataset for Chinese speech recognition

LF-YOLO (Lighter and Faster YOLO) is used to detect defects in X-ray weld images.

ColBERT: Contextualized Late Interaction over BERT (SIGIR'20)

Privacy-Preserving Portrait Matting [ACM MM-21]