Official code for "Focal Self-attention for Local-Global Interactions in Vision Transformers"

Overview

Focal Transformer

PWC PWC PWC PWC PWC PWC

This is the official implementation of our Focal Transformer -- "Focal Self-attention for Local-Global Interactions in Vision Transformers", by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao.

Introduction

focal-transformer-teaser

Our Focal Transfomer introduced a new self-attention mechanism called focal self-attention for vision transformers. In this new mechanism, each token attends the closest surrounding tokens at fine granularity but the tokens far away at coarse granularity, and thus can capture both short- and long-range visual dependencies efficiently and effectively.

With our Focal Transformers, we achieved superior performance over the state-of-the-art vision Transformers on a range of public benchmarks. In particular, our Focal Transformer models with a moderate size of 51.1M and a larger size of 89.8M achieve 83.6 and 84.0 Top-1 accuracy, respectively, on ImageNet classification at 224x224 resolution. Using Focal Transformers as the backbones, we obtain consistent and substantial improvements over the current state-of-the-art methods for 6 different object detection methods trained with standard 1x and 3x schedules. Our largest Focal Transformer yields 58.7/58.9 box mAPs and 50.9/51.3 mask mAPs on COCO mini-val/test-dev, and 55.4 mIoU on ADE20K for semantic segmentation.

Benchmarking

Image Classification on ImageNet-1K

Model Pretrain Use Conv Resolution [email protected] [email protected] #params FLOPs Checkpoint Config
Focal-T IN-1K No 224 82.2 95.9 28.9M 4.9G download yaml
Focal-T IN-1K Yes 224 82.7 96.1 30.8M 4.9G download yaml
Focal-S IN-1K No 224 83.6 96.2 51.1M 9.4G download yaml
Focal-B IN-1K No 224 84.0 96.5 89.8M 16.4G download yaml

Object Detection and Instance Segmentation on COCO

Mask R-CNN

Backbone Pretrain Lr Schd #params FLOPs box mAP mask mAP
Focal-T ImageNet-1K 1x 49M 291G 44.8 41.0
Focal-T ImageNet-1K 3x 49M 291G 47.2 42.7
Focal-S ImageNet-1K 1x 71M 401G 47.4 42.8
Focal-S ImageNet-1K 3x 71M 401G 48.8 43.8
Focal-B ImageNet-1K 1x 110M 533G 47.8 43.2
Focal-B ImageNet-1K 3x 110M 533G 49.0 43.7

RetinaNet

Backbone Pretrain Lr Schd #params FLOPs box mAP
Focal-T ImageNet-1K 1x 39M 265G 43.7
Focal-T ImageNet-1K 3x 39M 265G 45.5
Focal-S ImageNet-1K 1x 62M 367G 45.6
Focal-S ImageNet-1K 3x 62M 367G 47.3
Focal-B ImageNet-1K 1x 101M 514G 46.3
Focal-B ImageNet-1K 3x 101M 514G 46.9

Other detection methods

Backbone Pretrain Method Lr Schd #params FLOPs box mAP
Focal-T ImageNet-1K Cascade Mask R-CNN 3x 87M 770G 51.5
Focal-T ImageNet-1K ATSS 3x 37M 239G 49.5
Focal-T ImageNet-1K RepPointsV2 3x 45M 491G 51.2
Focal-T ImageNet-1K Sparse R-CNN 3x 111M 196G 49.0

Semantic Segmentation on ADE20K

Backbone Pretrain Method Resolution Iters #params FLOPs mIoU mIoU (MS)
Focal-T ImageNet-1K UPerNet 512x512 160k 62M 998G 45.8 47.0
Focal-S ImageNet-1K UPerNet 512x512 160k 85M 1130G 48.0 50.0
Focal-B ImageNet-1K UPerNet 512x512 160k 126M 1354G 49.0 50.5
Focal-L ImageNet-22K UPerNet 640x640 160k 240M 3376G 54.0 55.4

Getting Started

Citation

If you find this repo useful to your project, please consider to cite it with following bib:

@misc{yang2021focal,
    title={Focal Self-attention for Local-Global Interactions in Vision Transformers}, 
    author={Jianwei Yang and Chunyuan Li and Pengchuan Zhang and Xiyang Dai and Bin Xiao and Lu Yuan and Jianfeng Gao},
    year={2021},
    eprint={2107.00641},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

Acknowledgement

Our codebase is built based on Swin-Transformer. We thank the authors for the nicely organized code!

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

Owner
Microsoft
Open source projects and samples from Microsoft
Microsoft
Simple streamlit app to demonstrate HERE Tour Planning

Table of Contents About the Project Built With Getting Started Prerequisites Installation Usage Roadmap Contributing License Acknowledgements About Th

Amol 8 Sep 05, 2022
Only a Matter of Style: Age Transformation Using a Style-Based Regression Model

Only a Matter of Style: Age Transformation Using a Style-Based Regression Model The task of age transformation illustrates the change of an individual

444 Dec 30, 2022
Optimize Trading Strategies Using Freqtrade

Optimize trading strategy using Freqtrade Short demo on building, testing and optimizing a trading strategy using Freqtrade. The DevBootstrap YouTube

DevBootstrap 139 Jan 01, 2023
Implementation for ACProp ( Momentum centering and asynchronous update for adaptive gradient methdos, NeurIPS 2021)

This repository contains code to reproduce results for submission NeurIPS 2021, "Momentum Centering and Asynchronous Update for Adaptive Gradient Meth

Juntang Zhuang 15 Jun 11, 2022
Multi-Agent Reinforcement Learning (MARL) method to learn scalable control polices for multi-agent target tracking.

scalableMARL Scalable Reinforcement Learning Policies for Multi-Agent Control CD. Hsu, H. Jeong, GJ. Pappas, P. Chaudhari. "Scalable Reinforcement Lea

Christopher Hsu 17 Nov 17, 2022
Large scale embeddings on a single machine.

Marius Marius is a system under active development for training embeddings for large-scale graphs on a single machine. Training on large scale graphs

Marius 107 Jan 03, 2023
Evolutionary Scale Modeling (esm): Pretrained language models for proteins

Evolutionary Scale Modeling This repository contains code and pre-trained weights for Transformer protein language models from Facebook AI Research, i

Meta Research 1.6k Jan 09, 2023
A simple python program that can be used to implement user authentication tokens into your program...

token-generator A simple python module that can be used by developers to implement user authentication tokens into your program... code examples creat

octo 6 Apr 18, 2022
This repository is for DSA and CP scripts for reference.

dsa-script-collections This Repo is the collection of DSA and CP scripts for reference. Contents Python Bubble Sort Insertion Sort Merge Sort Quick So

Aditya Kumar Pandey 9 Nov 22, 2022
This is the official code for the paper "Ad2Attack: Adaptive Adversarial Attack for Real-Time UAV Tracking".

Ad^2Attack:Adaptive Adversarial Attack on Real-Time UAV Tracking Demo video 📹 Our video on bilibili demonstrates the test results of Ad^2Attack on se

Intelligent Vision for Robotics in Complex Environment 10 Nov 07, 2022
CTF Challenge for CSAW Finals 2021

Terminal Velocity Misc CTF Challenge for CSAW Finals 2021 This is a challenge I've had in mind for almost 15 years and never got around to building un

Jordan 6 Jul 30, 2022
OpenFace – a state-of-the art tool intended for facial landmark detection, head pose estimation, facial action unit recognition, and eye-gaze estimation.

OpenFace 2.2.0: a facial behavior analysis toolkit Over the past few years, there has been an increased interest in automatic facial behavior analysis

Tadas Baltrusaitis 5.8k Dec 31, 2022
DziriBERT: a Pre-trained Language Model for the Algerian Dialect

DziriBERT DziriBERT is the first Transformer-based Language Model that has been pre-trained specifically for the Algerian Dialect. It handles Algerian

117 Jan 07, 2023
A forwarding MPI implementation that can use any other MPI implementation via an MPI ABI

MPItrampoline MPI wrapper library: MPI trampoline library: MPI integration tests: MPI is the de-facto standard for inter-node communication on HPC sys

Erik Schnetter 31 Dec 22, 2022
LQM - Improving Object Detection by Estimating Bounding Box Quality Accurately

Improving Object Detection by Estimating Bounding Box Quality Accurately Abstract Object detection aims to locate and classify object instances in ima

IM Lab., POSTECH 0 Sep 28, 2022
A denoising diffusion probabilistic model (DDPM) tailored for conditional generation of protein distograms

Denoising Diffusion Probabilistic Model for Proteins Implementation of Denoising Diffusion Probabilistic Model in Pytorch. It is a new approach to gen

Phil Wang 108 Nov 23, 2022
Causal Imitative Model for Autonomous Driving

Causal Imitative Model for Autonomous Driving Mohammad Reza Samsami, Mohammadhossein Bahari, Saber Salehkaleybar, Alexandre Alahi. arXiv 2021. [Projec

VITA lab at EPFL 8 Oct 04, 2022
商品推荐系统

商品top50推荐系统 问题建模 本项目的数据集给出了15万左右的用户以及12万左右的商品, 以及对应的经过脱敏处理的用户特征和经过预处理的商品特征,旨在为用户推荐50个其可能购买的商品。 推荐系统架构方案 本项目采用传统的召回+排序的方案。

107 Dec 29, 2022