Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

Last update: Jan 08, 2023

Overview

OFA

OFA is a unified multimodal pretrained model that brings modalities (i.e., cross-modality, vision, language) and tasks (e.g., image generation, visual grounding, image captioning, image classification, text generation) into a simple sequence-to-sequence learning framework. For more information, please refer to our paper: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework.

News

2022.2.13: Released the demo of image captioning. Have fun!
2022.2.11: Released the Colab notebook for image captioning. Enjoy!
2022.2.11: Released the pretrained checkpoint of OFA-Large and the complete (2-staged) finetuning code for image captioning.
2022.2.10: Released the inference code & finetuned checkpoint for image captioning, which can reproduce the results on the COCO Karpathy test split (149.6 CIDEr).

TODO

To release finetuning and inference code for multimodal downstream tasks soon, including image captioning, VQA, text-to-image generation, SNLI-VE, referring expression comprehension, etc.
To release code for pretraining soon.

Approach

Requirements

python 3.7.4
pytorch 1.8.1
torchvision 0.9.1
Java 1.8 (for COCO evaluation)
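
A quick way to compare an existing environment against the versions listed above is sketched below. Both function names are hypothetical helpers and are not part of the OFA repository. The check is a simple prefix match, so a build string such as 1.8.1+cu111 counts as 1.8.1, and it assumes that `python` and `java` on the PATH are the interpreters that will be used for training and evaluation.

```shell
# Hypothetical helper, not part of the OFA repository.
# Prints OK when the installed version string starts with the expected one.
check_version() {
    name=$1
    expected=$2
    actual=$3
    case "$actual" in
        "$expected"*) echo "$name: OK ($actual)" ;;
        *) echo "$name: expected $expected, found ${actual:-nothing}" ;;
    esac
}

# Hypothetical helper: probes the installed versions and checks each one.
# A probe that fails (missing interpreter or package) yields an empty string.
check_ofa_requirements() {
    check_version python 3.7.4 "$(python -c 'import platform; print(platform.python_version())' 2>/dev/null)"
    check_version pytorch 1.8.1 "$(python -c 'import torch; print(torch.__version__)' 2>/dev/null)"
    check_version torchvision 0.9.1 "$(python -c 'import torchvision; print(torchvision.__version__)' 2>/dev/null)"
    check_version java 1.8 "$(java -version 2>&1 | head -n 1 | cut -s -d '"' -f 2)"
}
```

After sourcing the snippet, calling `check_ofa_requirements` prints one line per requirement. A mismatch is only reported, not treated as an error, since nearby versions may or may not work.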

Installation

git clone https://github.com/OFA-Sys/OFA
cd OFA
pip install -r requirements.txt

Datasets and Checkpoints

See datasets.md and checkpoints.md.

Pretraining

To be released soon :)

Finetuning & Inference

Below we provide methods for finetuning and inference on different downstream tasks.

Caption

Download the data and checkpoint files (see datasets.md and checkpoints.md) and put them in the correct directory
Train

cd run_scripts/caption
nohup sh train_caption_stage1.sh &  # stage1, train with cross-entropy loss
nohup sh train_caption_stage2.sh &  # stage2, load the best ckpt of stage1 and train with CIDEr optimization
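
Because stage 2 loads the best checkpoint from stage 1, it presumably should start only after stage 1 has finished. The sketch below shows one way to chain the two scripts so that the second runs only if the first succeeds. `run_stages` is a hypothetical wrapper, not part of the OFA repository, and the log file names are an assumption. Any checkpoint path configured inside the stage 2 script is outside the scope of this sketch.

```shell
# Hypothetical wrapper, not part of the OFA repository.
# Runs each script in order, writes its output to <script name>.log,
# and stops at the first script that exits with a non-zero status.
run_stages() {
    for script in "$@"; do
        log="${script%.sh}.log"
        echo "running $script (log: $log)"
        if ! sh "$script" > "$log" 2>&1; then
            echo "$script failed, see $log" >&2
            return 1
        fi
    done
}
```

From run_scripts/caption, `run_stages train_caption_stage1.sh train_caption_stage2.sh` runs both stages back to back in the foreground. Unlike the nohup commands above, the function does not by itself protect the job from hangups, so it is best started inside a persistent session.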

Inference

cd run_scripts/caption ; sh evaluate_caption.sh  # inference & evaluate

Gallery

Below we provide examples of OFA in text-to-image generation and open-ended VQA. We also demonstrate its performance on an unseen task (Grounded QA) and in an unseen domain (Visual Grounding on images from unseen domains).

Text-to-Image Generation (normal query)

Text-to-Image Generation (counterfactual query)

Open-Ended VQA

Grounded QA (unseen task)

Visual Grounding (unseen domain)

Citation

Please cite our paper if you find it helpful :)

@article{wang2022OFA,
  title={Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework},
  author={Wang, Peng and Yang, An and Men, Rui and Lin, Junyang and Bai, Shuai and Li, Zhikang and Ma, Jianxin and Zhou, Chang and Zhou, Jingren and Yang, Hongxia},
  journal={arXiv e-prints},
  pages={arXiv--2202},
  year={2022}
}

Related Codebase

Fairseq

License

Apache-2.0

Owner

OFA Sys
