Optimized code based on M2 for faster image captioning training

Last update: Dec 16, 2022

Overview

Transformer Captioning

This repository contains code for Transformer-based image captioning. Building on meshed-memory-transformer, we further optimize the code for faster training without any decline in accuracy.

Specifically, we optimize the following aspects:

Vocab: we pre-tokenize the dataset, so there is no ' ' (space token) in the vocab or in generated sentences.
Dataloader: we optimize the speed of the dataloader and achieve a 2x~6x speed-up.
BeamSearch:
- Make ops parallel in beam_search.py (e.g. loop gather -> parallel gather)
- Use cheaper ops (e.g. torch.sort -> torch.topk)
- Use faster, specialized functions instead of general ones
Self-critical Training:
- Compute CIDEr by index instead of raw text
- Cache the tf-idf vectors of the ground truths (gts) instead of recomputing them every time
- Drop on-the-fly tokenization, since it is too slow
Contiguous model parameters
Other details...
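
The two CIDEr-related points above (scoring on token indices and caching the reference tf-idf vectors) can be illustrated with a small sketch. This is not the code from this repository and not the full CIDEr-D metric: it uses one combined n-gram vector instead of one per order, and has no length penalty or clipping. The names `count_ngrams` and `CachedReferenceScorer` are hypothetical. It only shows the idea that reference vectors are built once from integer ids and reused for every sampled caption.

```python
import math
from collections import Counter


def count_ngrams(token_ids, max_n=4):
    """Count all n-grams of orders 1..max_n, keyed by tuples of token ids."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(token_ids) - n + 1):
            counts[tuple(token_ids[i:i + n])] += 1
    return counts


class CachedReferenceScorer:
    """Simplified CIDEr-like scorer (assumption: not the repository's implementation)."""

    def __init__(self, refs_by_image, max_n=4):
        self.max_n = max_n
        # Document frequency: in how many images' references each n-gram occurs.
        self.doc_freq = Counter()
        for refs in refs_by_image.values():
            seen = set()
            for ref in refs:
                seen.update(count_ngrams(ref, max_n))
            self.doc_freq.update(seen)
        self.log_num_images = math.log(len(refs_by_image))
        # Reference vectors are computed once here and reused for every candidate.
        self.ref_cache = {
            image_id: [self._tfidf(ref) for ref in refs]
            for image_id, refs in refs_by_image.items()
        }

    def _tfidf(self, token_ids):
        """Return (sparse tf-idf vector, its L2 norm) for one token-id sequence."""
        vec = {}
        for gram, tf in count_ngrams(token_ids, self.max_n).items():
            idf = self.log_num_images - math.log(max(self.doc_freq.get(gram, 0), 1))
            vec[gram] = tf * idf
        norm = math.sqrt(sum(v * v for v in vec.values()))
        return vec, norm

    def score(self, image_id, candidate_ids):
        """Mean cosine similarity between the candidate and the cached references."""
        cand_vec, cand_norm = self._tfidf(candidate_ids)
        sims = []
        for ref_vec, ref_norm in self.ref_cache[image_id]:
            if cand_norm == 0.0 or ref_norm == 0.0:
                sims.append(0.0)
                continue
            dot = sum(w * ref_vec.get(gram, 0.0) for gram, w in cand_vec.items())
            sims.append(dot / (cand_norm * ref_norm))
        return sum(sims) / len(sims)


# Toy data: token ids stand in for pre-tokenized, vocabulary-indexed captions.
refs = {1: [[1, 2, 3, 4]], 2: [[5, 6, 7, 8]]}
scorer = CachedReferenceScorer(refs)
same = scorer.score(1, [1, 2, 3, 4])   # candidate identical to the reference
other = scorer.score(1, [5, 6, 7, 8])  # candidate sharing no n-grams with it
```

Because only the candidate vector is built inside `score`, the per-step cost during self-critical training no longer includes tokenizing or re-vectorizing the ground-truth captions.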

Speed-up results (1 GeForce 1080Ti GPU, num_workers=8, batch_size=50 (XE) / 100 (SCST))

Training its/s	Original	Optimized	Accelerate
XE	7.5	10.3	138%
SCST	0.6	1.3	204%

Dataloader its/s	Original XE	Optimized XE	Accelerate	Original SCST	Optimized SCST	Accelerate
batch size=50	12.5	52.5	320%	29.3	90.7	209%
batch size=100	5.5	33.5	510%	22.3	88.5	297%
batch size=150	3.7	25.4	580%	13.4	71.8	435%
batch size=200	2.7	20.1	650%	11.4	54.1	376%

Things I have tried that were not useful

TorchText n-gram counter: slower than the original one.
nn.MultiheadAttention: slightly faster than the original one.
GPU CIDEr: very slow.
BeamableMM: slower than the original one.

Environment setup

Clone the repository and create the m2release conda environment using the environment.yml file:

conda env create -f environment.yml
conda activate m2release

Then download the spaCy data by executing the following command:

python -m spacy download en

Note: Python 3.6 is required to run our code.

Data preparation

To run the code, annotations and detection features for the COCO dataset are needed. Please download the annotations file annotations.zip and extract it.

Detection features are computed with the code provided by [1]. To reproduce our results, please download the COCO features file coco_detections.hdf5 (~53.5 GB), in which the detections of each image are stored under the <image_id>_features key. <image_id> is the id of each COCO image, without leading zeros (e.g. the <image_id> for COCO_val2014_000000037209.jpg is 37209), and each value should be a (N, 2048) tensor, where N is the number of detections.
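
The key layout described above can be sketched as follows. The helper names (`image_id_from_filename`, `features_key`, `load_detections`) are hypothetical and not part of this repository. The sketch assumes file names that end in an underscore, a zero-padded id and `.jpg`, as in the example above, and reads the file with the third-party h5py package.

```python
import re


def image_id_from_filename(filename):
    """Extract the numeric COCO image id, dropping leading zeros."""
    match = re.search(r"_(\d+)\.jpg$", filename)
    if match is None:
        raise ValueError("unexpected COCO file name: %s" % filename)
    return int(match.group(1))


def features_key(image_id):
    """Build the HDF5 key under which the detections of an image are stored."""
    return "%d_features" % image_id


def load_detections(features_path, image_id):
    """Read the (N, 2048) detection features of one image as a NumPy array."""
    import h5py  # third-party; imported lazily so the helpers above work without it

    with h5py.File(features_path, "r") as f:
        return f[features_key(image_id)][()]


key = features_key(image_id_from_filename("COCO_val2014_000000037209.jpg"))
```

A quick check such as `load_detections(path, 37209).shape[1] == 2048` can confirm that a downloaded or self-generated features file follows the expected layout.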

Remember to pre-tokenize the dataset:

python pre_tokenize.py

Evaluation

Run python test.py using the following arguments:

Argument	Possible values
`--batch_size`	Batch size (default: 10)
`--workers`	Number of workers (default: 0)
`--features_path`	Path to detection features file
`--annotation_folder`	Path to folder with COCO annotations

Training procedure

Run python train.py using the following arguments:

Argument	Possible values
`--exp_name`	Experiment name
`--batch_size`	Batch size (default: 10)
`--workers`	Number of workers (default: 0)
`--head`	Number of heads (default: 8)
`--resume_last`	If used, the training will be resumed from the last checkpoint.
`--resume_best`	If used, the training will be resumed from the best checkpoint.
`--features_path`	Path to detection features file
`--annotation_folder`	Path to folder with COCO annotations
`--logs_folder`	Path to folder for TensorBoard logs (default: "tensorboard_logs")
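
As a reading aid for the table above, here is a minimal argparse sketch that declares the same arguments with the documented defaults. It is an assumption about how such a command line could be declared, not a copy of train.py. The function name `build_train_parser` is hypothetical, and arguments that are not listed in the table are left out.

```python
import argparse


def build_train_parser():
    """Hypothetical parser mirroring the documented train.py arguments."""
    parser = argparse.ArgumentParser(description="Sketch of the train.py command line")
    parser.add_argument("--exp_name", type=str, help="Experiment name")
    parser.add_argument("--batch_size", type=int, default=10, help="Batch size")
    parser.add_argument("--workers", type=int, default=0, help="Number of workers")
    parser.add_argument("--head", type=int, default=8, help="Number of heads")
    # "If used" flags take no value, so they are modeled as on/off switches.
    parser.add_argument("--resume_last", action="store_true",
                        help="Resume from the last checkpoint")
    parser.add_argument("--resume_best", action="store_true",
                        help="Resume from the best checkpoint")
    parser.add_argument("--features_path", type=str,
                        help="Path to detection features file")
    parser.add_argument("--annotation_folder", type=str,
                        help="Path to folder with COCO annotations")
    parser.add_argument("--logs_folder", type=str, default="tensorboard_logs",
                        help="Path to folder for TensorBoard logs")
    return parser


# Parse an explicit argument list instead of sys.argv to keep the example self-contained.
args = build_train_parser().parse_args(
    ["--exp_name", "test", "--batch_size", "50", "--resume_last"]
)
```

Arguments that are omitted fall back to the defaults from the table, so `args.workers` stays at 0 and `args.head` at 8 in this example.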

We recommend using a batch size of 100 during the SCST stage, since it accelerates convergence without an obvious decline in accuracy.

For example, to train our model with the parameters used in our experiments, use:

python train.py --exp_name test --batch_size 50 --head 8 --features_path ~/datassd/coco_detections.hdf5 --annotation_folder annotation --workers 8 --rl_batch_size 100 --image_field FasterImageDetectionsField --model transformer --seed 118

Owner

lyricpoem
