More Grounded Image Captioning by Distilling Image-Text Matching Model

Last update: Dec 16, 2022

Related tags

Deep Learning Grounded-Image-Captioning

Overview

More Grounded Image Captioning by Distilling Image-Text Matching Model

Requirements

Python 3.7
PyTorch 1.2
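
To confirm that the active environment matches these versions before running anything, a small check along the following lines may help. It is only a sketch: the function name check_environment is hypothetical and not part of this repository, and it compares only the major and minor version numbers.

```python
import sys


def check_environment(expected_python=(3, 7), expected_torch="1.2"):
    """Return a list of warnings; an empty list means both versions match."""
    warnings = []

    # Compare only (major, minor) of the running interpreter.
    found_python = tuple(sys.version_info[:2])
    if found_python != tuple(expected_python):
        warnings.append(
            "Python %d.%d expected, found %d.%d"
            % (tuple(expected_python) + found_python)
        )

    # PyTorch may be absent entirely, so import it defensively.
    try:
        import torch
    except ImportError:
        warnings.append("PyTorch is not installed")
    else:
        # torch.__version__ looks like "1.2.0" or "1.2.0+cu92".
        found_torch = torch.__version__.split(".")[:2]
        if found_torch != expected_torch.split("."):
            warnings.append(
                "PyTorch %s expected, found %s"
                % (expected_torch, torch.__version__)
            )
    return warnings


if __name__ == "__main__":
    for message in check_environment():
        print("WARNING:", message)
```

An empty result does not guarantee that the code runs, only that the two listed versions are the ones in use.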

Prepare data

Clone this repository with git clone --recurse-submodules and follow the initialization steps in coco-caption/README.md. Then download the Flickr30k reference file and place it under coco-caption/annotations. Also download Stanford CoreNLP 3.9.1 for grounding evaluation and place the uncompressed folder under the tools/ directory.
Download the preprocessed dataset from this link and extract it to data/.
For Flickr30k-Entities, download the bottom-up visual features extracted by Anderson's extractor (or Zhou's extractor) from this link (link) and place the uncompressed folders under data/flickrbu/. For MSCOCO, follow this instruction to prepare the bottom-up features and place them under data/mscoco/.
Download the pretrained models from here and extract them to log/.
Download the pretrained SCAN models from this link and extract them to misc/SCAN/runs.
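
After these steps, the repository should contain the directories named above. The following sketch lists any that are missing. The helper name missing_dirs is hypothetical and not part of this repository; it checks only that the directories exist, not that their contents are complete.

```python
from pathlib import Path

# Directories mentioned in the data preparation steps above.
EXPECTED_DIRS = [
    "coco-caption/annotations",
    "tools",
    "data/flickrbu",
    "data/mscoco",
    "log",
    "misc/SCAN/runs",
]


def missing_dirs(repo_root="."):
    """Return the expected directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs(".")
    if missing:
        print("Missing directories:")
        for d in missing:
            print("  " + d)
    else:
        print("All expected directories are present.")
```

Run it from the repository root. If only Flickr30k-Entities is used, a missing data/mscoco entry can be ignored.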

Evaluation

To reproduce the results reported in the paper, run

bash eval_flickr.sh

for Flickr30k-Entities and

bash eval_coco.sh

for MSCOCO.

Training

For the first training stage, run

python train.py --id CE-scan-sup-0.1kl --caption_model topdown --input_json data/flickrtalk.json --input_fc_dir data/flickrbu/flickrbu_fc --input_att_dir data/flickrbu/flickrbu_att --input_box_dir data/flickrbu/flickrbu_box --input_label_h5 data/flickrtalk_label.h5 --batch_size 29 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --checkpoint_path log/CE-scan-sup-0.1kl --save_checkpoint_every 1000 --val_images_use -1 --max_epochs 30 --att_supervise True --att_supervise_weight 0.1

For the second training stage, run

python train.py --id sc-ground-CE-scan-sup-0.1kl --caption_model topdown --input_json data/flickrtalk.json --input_fc_dir data/flickrbu/flickrbu_fc --input_att_dir data/flickrbu/flickrbu_att --input_box_dir data/flickrbu/flickrbu_box --input_label_h5 data/flickrtalk_label.h5 --batch_size 29 --learning_rate 5e-5 --start_from log/CE-scan-sup-0.1kl --checkpoint_path log/sc-ground-CE-scan-sup-0.1kl --save_checkpoint_every 1000 --language_eval 1 --val_images_use -1 --self_critical_after 30 --max_epochs 110 --cider_reward_weight 1 --ground_reward_weight 1
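
The two commands above share most of their arguments and differ mainly in learning rate, the checkpoint they start from, and the stage-specific loss or reward weights. The sketch below makes that split explicit by assembling both argument lists from a shared part. The helper build_command and the constant names are hypothetical and not part of this repository; all flag names and values are copied from the commands above, and the sketch assumes train.py does not depend on the order of its flags.

```python
import shlex

# Arguments shared by both training stages.
COMMON = [
    "--caption_model", "topdown",
    "--input_json", "data/flickrtalk.json",
    "--input_fc_dir", "data/flickrbu/flickrbu_fc",
    "--input_att_dir", "data/flickrbu/flickrbu_att",
    "--input_box_dir", "data/flickrbu/flickrbu_box",
    "--input_label_h5", "data/flickrtalk_label.h5",
    "--batch_size", "29",
    "--save_checkpoint_every", "1000",
    "--val_images_use", "-1",
]

STAGE1_ID = "CE-scan-sup-0.1kl"
STAGE2_ID = "sc-ground-CE-scan-sup-0.1kl"


def build_command(stage):
    """Return the argument list for training stage 1 or 2."""
    if stage == 1:
        # Stage 1: training with attention supervision.
        extra = [
            "--id", STAGE1_ID,
            "--learning_rate", "5e-4",
            "--learning_rate_decay_start", "0",
            "--scheduled_sampling_start", "0",
            "--checkpoint_path", "log/" + STAGE1_ID,
            "--max_epochs", "30",
            "--att_supervise", "True",
            "--att_supervise_weight", "0.1",
        ]
    elif stage == 2:
        # Stage 2: resumes from the stage 1 checkpoint and adds the
        # CIDEr and grounding reward weights.
        extra = [
            "--id", STAGE2_ID,
            "--learning_rate", "5e-5",
            "--start_from", "log/" + STAGE1_ID,
            "--checkpoint_path", "log/" + STAGE2_ID,
            "--language_eval", "1",
            "--self_critical_after", "30",
            "--max_epochs", "110",
            "--cider_reward_weight", "1",
            "--ground_reward_weight", "1",
        ]
    else:
        raise ValueError("stage must be 1 or 2")
    return ["python", "train.py"] + COMMON + extra


if __name__ == "__main__":
    for stage in (1, 2):
        print(" ".join(shlex.quote(arg) for arg in build_command(stage)))
```

The returned list can be passed directly to subprocess.run, or printed and pasted into a shell as shown in the main block.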

Citation

@inproceedings{zhou2020grounded,
  title={More Grounded Image Captioning by Distilling Image-Text Matching Model},
  author={Zhou, Yuanen and Wang, Meng and Liu, Daqing and  Hu, Zhenzhen and Zhang, Hanwang},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  year={2020}
}

Acknowledgements

This repository is built upon self-critical.pytorch, SCAN and grounded-video-description. Thanks to their authors for releasing the code.

Owner

YE Zhou
