Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

Last update: Dec 30, 2022

Overview

DeCLIP

Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm.

Our paper is available on arXiv.

Updates

**Our code, dataset, and models will be released soon.**

Introduction

Recently, large-scale Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks. However, CLIP is quite data-hungry and requires 400M image-text pairs for pre-training, which restricts its adoption. This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP), to alleviate this limitation. We demonstrate that by carefully utilizing the widespread supervision among the image-text pairs, DeCLIP can learn generic visual features more efficiently. Instead of using only the single image-text contrastive supervision, we fully exploit the data's potential through (1) self-supervision within each modality; (2) multi-view supervision across modalities; and (3) nearest-neighbor supervision from other similar pairs. Benefiting from these intrinsic supervision signals, our DeCLIP-ResNet50 achieves 60.4% zero-shot top-1 accuracy on ImageNet, which is 0.8% above CLIP-ResNet50 while using 7.1× less data. Our DeCLIP-ResNet50 outperforms its counterpart on 8 out of 11 visual datasets when transferred to downstream tasks. Moreover, scaling up the model and compute also works well in our framework.
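To make the "image-text contrastive supervision" mentioned above concrete, here is a minimal sketch in plain Python. It is not the released DeCLIP code. It assumes a symmetric InfoNCE-style loss over a batch of paired embeddings, and it reads "multi-view supervision" as averaging that loss over every (image view, text view) combination. The function names, the temperature value of 0.07, and the toy embeddings are all illustrative assumptions.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-text contrastive loss for one batch.

    image_embs[i] and text_embs[i] are assumed to come from the same pair.
    The temperature is an assumed value, not one taken from this README.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(logits_t))

def multi_view_loss(image_views, text_views, temperature=0.07):
    """Average the contrastive loss over all (image view, text view) combinations.

    This is one hypothetical reading of "multi-view supervision across
    modalities"; each element of image_views / text_views is a full batch.
    """
    losses = [contrastive_loss(iv, tv, temperature)
              for iv in image_views for tv in text_views]
    return sum(losses) / len(losses)

# Toy usage: two pairs with 2-d embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
texts_matched = [[1.0, 0.0], [0.0, 1.0]]
texts_swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = contrastive_loss(images, texts_matched)
loss_swapped = contrastive_loss(images, texts_swapped)
```

In this toy case the matched batch yields a loss close to zero, while the swapped batch is penalized heavily, which is the behavior the contrastive objective is meant to encourage.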

Model

Our pretrained visual backbone models (w/o text encoder)

DeCLIP_r50: Google Drive
DeCLIP_vitb32: Google Drive

Citing DeCLIP

@misc{li2021supervision,
      title={Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm}, 
      author={Yangguang Li and Feng Liang and Lichen Zhao and Yufeng Cui and Wanli Ouyang and Jing Shao and Fengwei Yu and Junjie Yan},
      year={2021},
      eprint={2110.05208},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Owner

Sense-GVT
