CLASP - Contrastive Language-Aminoacid Sequence Pretraining

Last update: Dec 29, 2022

Related tags

Deep Learning clasp

Overview

CLASP - Contrastive Language-Aminoacid Sequence Pretraining

Repository for creating models pretrained on language and aminoacid sequences similar to ConVIRT, CLIP, and ALIGN.

Work in progress - more updates soon!

Requirements

You can install the requirements with the following

$ python setup.py install --user

Then, you must install Microsoft's sparse attention CUDA kernel with the following two steps.

$ sh install_deepspeed.sh

Next, you need to pip install the package triton

$ pip install triton

If both of the above succeeded, now you can train your long biosequences with CLASP

Usage

import torch
from torch.optim import Adam

from clasp import CLASP, Transformer, tokenize

# instantiate the attention models for text and bioseq

text_enc = Transformer(
    num_tokens = 20000,
    dim = 512,
    depth = 6,
    seq_len = 1024
)

bioseq_enc = Transformer(
    num_tokens = 21,
    dim = 512,
    depth = 6,
    seq_len = 512,
    sparse_attn = True
)

# clasp (CLIP) trainer

clasp = CLASP(
    text_encoder = text_enc,
    bioseq_encoder = bioseq_enc
)

# data

text, text_mask = tokenize(['Spike protein S2: HAMAP-Rule:MF_04099'], context_length = 1024, return_mask = True)

bioseq = torch.randint(0, 21, (1, 511))         # when using sparse attention, should be 1 less than the sequence length
bioseq_mask = torch.ones_like(bioseq).bool()

# do the below with large batch sizes for many many iterations

opt = Adam(clasp.parameters(), lr = 3e-4)

loss = clasp(
    text,
    bioseq,
    text_mask = text_mask,
    bioseq_mask = bioseq_mask,
    return_loss = True               # set return loss to True
)

loss.backward()

Once trained

scores = clasp(
    texts,
    bio_seq,
    text_mask = text_mask,
    bioseq_mask = bioseq_mask
)

Resources

See interesting resources (feel free to add interesting material that could be useful).

Citations

@article{zhang2020contrastive,
  title={Contrastive learning of medical visual representations from paired images and text},
  author={Zhang, Yuhao and Jiang, Hang and Miura, Yasuhide and Manning, Christopher D and Langlotz, Curtis P},
  journal={arXiv preprint arXiv:2010.00747},
  year={2020}
}

OpenAI blog post "CLIP: Connecting Text and Images"

@article{radford2021learning,
  title={Learning transferable visual models from natural language supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and others},
  journal={arXiv preprint arXiv:2103.00020},
  year={2021}
}

@article{jia2021scaling,
  title={Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision},
  author={Jia, Chao and Yang, Yinfei and Xia, Ye and Chen, Yi-Ting and Parekh, Zarana and Pham, Hieu and Le, Quoc V and Sung, Yunhsuan and Li, Zhen and Duerig, Tom},
  journal={arXiv preprint arXiv:2102.05918},
  year={2021}
}

CLASP - Contrastive Language-Aminoacid Sequence Pretraining

Related tags

Overview

CLASP - Contrastive Language-Aminoacid Sequence Pretraining

Requirements

Usage

Resources

Citations

Owner

Michael Pieler

Distributed Evolutionary Algorithms in Python

SpecAugmentPyTorch - A Pytorch (support batch and channel) implementation of GoogleBrain's SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

code release for USENIX'22 paper `On the Security Risks of AutoML`

Source code for the plant extraction workflow introduced in the paper “Agricultural Plant Cataloging and Establishment of a Data Framework from UAV-based Crop Images by Computer Vision”

CodeContests is a competitive programming dataset for machine-learning

Source code for From Stars to Subgraphs

Open Source Differentiable Computer Vision Library for PyTorch

SuperSDR: multiplatform KiwiSDR + CAT transceiver integrator

FIRM-AFL is the first high-throughput greybox fuzzer for IoT firmware.

A U-Net combined with a variational auto-encoder that is able to learn conditional distributions over semantic segmentations.

Source code for CAST - Crisis Domain Adaptation Using Sequence-to-sequence Transformers (Accepted to ISCRAM 2021, CorePaper).

The final project of "Applying AI to 3D Medical Imaging Data" from "AI for Healthcare" nanodegree - Udacity.

Machine learning and Deep learning models, deploy on telegram (the best social media)

Generic image compressor for machine learning. Pytorch code for our paper "Lossy compression for lossless prediction".

ChainerRL is a deep reinforcement learning library built on top of Chainer.

TensorFlow implementation of ENet

DeepDiffusion: Unsupervised Learning of Retrieval-adapted Representations via Diffusion-based Ranking on Latent Feature Manifold

Modular Probabilistic Programming on MXNet

ECLARE: Extreme Classification with Label Graph Correlations

PyTorch implementation of DD3D: Is Pseudo-Lidar needed for Monocular 3D Object detection?