Evolutionary Scale Modeling (esm): Pretrained language models for proteins

Evolutionary Scale Modeling

This repository contains code and pre-trained weights for Transformer protein language models from Facebook AI Research, including our state-of-the-art ESM-1b and MSA Transformer. Transformer protein language models were introduced in our paper, "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences" (Rives et al., 2019).

ESM-1b outperforms all tested single-sequence protein language models across a range of structure prediction tasks. The MSA Transformer (ESM-MSA-1) can improve performance further by leveraging MSA information.

Citation
@article{rives2019biological,
  author={Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Liu, Jason and Guo, Demi and Ott, Myle and Zitnick, C. Lawrence and Ma, Jerry and Fergus, Rob},
  title={Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences},
  year={2019},
  doi={10.1101/622803},
  url={https://www.biorxiv.org/content/10.1101/622803v4},
  journal={bioRxiv}
}

Main models you should use

| Shorthand | esm.pretrained. | Dataset | Description |
|---|---|---|---|
| ESM-1b | esm1b_t33_650M_UR50S() | UR50 | SOTA general-purpose protein language model. Can be used to predict structure, function and other protein properties directly from individual sequences. Released with Rives et al. 2019 (Dec 2020 update). |
| ESM-MSA-1b | esm_msa1b_t12_100M_UR50S() | UR50 + MSA | MSA Transformer language model. Can be used to extract embeddings from an MSA. Enables SOTA inference of structure. Released with Rao et al. 2021 (ICML'21 version, June 2021). |
| ESM-1v | esm1v_t33_650M_UR90S_1() ... esm1v_t33_650M_UR90S_5() | UR90 | Language model specialized for prediction of variant effects. Enables SOTA zero-shot prediction of the functional effects of sequence variations. Same architecture as ESM-1b, but trained on UniRef90. Released with Meier et al. 2021. |

For a complete list of available models, with details and release notes, see Pre-trained Models.

Comparison to related works

| Task | Unsupervised contact prediction | | | Supervised contact prediction | | SSP |
|---|---|---|---|---|---|---|
| Test set | Large valid | CASP13-FM | CAMEO | CASP13-FM | CAMEO | CB513 |
| Gremlin (Potts) | 39.3 | 16.9 | 24.0 | 40.1 | 47.3 | |
| UniRep | | | | 11.2 | 17.8 | 58.4 |
| SeqVec | | | | 13.8 | 22.5 | 62.1 |
| TAPE | 11.2 | 5.5 | 6.8 | 12.3 | 15.9 | 58.0 |
| ProtBert-BFD | 34.1 | 13.5 | 23.9 | 24.7 | 37.0 | 70.0 |
| Prot-T5-XL-BFD | 35.6 | 16.5 | 25.9 | 25.0 | 40.8 | 71.4 ± 0.3 |
| ESM-1 | 33.7 | 13.6 | 21.4 | (todo) | (todo) | 69.2 |
| ESM-1b | 41.1 | 17.0 | 30.9 | 28.2 | 44.4 | 71.6 ± 0.1 |
| ESM-1v | 35.3 | 14.2 | 24.4 | | | |
| ESM-MSA-1b | 57.4 | 44.8 | 43.5 | 54.6 | 55.8 | 73.4 ± 0.3 |

Comparison to related protein language models on structure prediction tasks.

  • All contact numbers are top-L long-range (LR) precision, where long range means a sequence separation of at least 24 residues.
  • For unsupervised contact prediction, a sparse linear combination of the attention heads is used to directly predict protein contacts, fitted with logistic regression on 20 structures. For more details on the method, see Rao et al. 2020.
  • All supervised contact prediction results use the same ResNet (32 layers) and trRosetta training data; cf. Rao et al. 2021.
  • (SSP) Secondary structure Q8 accuracy on CB513, with the Transformer fine-tuned using a convolution + LSTM head.
  • Direct coupling analysis methods (Gremlin, mfDCA, Psicov) and ESM-MSA-1 use the trRosetta MSAs, while other methods predict from single sequence.
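To make the top-L long-range precision metric in the notes above concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the evaluation code behind the table: the function name, the nested-list inputs, and the toy data are all hypothetical.

```python
def top_l_long_range_precision(probs, contacts, min_separation=24):
    """Precision of the L highest-scoring long-range residue pairs.

    probs    -- L x L nested list of predicted contact probabilities
    contacts -- L x L nested list of booleans (True = residues in contact)
    """
    length = len(probs)
    # Collect every residue pair (i, j) separated by at least `min_separation`.
    candidates = [
        (probs[i][j], contacts[i][j])
        for i in range(length)
        for j in range(i + min_separation, length)
    ]
    if not candidates:
        return 0.0
    # Rank pairs by predicted probability, highest first, and keep the top L.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    top = candidates[:length]
    return sum(1 for _, is_contact in top if is_contact) / len(top)


# Toy data (hypothetical): 60 residues, true contacts at separation 30 only,
# and a predictor that scores exactly those pairs highly.
L = 60
contacts = [[abs(i - j) == 30 for j in range(L)] for i in range(L)]
probs = [[0.9 if abs(i - j) == 30 else 0.1 for j in range(L)] for i in range(L)]
precision = top_l_long_range_precision(probs, contacts)
```

In this toy case there are only 30 true long-range contacts, so the top 60 pairs contain 30 hits and 30 misses.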

Usage

Quick Start

As a prerequisite, you must have PyTorch 1.5 or later installed to use this repository.

You can use this one-liner for installation:

$ pip install fair-esm

We also support PyTorch Hub, which removes the need to clone and/or install this repository yourself:

import torch
model, alphabet = torch.hub.load("facebookresearch/esm:main", "esm1b_t33_650M_UR50S")

Then, you can load and use a pretrained model as follows:

import torch
import esm

# Load ESM-1b model
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()  # disables dropout for deterministic results

# Prepare data (first 2 sequences from ESMStructuralSplitDataset superfamily / 4)
data = [
    ("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"),
    ("protein2", "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE"),
    ("protein2 with mask","KALTARQQEVFDLIRD<mask>ISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE"),
    ("protein3",  "K A <mask> I S Q"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

# Extract per-residue representations (on CPU)
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[33], return_contacts=True)
token_representations = results["representations"][33]

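# Generate per-sequence representations via averaging
# NOTE: token 0 is always a beginning-of-sequence token, so the first residue is token 1.
# Count non-padding tokens per sequence instead of using len(seq): a "<mask>" token or
# spaces in the input string would otherwise give the wrong length.
batch_lens = (batch_tokens != alphabet.padding_idx).sum(1)
sequence_representations = []
for i, tokens_len in enumerate(batch_lens):
    # tokens_len includes the beginning- and end-of-sequence tokens, so drop both
    sequence_representations.append(token_representations[i, 1 : tokens_len - 1].mean(0))

# Look at the unsupervised self-attention map contact predictions
import matplotlib.pyplot as plt
for (_, seq), tokens_len, attention_contacts in zip(data, batch_lens, results["contacts"]):
    num_residues = tokens_len - 2  # the contact map excludes the two special tokens
    plt.matshow(attention_contacts[:num_residues, :num_residues])
    plt.title(seq)
    plt.show()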

Compute embeddings in bulk from FASTA

We provide a script that efficiently extracts embeddings in bulk from a FASTA file. A CUDA device is optional and will be auto-detected. The following command extracts embeddings from layers 0, 32, and 33 (the final layer) of the ESM-1b model for a FASTA file:

$ python extract.py esm1b_t33_650M_UR50S examples/some_proteins.fasta examples/some_proteins_emb_esm1b/ \
    --repr_layers 0 32 33 --include mean per_tok

Directory examples/some_proteins_emb_esm1b/ now contains one .pt file per FASTA sequence; use torch.load() to load them. extract.py has flags that determine what's included in the .pt file:

  • --repr_layers (default: final only) selects which layers to include embeddings from.
  • --include specifies what embeddings to save. You can use the following:
    • per_tok includes the full sequence, with an embedding per amino acid (seq_len x hidden_dim).
    • mean includes the embeddings averaged over the full sequence, per layer.
    • bos includes the embeddings from the beginning-of-sequence token. (NOTE: Don't use with the pre-trained models - we trained without bos-token supervision)
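As a rough sketch of how the saved files might be read back, the helper below loads the mean embedding for one sequence. The function names are hypothetical. The file naming (`<label>.pt`) and the dictionary keys are assumptions about what extract.py writes, so check them against the script version you are running.

```python
from pathlib import Path


def embedding_file(output_dir, label):
    """Path of the .pt file for one FASTA record (assumed naming: <label>.pt)."""
    return Path(output_dir) / f"{label}.pt"


def load_mean_embedding(output_dir, label, layer=33):
    """Return the sequence-averaged embedding saved for `label` at `layer`."""
    import torch  # imported here so the path helper above works without PyTorch

    record = torch.load(embedding_file(output_dir, label), map_location="cpu")
    # Assumed keys: "mean_representations" (from --include mean) and
    # "representations" (from --include per_tok), each a dict keyed by layer number.
    return record["mean_representations"][layer]
```

For the command above, `load_mean_embedding("examples/some_proteins_emb_esm1b", label)` would be expected to return a vector of length 1280 for ESM-1b, given a `label` that matches one of the FASTA headers.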

Zero-shot variant prediction

See "./variant-prediction/" for code and pre-trained weights for the ESM-1v models described in "Language models enable zero-shot prediction of the effects of mutations on protein function" (Meier et al. 2021).

Notebooks

Supervised variant prediction - training a classifier on the embeddings

To help you get started with the embeddings, this Jupyter notebook tutorial shows how to train a supervised variant predictor using embeddings from ESM-1. You can adopt a similar protocol to train a model for any downstream task, even with limited data. First, obtain the embeddings for examples/P62593.fasta, either by downloading the precomputed embeddings as instructed in the notebook or by running the following:

# Obtain the embeddings
$ python extract.py esm1_t34_670M_UR50S examples/P62593.fasta examples/P62593_reprs/ \
    --repr_layers 34 --include mean

Then, follow the remaining instructions in the tutorial. You can also run the tutorial in a Colab notebook.

Note that this tutorial is somewhat outdated: use the ESM-1v models (esm1v_t33_650M_UR90S_[1-5]) instead, and see the newer instructions for zero-shot variant prediction, which requires no supervised training.

Unsupervised contact prediction

This Jupyter notebook tutorial demonstrates contact prediction with both the ESM-1b and MSA Transformer (ESM-MSA-1) models. Contact prediction is based on a logistic regression over the model's attention maps. This methodology is based on our ICLR 2021 paper, "Transformer protein language models are unsupervised structure learners" (Rao et al. 2020). The MSA Transformer (ESM-MSA-1) takes a multiple sequence alignment (MSA) as input and uses the tied row self-attention maps in the same way; see "MSA Transformer" (Rao et al. 2021).

To get unsupervised attention-based contacts, call model.predict_contacts(tokens) or model(tokens, return_contacts=True).

ESMStructuralSplitDataset and self-attention contact prediction

This Jupyter notebook tutorial shows how to load and index the ESMStructuralSplitDataset and how to compute unsupervised contact predictions from the self-attention maps of ESM-1b.

Available Models and Datasets

Pre-trained Models

| Shorthand | esm.pretrained. | #layers | #params | Dataset | Embedding Dim | Model URL (automatically downloaded to ~/.cache/torch/hub/checkpoints) |
|---|---|---|---|---|---|---|
| ESM-1v | esm1v_t33_650M_UR90S_[1-5] | 33 | 650M | UR90/S 2020_03 | 1280 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1v_t33_650M_UR90S_1.pt |
| ESM-MSA-1b | esm_msa1b_t12_100M_UR50S | 12 | 100M | UR50/S + MSA 2018_03 | 768 | https://dl.fbaipublicfiles.com/fair-esm/models/esm_msa1b_t12_100M_UR50S.pt |
| ESM-MSA-1 | esm_msa1_t12_100M_UR50S | 12 | 100M | UR50/S + MSA 2018_03 | 768 | https://dl.fbaipublicfiles.com/fair-esm/models/esm_msa1_t12_100M_UR50S.pt |
| ESM-1b | esm1b_t33_650M_UR50S | 33 | 650M | UR50/S 2018_03 | 1280 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1b_t33_650M_UR50S.pt |
| ESM-1 | esm1_t34_670M_UR50S | 34 | 670M | UR50/S 2018_03 | 1280 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR50S.pt |
| | esm1_t34_670M_UR50D | 34 | 670M | UR50/D 2018_03 | 1280 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR50D.pt |
| | esm1_t34_670M_UR100 | 34 | 670M | UR100 2018_03 | 1280 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR100.pt |
| | esm1_t12_85M_UR50S | 12 | 85M | UR50/S 2018_03 | 768 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t12_85M_UR50S.pt |
| | esm1_t6_43M_UR50S | 6 | 43M | UR50/S 2018_03 | 768 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t6_43M_UR50S.pt |
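
The model names in the table appear to encode the layer count, parameter count, and training dataset, and every listed URL shares one prefix. The sketch below relies on that observed pattern; it is inferred from the table rather than taken from a documented convention, and the helper names are hypothetical.

```python
import re

# Pattern inferred from the names in the table above; it is not a documented API.
_NAME = re.compile(
    r"^(?P<family>.+)_t(?P<layers>\d+)_(?P<params>\d+[MB])_(?P<dataset>UR\d+[SD]?)"
    r"(?:_(?P<member>\d+))?$"
)

# URL prefix shared by every entry in the "Model URL" column.
_BASE_URL = "https://dl.fbaipublicfiles.com/fair-esm/models"


def parse_model_name(name):
    """Split a model name such as 'esm1b_t33_650M_UR50S' into its parts."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised model name: {name}")
    parts = match.groupdict()
    parts["layers"] = int(parts["layers"])
    return parts


def checkpoint_url(name):
    """Build the download URL for a model name, following the table's pattern."""
    parse_model_name(name)  # reject names that do not follow the convention
    return f"{_BASE_URL}/{name}.pt"
```

For example, `parse_model_name("esm1v_t33_650M_UR90S_1")` reports 33 layers, the UR90S dataset, and ensemble member "1", which matches the ESM-1v row.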

Here is a chronological list of the released models and the paper they were introduced in:

| Shorthand | Release Notes |
|---|---|
| ESM-1 | Released with Rives et al. 2019 (Aug 2020 update). |
| ESM-1b | Released with Rives et al. 2019 (Dec 2020 update). See Appendix B. |
| ESM-MSA-1 | Released with Rao et al. 2021 (Preprint v1). |
| ESM-MSA-1b | Released with Rao et al. 2021 (ICML'21 version, June 2021). |
| ESM-1v | Released with Meier et al. 2021. |

ESM Structural Split Dataset

This is a five-fold cross validation dataset of protein domain structures that can be used to measure generalization of representations across different levels of structural dissimilarity. The dataset implements structural holdouts at the family, superfamily, and fold level. The SCOPe database is used to classify domains. Independently for each level of structural hold-out, the domains are split into 5 equal sets, i.e. five sets of folds, superfamilies, or families. This ensures that for each of the five partitions, structures having the same classification do not appear in both the train and test sets. For a given classification level each structure appears in a test set once, so that in the cross validation experiment each of the structures will be evaluated exactly once.
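
The idea of holding out whole structural groups can be sketched as follows. This is only an illustration of a grouped five-fold split, not the procedure used to build the dataset: the function name, the round-robin assignment, and the toy domain and superfamily labels are all assumptions.

```python
import random


def grouped_k_fold(domain_to_group, n_folds=5, seed=0):
    """Yield (train, test) domain lists in which no group spans both sides."""
    groups = sorted(set(domain_to_group.values()))
    random.Random(seed).shuffle(groups)
    # Deal the shuffled groups round-robin so each fold gets a similar number.
    fold_of_group = {group: k % n_folds for k, group in enumerate(groups)}
    for fold in range(n_folds):
        train, test = [], []
        for domain, group in sorted(domain_to_group.items()):
            (test if fold_of_group[group] == fold else train).append(domain)
        yield train, test


# Hypothetical toy data: 40 domains spread over 10 superfamilies.
domains = {f"domain{i}": f"superfamily{i % 10}" for i in range(40)}
splits = list(grouped_k_fold(domains))
```

Because folds are assigned per group rather than per domain, every domain lands in exactly one test set, and a superfamily never appears on both sides of a split.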

The dataset provides 3D coordinates, distance maps, and secondary structure labels. For further details on the construction of the dataset, see Rives et al. 2019, Appendix A.10.

This Jupyter notebook tutorial shows how to load and index the ESMStructuralSplitDataset.

Upon initialization, ESMStructuralSplitDataset downloads splits and pkl. We also provide MSAs for each of the domains. The data can also be downloaded directly from the URLs below.

| Name | Description | URL |
|---|---|---|
| splits | train/valid splits | https://dl.fbaipublicfiles.com/fair-esm/structural-data/splits.tar.gz |
| pkl | pkl objects containing sequence, SSP labels, distance map, and 3d coordinates | https://dl.fbaipublicfiles.com/fair-esm/structural-data/pkl.tar.gz |
| msas | a3m files containing MSA for each domain | https://dl.fbaipublicfiles.com/fair-esm/structural-data/msas.tar.gz |

Pre-training Dataset Split

The split files establishing which UniRef50 clusters were used as held-out evaluation set for pre-training in Rives et al. 2019 and Rao et al. 2021 can be found here:

These files contain only the UniRef50 IDs and UniRef100 IDs corresponding to the UniRef database, 2018-03 release, which is released by the UniProt Consortium under a Creative Commons Attribution (CC BY 4.0) License.

Citations

If you find the models useful in your research, we ask that you cite the relevant paper:

@article{rives2019biological,
  author={Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Liu, Jason and Guo, Demi and Ott, Myle and Zitnick, C. Lawrence and Ma, Jerry and Fergus, Rob},
  title={Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences},
  year={2019},
  doi={10.1101/622803},
  url={https://www.biorxiv.org/content/10.1101/622803v4},
  journal={bioRxiv}
}

For the self-attention contact prediction:

@article{rao2020transformer,
  author = {Rao, Roshan M and Meier, Joshua and Sercu, Tom and Ovchinnikov, Sergey and Rives, Alexander},
  title={Transformer protein language models are unsupervised structure learners},
  year={2020},
  doi={10.1101/2020.12.15.422761},
  url={https://www.biorxiv.org/content/10.1101/2020.12.15.422761v1},
  journal={bioRxiv}
}

For the MSA Transformer:

@article{rao2021msa,
  author = {Rao, Roshan and Liu, Jason and Verkuil, Robert and Meier, Joshua and Canny, John F. and Abbeel, Pieter and Sercu, Tom and Rives, Alexander},
  title={MSA Transformer},
  year={2021},
  doi={10.1101/2021.02.12.430858},
  url={https://www.biorxiv.org/content/10.1101/2021.02.12.430858v1},
  journal={bioRxiv}
}

For variant prediction using ESM-1v:

@article{meier2021language,
  author = {Meier, Joshua and Rao, Roshan and Verkuil, Robert and Liu, Jason and Sercu, Tom and Rives, Alexander},
  title = {Language models enable zero-shot prediction of the effects of mutations on protein function},
  year={2021},
  doi={10.1101/2021.07.09.450648},
  url={https://www.biorxiv.org/content/10.1101/2021.07.09.450648v1},
  journal={bioRxiv}
}

Much of this code builds on the fairseq sequence modeling framework. We use fairseq internally for our protein language modeling research. We highly recommend trying it out if you'd like to pre-train protein language models from scratch.

License

This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.

Comments
  • Provide pre-training code?

    Hi there!

    I'm trying to compare ESM to UniRep, the embedding from the Church lab, for variant function prediction. Eventually, there are a few proteins our lab would like to optimize, and ESM has some advantages over UniRep. I need to "evolutionarily fine tune" ESM, as the Church lab does for UniRep: refine the global model's weights by continuing training on a small neighborhood (~100k sequences) around the target protein.

    Could y'all provide any of the code you used in the pre-training task? E.g., your implementations of noising / masking, your loss function, or your gradient descent function?

    Thank you, I think ESM is super cool! Best, Jacob

    opened by Jacoberts 14
  • ESMAtlas api access: Rate limit?

    Hi,

    I would like to automatically retrieve structures from ESMAtlas. Basically I have a list of MGnify IDs that I want to retrieve.

    I do that using the following:

    import os
    import requests

    def get_esm_pdb_file(mgnify_id, out_dir):
        # Fetch the predicted structure for one MGnify ID and save it as a PDB file
        url = f'https://api.esmatlas.com/fetchPredictedStructure/{mgnify_id}.pdb'
        response = requests.get(url, stream=True)
        if response.status_code == 200:
            pdb_file = os.path.join(out_dir, f'{mgnify_id}.pdb')
            with open(pdb_file, 'w') as f:
                f.write(response.text)
        else:
            # Report the failing ID together with the server's status and message
            print(mgnify_id, response.status_code)
            print(response.text)
    

    To speed it up, I parallelize it over my list using a ProcessPool. For some IDs I then get the following:

    MGYP000547115894 403
    {"message":"Forbidden"}
    MGYP000526977164 403
    {"message":"Forbidden"}
    MGYP003555372774 403
    {"message":"Forbidden"}
    

    Could you provide some details if/what rate limiting is in place for querying ESMAtlas, or if there is another way to retrieve structure files?

    Thanks!

    opened by fteufel 11
  • Predicted ESM2 logits depend on other elements within a batch

    Thank you so much for all the work on ESM and ESM2. I ran into some surprising behaviour:

    Bug description: ESM2 predicts slightly different logits, even in eval mode, depending on the other elements within a batch.

    Reproduction steps

    import torch

    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    print(f'device: {device}')

    model_path = "facebookresearch/esm:main"
    model_name = "esm2_t36_3B_UR50D"
    esm_model, alphabet = torch.hub.load(model_path, model_name)

    esm_model = esm_model.eval().to(device)  # use the selected device instead of assuming CUDA
    batch_converter = alphabet.get_batch_converter()
    
    # Those are arbitrary sequences, doesn't matter which ones are used
    sequences = [
        'A' * 255,
        'Y' * 310
    ]
    
    model_input = batch_converter([(None, seq) for seq in sequences[:2]])[2]
    model_input = model_input.to(device)
    
    # Here is the surprising part:
    logits1 = esm_model(model_input[[0]])['logits']
    logits2 = esm_model(model_input)['logits']
    
    torch.linalg.norm(logits1 - logits2[0])
    
    tensor(0.3426, device='cuda:0')
    

    This gives roughly 0.3426, with many values significantly different from zero. I expected this to be caused by some kind of batch-norm-like functionality, but the model is in eval mode.

    opened by FedericoV 11
  • Error when loading esm transformer

    model, alphabet = esm.pretrained.esm_msa1b_t12_100M_UR50S() results in:

    Traceback (most recent call last):
      File "/home/kevyan/anaconda3/envs/pytorch_p37/lib/python3.7/site-packages/esm/pretrained.py", line 27, in load_hub_workaround
        data = torch.hub.load_state_dict_from_url(url, progress=False, map_location='cpu')
      File "/home/kevyan/anaconda3/envs/pytorch_p37/lib/python3.7/site-packages/torch/hub.py", line 504, in load_state_dict_from_url
        raise RuntimeError('Only one file(not dir) is allowed in the zipfile')
    RuntimeError: Only one file(not dir) is allowed in the zipfile
    During handling of the above exception, another exception occurred:
    Traceback (most recent call last):
      File "/home/kevyan/anaconda3/envs/pytorch_p37/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3331, in run_code
        exec(code_obj, self.user_global_ns, self.user_ns)
      File "<ipython-input-15-00c0cd3f832b>", line 1, in <module>
        model, alphabet = esm.pretrained.esm_msa1b_t12_100M_UR50S()
      File "/home/kevyan/anaconda3/envs/pytorch_p37/lib/python3.7/site-packages/esm/pretrained.py", line 191, in esm_msa1b_t12_100M_UR50S
        return load_model_and_alphabet_hub("esm_msa1b_t12_100M_UR50S")
      File "/home/kevyan/anaconda3/envs/pytorch_p37/lib/python3.7/site-packages/esm/pretrained.py", line 47, in load_model_and_alphabet_hub
        model_data = load_hub_workaround(url)
      File "/home/kevyan/anaconda3/envs/pytorch_p37/lib/python3.7/site-packages/esm/pretrained.py", line 32, in load_hub_workaround
        f"{torch.hub.get_dir()}/checkpoints/{fn}",
    AttributeError: module 'torch.hub' has no attribute 'get_dir'
    

    As a workaround, I tried downloading the weights directly and loading them: model, alphabet = load_model_and_alphabet('/home/kevyan/.cache/torch/checkpoints/esm_msa1b_t12_100M_UR50S.pt')

    Traceback (most recent call last):
      File "/home/kevyan/anaconda3/envs/pytorch_p37/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3331, in run_code
        exec(code_obj, self.user_global_ns, self.user_ns)
      File "<ipython-input-16-75482e322f76>", line 1, in <module>
        model, alphabet = load_model_and_alphabet('/home/kevyan/.cache/torch/checkpoints/esm_msa1b_t12_100M_UR50S.pt')
      File "/home/kevyan/anaconda3/envs/pytorch_p37/lib/python3.7/site-packages/esm/pretrained.py", line 21, in load_model_and_alphabet
        return load_model_and_alphabet_local(model_name)
      File "/home/kevyan/anaconda3/envs/pytorch_p37/lib/python3.7/site-packages/esm/pretrained.py", line 57, in load_model_and_alphabet_local
        if _has_regression_weights(model_name):
    NameError: name 'model_name' is not defined
    
    opened by yangkky 11
  • the 41 deep mutational scanning datasets

    Hi @joshim5, thanks for your work. I was reading your paper 'Language models enable zero-shot prediction of the effects of mutations on protein function' to understand the model, and I also want to apply the datasets to my modified model. I searched on GitHub but couldn't find where the 41 datasets (as in Figure 3) are located. Can you point me to them, or to any links that will give me access? Thanks a lot.

    I can

    opened by lzhangUT 9
  • Provide MGnify sequences available in ESM Metagenomic Atlas

    It would be useful if you could offer a download of the 500 million sequences of the atlas as a fasta file. I want this because it allows me to do fast local searches. The online search capability provided by the atlas takes many minutes to do a single search and we have much faster (but less sensitive) search methods. I'd like to allow users of our ChimeraX visualization software to quickly search the atlas.

    I'm aware that I can get the sequences from the EBI MGnify database (2.5 billion sequences), then filter them using the stats.parquet file for the atlas to just the ones used in the atlas. I am pursuing that currently, but there appear to be no mirrors of the MGnify database, and downloading it from EBI to the United States will take about 10 days to transfer these 250 Gbytes, apparently bottlenecked by EBI providing only ~1 Mbit/sec. A download directly from Meta would be 1/5 the size, and if you use better compression than what EBI is using (e.g. bzip2 or xzip instead of gzip) it could be 1/10th the size, ~25 Gbytes, and on a much faster connection could be downloaded in hours instead of 10 days.

    I realize I could scrape the sequences from the 15 Tbytes of structure prediction files that the Atlas provides -- but I'd prefer to not download 500 times more data than I actually need.

    Thanks!

    Tom Goddard University of California, San Francisco ChimeraX molecular visualization software developer

    opened by tomgoddard 8
  • unable to import esm

    I installed with pip install fair-esm, then tried the example code:

    
    import torch
    import esm
    
    # Load ESM-2 model
    model, alphabet = esm.pretrained.esm2_t48_15B_UR50D()
    batch_converter = alphabet.get_batch_converter()
    model.eval()  # disables dropout for deterministic results
    
    

    The following error is raised:

    ModuleNotFoundError                       Traceback (most recent call last)
    Untitled-1.ipynb Cell 1 in <cell line: 2>()
          1 import torch
    ----> 2 import esm
          4 # Load ESM-2 model
          5 model, alphabet = esm.pretrained.esm2_t48_15B_UR50D()
    
    File ~/anaconda3/envs/torch/lib/python3.8/site-packages/esm/__init__.py:9, in <module>
          6 from .version import version as __version__  # noqa
          8 from .data import Alphabet, BatchConverter, FastaBatchedDataset  # noqa
    ----> 9 from .model.esm1 import ProteinBertModel  # noqa
         10 from .model.esm2 import ESM2  # noqa
         11 from .model.msa_transformer import MSATransformer  #noqa
    
    ModuleNotFoundError: No module named 'esm.model.esm1'; 'esm.model' is not a package
    
    
    opened by pengzhangzhi 8
  • What model is needed for regression_data in esm.pretrained.load_model_and_alphabet_local?

    Bug description: When using esm.pretrained.load_model_and_alphabet_local('esm1_t12_85M_UR50S.pt'), I also need to load esm1_t12_85M_UR50S-contact-regression.pt, but this file is not provided.

    Additional context: To load a local ESM model, do you need a regression model in addition to the esm1_t12_85M_UR50S.pt you provide? What does regression_data mean in your code for esm.pretrained.load_model_and_alphabet_local?

    opened by xmhh 8
  • Broken pip install

    Hello,

    thank you for the amazing work!

    I would just like to let you know that the latest modification of the README has broken pip install: there is a README.rst among data_files in setup.py, but the repo now has README.md instead.

    Best regards, Raman

    opened by SamusRam 8
  • ModuleNotFoundError - "model" directory is missing from setup.py

    Bug description: The "model" directory is missing from the "packages" argument in the setup.py script, which raises ModuleNotFoundError: No module named 'esm.model'.

    Reproduction steps: Try to install the esm package from PyPI or from the GitHub repository.

    Expected behavior: I expected a successful installation.

    Logs

    ---------------------------------------------------------------------------
    ModuleNotFoundError                       Traceback (most recent call last)
    Input In [1], in <cell line: 2>()
          1 import torch
    ----> 2 import esm
    
    File ~/protein_embeddings/env/lib/python3.8/site-packages/esm/__init__.py:9, in <module>
          6 from .version import version as __version__  # noqa
          8 from .data import Alphabet, BatchConverter, FastaBatchedDataset  # noqa
    ----> 9 from .model.esm1 import ProteinBertModel  # noqa
         10 from .model.esm2 import ESM2  # noqa
         11 from .model.msa_transformer import MSATransformer  #noqa
    
    ModuleNotFoundError: No module named 'esm.model'
    
    opened by ptynecki 7
  • download the model weights to local

    Hi @rmrao, I am interested in using the ESM-1v models (1-5), but it takes a long time to download the model weights every time I run a sequence. Can I download the model weights (or the model) to my local workspace, e.g. in Azure Databricks, so that the model loads quickly when I need to run many sequences? If so, what code would do that? Thank you.

    opened by lzhangUT 7
  • Only CPU Malfunction -- HF Colab Notebook

    Hello, I am trying to use ESM as made available in the HuggingFace Colab Notebook. I do not have a local GPU, so I have been attempting to run in CPU-only mode. The first issue is in preparing the model and tokenizer, where it states "If using GPU, use model.cuda() to transfer the model to GPU." and the code below is:

    from transformers import AutoTokenizer, EsmForProteinFolding

    tokenizer = AutoTokenizer.from_pretrained("facebook/esmfold_v1")
    model = EsmForProteinFolding.from_pretrained("facebook/esmfold_v1", low_cpu_mem_usage=True)

    model = model.cuda()

    To get around this, I simply removed the "model = model.cuda()" line and the code ran fine. Further down, I got to the tokenization step, which requires a GPU:

    tokenized_input = tokenized_input.cuda()

    Unsurprisingly, running this produced a "NO GPU" error. So I went ahead and tried to run the actual prediction lines:

    import torch

    with torch.no_grad():
        output = model(tokenized_input)

    This displayed the following error:

    RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'

    So how are you supposed to run this notebook without a GPU? Is there something in the tokenization stage I should alter to provide good inputs for the prediction model?

    opened by coopersvajda 0
  • download model from huggingface

    Hi dear support team

    I want to use your awesome transformer, but our HPC cluster does not have internet access due to firewall restrictions. I downloaded the model and config files from Hugging Face (https://huggingface.co/facebook/esm2_t33_650M_UR50D/tree/main) and tried to load the model with:

    path = Path(__file__).resolve().parent.joinpath('esm')
    model, alphabet = esm.pretrained(path)

    This does not work and raises: TypeError: 'module' object is not callable

    How can I load the model?

    Thanks

    opened by saeid976 1
  • Adjust header in output file

    From some research, it looked to me that the pertinent part of the name in the header was the "Seita.9G099800" part, so I wanted to split that from the rest. First, see if the header is longer than one element and then isolate the "Seita.9G099800" portion:

    if len(header.split()) > 1:
        first_elem = header.split()[0]
        name = '.'.join(first_elem.split('.')[:2])
    

    If the header is just one word, simply assign the header to the name and put that in the output_file:

    else:
        name = header
    
    output_file = args.pdb / f"{name}.pdb"
    

    Please let me know if this is satisfactory, or if you had something else in mind.

    CLA Signed 
    opened by dawsonhunt 3
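
The two branches described in this pull request can be combined into one small function, sketched below. The function name and the example header are hypothetical; the logic mirrors the snippets above and makes no claim about what was eventually merged.

```python
def pdb_name_from_header(header):
    """Derive an output file stem from a FASTA header (hypothetical helper)."""
    fields = header.split()
    if len(fields) > 1:
        # Keep the first two dot-separated parts of the first field.
        return ".".join(fields[0].split(".")[:2])
    # A one-word header is used unchanged.
    return header


# Hypothetical multi-field header in the style mentioned in the pull request.
example_name = pdb_name_from_header("Seita.9G099800.1.p pacid=12345")
```

The result would then feed the existing line as `args.pdb / f"{pdb_name_from_header(header)}.pdb"`.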
  • ESM-IF1 encoder output as structure representation

    rep = esm.inverse_folding.multichain_util.get_encoder_output_for_complex(model, alphabet, coords, target_chain_id). I found that target_chain_id can only be a single chain. Could the code be extended so that target_chain_id accepts multiple chains, to obtain multi-chain structure embeddings?

    opened by LDAIprotein 0
  • can't download the contact-regression.pt for those ESM models

    Hi, thank you for your great work. Although the large language model checkpoint, e.g. esm2_t33_650M_UR50D.pt, can be downloaded directly from the repo, initializing the model also seems to need esm2_t33_650M_UR50D-contact-regression.pt, which cannot be downloaded directly from the repo. Could you please tell me how to download these contact-regression.pt files?

    opened by imSeaton 0

Computer Vision and Geometry Lab 427 Dec 27, 2022
Implementation of Convolutional LSTM in PyTorch.

ConvLSTM_pytorch This file contains the implementation of Convolutional LSTM in PyTorch made by me and DavideA. We started from this implementation an

Andrea Palazzi 1.3k Dec 29, 2022
FCAF3D: Fully Convolutional Anchor-Free 3D Object Detection

FCAF3D: Fully Convolutional Anchor-Free 3D Object Detection This repository contains an implementation of FCAF3D, a 3D object detection method introdu

SamsungLabs 153 Dec 29, 2022
paper list in the area of reinforcenment learning for recommendation systems

paper list in the area of reinforcenment learning for recommendation systems

HenryZhao 23 Jun 09, 2022
Predicting Axillary Lymph Node Metastasis in Early Breast Cancer Using Deep Learning on Primary Tumor Biopsy Slides

Predicting Axillary Lymph Node Metastasis in Early Breast Cancer Using Deep Learning on Primary Tumor Biopsy Slides Project | This repo is the officia

CVSM Group - email: <a href=[email protected]"> 33 Dec 28, 2022
Multi-Agent Reinforcement Learning (MARL) method to learn scalable control polices for multi-agent target tracking.

scalableMARL Scalable Reinforcement Learning Policies for Multi-Agent Control CD. Hsu, H. Jeong, GJ. Pappas, P. Chaudhari. "Scalable Reinforcement Lea

Christopher Hsu 17 Nov 17, 2022
Text2Art is an AI art generator powered with VQGAN + CLIP and CLIPDrawer models

Text2Art is an AI art generator powered with VQGAN + CLIP and CLIPDrawer models. You can easily generate all kind of art from drawing, painting, sketch, or even a specific artist style just using a t

Muhammad Fathy Rashad 643 Dec 30, 2022
RARA: Zero-shot Sim2Real Visual Navigation with Following Foreground Cues

RARA: Zero-shot Sim2Real Visual Navigation with Following Foreground Cues FGBG (foreground-background) pytorch package for defining and training model

Klaas Kelchtermans 1 Jun 02, 2022
A Python module for parallel optimization of expensive black-box functions

blackbox: A Python module for parallel optimization of expensive black-box functions What is this? A minimalistic and easy-to-use Python module that e

Paul Knysh 426 Dec 08, 2022