Bulk2Space is a spatial deconvolution method based on deep learning frameworks

Overview

Bulk2Space

Spatially resolved single-cell deconvolution of bulk transcriptomes using Bulk2Space

Bulk2Space is a spatial deconvolution method based on deep learning frameworks, which converts bulk transcriptomes into spatially resolved single-cell expression profiles.

Installation

Bulk2Space requires Python 3.8 or later. If you have Python 3.6 or Python 3.7 installed, consider installing Anaconda and creating a new environment:

conda create -n bulk2space python=3.8.5
conda activate bulk2space

cd bulk2space
pip install -r requirements.txt 
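
To confirm that the active environment meets the version requirement before installing the dependencies, a short check like the following can help. This is a minimal sketch: the helper name python_is_supported is hypothetical and not part of Bulk2Space, and the minimum of 3.8 is taken from the requirement stated above.

```python
import sys

# Hypothetical helper (not part of Bulk2Space): compares the interpreter
# version against the minimum stated in this README (Python 3.8).
MINIMUM = (3, 8)


def python_is_supported(version_info=sys.version_info, minimum=MINIMUM):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if python_is_supported() else "too old for Bulk2Space"
    print(f"Python {found}: {status}")
```

Run it inside the activated conda environment; if it reports a version that is too old, the environment was probably not activated.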

Usage

Run the demo data

If you use spatial barcoding-based data (like 10x Genomics or ST) as the spatial reference, run the following command:

python bulk2space.py --project_name test1 --data_path example_data/demo1 --input_sc_meta_path demo1_sc_meta.csv --input_sc_data_path demo1_sc_data.csv --input_bulk_path demo1_bulk.csv --input_st_data_path demo1_st_data.csv --input_st_meta_path demo1_st_meta.csv --BetaVAE_H --epoch 10 --spot_data True

Otherwise, if you use image-based in situ hybridization data (like MERFISH, SeqFISH, and STARmap) as the spatial reference, run the following command:

python bulk2space.py --project_name test2 --data_path example_data/demo2 --input_sc_meta_path demo2_sc_meta.csv --input_sc_data_path demo2_sc_data.csv --input_bulk_path demo2_bulk.csv --input_st_data_path demo2_st_data.csv --input_st_meta_path demo2_st_meta.csv --BetaVAE_H --epoch 10 --spot_data False
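
The two demo commands differ only in the project name, the demo folder, and the value of --spot_data. The sketch below assembles the same argument list from those three values, which may be convenient when scripting several runs. The function name build_demo_command is hypothetical; the flags and file-name pattern are copied from the commands above, and nothing here is part of Bulk2Space itself.

```python
import shlex


def build_demo_command(project_name, demo, spot_data, epoch=10):
    """Assemble the bulk2space.py command line for one demo dataset.

    `demo` is the demo folder and file prefix (for example "demo1").
    `spot_data` is True for spot-based references (10x Genomics, ST) and
    False for image-based references (MERFISH, SeqFISH, STARmap).
    """
    return [
        "python", "bulk2space.py",
        "--project_name", project_name,
        "--data_path", f"example_data/{demo}",
        "--input_sc_meta_path", f"{demo}_sc_meta.csv",
        "--input_sc_data_path", f"{demo}_sc_data.csv",
        "--input_bulk_path", f"{demo}_bulk.csv",
        "--input_st_data_path", f"{demo}_st_data.csv",
        "--input_st_meta_path", f"{demo}_st_meta.csv",
        "--BetaVAE_H",
        "--epoch", str(epoch),
        "--spot_data", str(spot_data),
    ]


if __name__ == "__main__":
    # Print both demo commands; nothing is executed here.
    for name, demo, spot in [("test1", "demo1", True), ("test2", "demo2", False)]:
        print(shlex.join(build_demo_command(name, demo, spot)))
```

The returned list can be passed to subprocess.run from the repository directory if you prefer to launch the run from Python rather than from the shell.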

Run your own data

When using your own data, make sure that:

  • the bulk.csv file contains one column of gene expression values

             Sample
    Gene1    5.22
    Gene2    3.67
    ...      ...
    GeneN    15.76
  • the sc_meta.csv file contains two columns giving the cell name and the cell type. Make sure the column names are correct, i.e., Cell and Cell_type

              Cell      Cell_type
    Cell_1    Cell_1    T cell
    Cell_2    Cell_2    B cell
    ...       ...       ...
    Cell_n    Cell_n    Monocyte
  • the st_meta.csv file contains at least two columns of spatial coordinates. Make sure the column names are correct, i.e., xcoord and ycoord

                       xcoord    ycoord
    Cell_1 / Spot_1    1.2       5.2
    Cell_2 / Spot_2    5.4       4.3
    ...                ...       ...
    Cell_n / Spot_n    11.3      6.3
  • the sc_data.csv and st_data.csv files are gene expression matrices
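
The column-name requirements above can be checked before a run with a few lines of standard-library Python. This is a minimal sketch, not part of Bulk2Space: the helper names missing_columns and bulk_has_single_column are hypothetical, and the code assumes that each CSV has a header row and that its first field holds the row label (gene, cell, or spot name).

```python
import csv
import io

# Required column names, as stated in the list above.
REQUIRED_COLUMNS = {
    "sc_meta": {"Cell", "Cell_type"},
    "st_meta": {"xcoord", "ycoord"},
}


def missing_columns(csv_file, required):
    """Return the required column names that are absent from the header row."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return sorted(set(required) - present)


def bulk_has_single_column(csv_file):
    """Check that every row has exactly two fields: row label and one value.

    Assumption: the first field of each row is the gene name (row label).
    """
    rows = list(csv.reader(csv_file))
    return bool(rows) and all(len(row) == 2 for row in rows)


if __name__ == "__main__":
    # In-memory stand-ins for real files; for files on disk use
    # open("sc_meta.csv", newline="") instead of io.StringIO.
    sc_meta = io.StringIO(",Cell,Cell_type\nCell_1,Cell_1,T cell\n")
    st_meta = io.StringIO(",x,y\nSpot_1,1.2,5.2\n")
    bulk = io.StringIO(",Sample\nGene1,5.22\nGene2,3.67\n")

    print(missing_columns(sc_meta, REQUIRED_COLUMNS["sc_meta"]))  # []
    print(missing_columns(st_meta, REQUIRED_COLUMNS["st_meta"]))  # ['xcoord', 'ycoord']
    print(bulk_has_single_column(bulk))  # True
```

An empty list means the required column names were found; any names printed are missing or misspelled in the header.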

Then you will get your results in the output_data folder.

For more details, see the user guide in the documentation.

About

The Bulk2Space manuscript is under major revision. Should you have any questions, please contact Jie Liao at [email protected], Jingyang Qian at [email protected], or Yin Fang at [email protected].

Comments
  • Data availability

    Hey team, thanks for coming up with this useful tool. I'm looking to follow your tutorial on hypothalamus deconvolution, and it seems the lcm.gz data file on your GitHub only contains a single file, without all the processed count matrices and the cell metadata table. Is that supposed to be the case? If so, I wonder how I should process this single file to generate the input data I need. Thanks for any heads up!

    opened by loganminhdang 6
  • Cannot locate the bulk2space.py script and directory after installation

    Hi, I'm writing to seek your assistance on an issue I'm having. After installing the conda environment, I cannot locate the bulk2space directory, which should contain the bulk2space Python script to run the algorithm. The installation also seems incomplete: after I manually retrieved the Python script from your GitHub page, I received the following error message:

        Traceback (most recent call last):
          File "bulk2space.py", line 2, in <module>
            from utils.tool import *
        ModuleNotFoundError: No module named 'utils'

    I would appreciate any guidance. Thanks!

    opened by loganminhdang 5
  • Preprocessed PDAC data

    Hello,

    I am trying to understand how to use bulk2space by going through the tutorials. I am currently going through the first tutorial with the PDAC datasets. I would like to know how you generated the preprocessed files "st_data" and "st_meta".

    I went to the original data from Moncada et al. (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE111672) but I don't know which files you used from there to make the above preprocessed files. Could you clarify that and explain a bit more in detail how you generated "st_data" and "st_meta"? This will be helpful to understand how to process other reference datasets.

    opened by AlexUOM 4
  • the question of "quick start" section

    Dear professors, we are very sorry to bother you. We recently downloaded bulk2space and ran the demo1 test data, but no results were output, and we cannot tell whether the data were written correctly. After the run, Bulk2space-1.0.0-Py3.8egg appeared empty. Some information is as follows. I wonder if you could help check it when your schedule allows, or whether there is any other step-by-step guidance.

    opened by coconutll 2
  • Only obtain three cell types such as tumor cell, macrophages and neutrophils from bulk data?

    Hi, thanks for coming up with this useful tool. I have bulk RNA-seq data and scRNA-seq data from the same patient, both generated by our lab. I want to convert bulk transcriptomes into spatially resolved single-cell expression profiles. Here are my questions:

    1. Why do I only obtain three cell types (tumor cells, macrophages and neutrophils) from the bulk data, when my scRNA-seq reference contains many other cell types such as fibroblasts, T cells and B cells?
    2. How should I normalize my bulk data?

    Thanks, Qi.

    opened by zhangqi234 2
  • convert bulk transcriptomes into spatially resolved single-cell expression profile

    Hi, I'm new to bulk2space, and I only have bulk RNAseq data from mouse brain which was made by our lab. I want to convert bulk transcriptomes into spatially resolved single-cell expression profile. I know how to convert bulk RNAseq data into single cell data. Here are my questions:

    1. How do I get the spatial information for my bulk RNA-seq data? Do I have to run experiments that capture spatial information, for example with laser capture microdissection (LCM)?
    2. Since my bulk RNA-seq data are from brain tissue, and the tissue contains many layers of cells, how do you distinguish between different layers of cells? Or do I have to do bulk RNA-seq on single layers?

    Thanks, Echo.

    opened by Echoloria 2
  • Cannot import CascadeForestClassifier from deepforest

    I am running the bulk2space.py script via Python 3.8.5. The deepforest package is installed and imports successfully, but I am still receiving the following error message:

    ImportError: cannot import name 'CascadeForestClassifier' from 'deepforest'

    I would appreciate any help you could offer.

    opened by sarah-chapin 2
  • Effect of Irrelevant Bulk RNA-Seq Sample and Selection of Optimal Projects for Test Data

    Hi,

    Thank you very much for putting together this code.

    I would like to better understand when Bulk2Space might help versus when there are limits to applicability to Bulk2Space, following a journal club presentation where I learned more about the paper and method.

    I apologize that I am not sure how best to precisely ask my question, but I have tried to use a few examples to try and give a sense of what I am asking about.

    Example 1 (Exact Code for Concrete Test):

    In the spirit of a GitHub “issue,” I tried to start with concrete examples for discussion based upon issue #8 .

    I have attached a summary of that analysis (PDAC_Test.pdf), and I have also attached any input files not already provided on this repository.

    However, when I changed the bulk RNA-Seq gene symbols in order to use the same gene symbols for both the PDAC example and the demo1 example, I lost the Ductal cells in the PDAC example, which otherwise still used only files derived from the same samples used for the PDAC example. I also have some more detailed notes in the uploaded PDF.

    Nevertheless, if that might possibly help the discussion, I have provided those.

    If there are any other relatively small files that it would help to upload to GitHub, then I would also be happy to add those. For example, I also ran the analysis with epoch_num=1000 instead of epoch_num=3500. I am currently not providing those results, but my impression is that they look qualitatively similar in terms of cancer cell and ductal cell assignments (for all of the provided PDAC files).

    Example 2 (Theoretical Question):

    Is it possible to run bulk2spatial as described below?

    1) Use bulk RNA-Seq + scRNA-Seq + spatial data that all come from Patient A.

    2) Export model from Patient A.

    3) Only provide bulk RNA-Seq data from patient B, and test how predictions from model defined on Patient A compare to scRNA-Seq and spatial data generated for Patient B.

    • Additionally, if I understand correctly, then I think an image of the tissue for Patient B cannot be provided. If so, I think the shape of the tissue section for Patient B can't be known, and I would guess the spatial coordinates from Bulk2Space might not be directly applicable to interpret Patient B. However, if I might be misunderstanding anything, then please let me know.

    Example 3 (Summary Questions):

    Am I correctly understanding that consecutive slides are often used in the paper? For example, the 2 slices in Figure S17f already have different shapes, and it looks like a projection of the estimates onto the histology image for slide 2 was not (or could not be?) provided.

    Data from different patients would be even more different. So, is it reasonable and/or correct to say that there is a preference to use all 3 data types generated from the same experiment? Even if the exact slice is not the same, the true composition of the multiple data types can hopefully be as close as possible?

    For example, I am not sure if the difference is sufficiently extreme, but let’s say Patient A has histology like the “Inflammation” sample in Figure 6 and Patient B has histology like the “Cancer” sample in Figure 6. If you didn’t have a spatial transcriptomics (ST) dataset for Patient B, then I think use of the ST data from Patient A might not be of much benefit to Patient B. Do you think that is a fair conclusion?

    Similarly, if your training sample had 90% tumor, then I would expect limitations in looking at the projection from a spatial transcriptomics project where the tissues had a very different percent tumor, such as closer to 20% tumor. I would also expect there often could be a challenge in even knowing the general shape of an independent/unrelated tumor sample, and I believe that you should not be able to know the spatial information for the tumor cells within an independent tissue without a more direct measurement.

    I am not sure if the points above might also possibly relate to the shift in the frequency of cancer cells per spot with the reduced/matching gene symbols in the uploaded PDF for Example 1.

    However, if I am then understanding correctly, then might that be at least somewhat contradictory to what I believe is a recommendation to use public data in issue #7? If I might be misunderstanding anything, then please let me know.

    Thank you very much for your help!

    Sincerely, Charles

    Code.zip demo1_bulk-FALSE_PDAC_LABEL.csv demo1_bulk-FALSE_PDAC_LABEL-MATCHING_SUBSET.csv pdac_bulk-MATCHING_SUBSET.csv

    SC Cell_Type_Counts.pdf SC Cell_Type_Correlation.pdf ST Spot_Deconvolution.pdf ST Cancer_Cells_per_Spot.pdf

    PDAC_Test.pdf

    opened by cwarden45 0
  • Confused about the train/test steps

    Dear Professors,

    Thanks for coming up with this great tool. However, I'm confused with how to use it by the tutorial. In PDAC deconvolution, the tutorial only uses the train_vae function, however, in demo1 tutorial for example, it uses additional load_vae_and_generate function from the .pth vae model from train_vae function.

    So here comes to my question, if I only focus on the first step to transform bulkRNA to single-cell RNA (i.e., no consideration of further scRNA to spatial RNA):

    If I have e.g., two bulkRNAseq from 2-month-old and 7-month-old mice lung cancer tissue, say bulkA and bulkB. I also have one single-cell RNA reference, say scRNAref. When I deconvolute bulkA using scRNAref to a new, bulk2space-generated scRNA data (name it "generated-scRNA from bulkA"), I will get a .pth vae model (name it "A.pth"). Next, when I'd like to deconvolute bulkB, which step should I use? Should I 1) use "load_vae_and_generate" function that use the previous A.pth model, or 2) use "train_vae" function that will generate a new B.pth model?

    I believe this is crucial because it directly guides us how to use this tool. In CIBERSORT, we provide only two variables, the bulkRNAseq and the reference immune cell expression profile. The reference would not change most of the time, thus we just feed CIBERSORT with many bulkRNAseq dataset and it will return many generated immune cell expression dataframes. Simple and easy. But in Bulk2space, we got a new .pth model everytime if we follow step 2, and to be honest, I don't know what this .pth model is used for if not following step1 to use it to load and generate new scRNA dataset.

    Besides issues above, if we use step 1), there'll also be problems. What if bulkA and bulkB are from different status of tissues as the example above? I see that in the article, you mentioned that "the state of each cell type still fluctuates within a relatively stable high-dimensional space". But if bulkA was from a pre-cancerous tissue, and bulkB was from a cancerous tissue, would bulk2space still work fine? This is important because if we'd like to deconvolute bulkRNAseq from longitudinal dataset, for example, a series of bulkRNAseq data from 10 timepoints along cancer progression that contains normal, pre-cancerous, turning stage and finally cancerous tissue, or a series of bulkRNAseq data from different development stages of liver, what is the correct way of using bulk2space if I want single-cell RNA dataset from bulkRNA? Would bulk2space still work under this scenario?

    Also, does bulk2space requires that scRNA ref and bulkRNA are from similar status of tissue? For example, can bulk2space deconvolute bulkRNA derived from cancer lung using the reference scRNA derived from normal lung?

    Actually, I've tried to use step 1 (i.e., the same model) to deal with my longitudinal dataset, but the results seemed nearly identical in the distribution of cell types that bulk2space returned (there should be some difference, at least in immune cell types, since I'm deconvoluting bulkRNA from normal and cancer tissues using the same scRNA ref). Another key issue is that I don't know whether the generated sc_cell_type and sc_data dataframes can be treated as a standard Seurat object to which we can apply a standard analysis pipeline (filtering on nFeature and nCount, scaling, centering, PCA, UMAP, or newly assigning cell types with the FindMarkers function, etc.). I've tried this, but PCA, tSNE and UMAP can't separate the cell types well. I also don't know whether different scRNA datasets generated by bulk2space can be integrated into a single Seurat object like other normal single-cell data.

    Thank you so much. It would be of great help if the experts in your team who developed this nice tool could answer the questions above.

    opened by Bennylikescoding 1
  • β-VAE algorithm in the paper

    Hello, author. In Figure 1b of your paper, I don't understand how β-VAE can estimate the proportion of cells of each cell type. I have studied this algorithm carefully, and its input and output should correspond, so I don't understand why the input is a cell type while the output is a single cell. Could you please explain this, or tell me what the input data of this step is?

    opened by wxpbioinfo 0
  • Question: Scalability

    Good day,

    I am eager to test this excellent tool on our data. I have seen in the tutorial and demo data that the vignette uses only one bulk RNA sample as well as an ST experiment.

    Is it possible to scale up and process several bulk RNA samples and ST experiments in one go? And for the inferred single-cell data derived from the bulk, can we have those integrated across multiple biological replicates, as if they were truly scRNA-seq data?

    Thanks in advance!

    opened by ccruizm 2
  • model.train_df_and_spatial_deconvolution error

    Hi, thanks for coming up with this useful tool. When I ran the model.train_df_and_spatial_deconvolution function to decompose ST data into spatially resolved single-cell transcriptomics data, I got the following error: "pandas.errors.MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False". I don't know what caused this error.

    opened by zhangqi234 7
Releases(v1.0.0)
Owner
Dr. FAN, Xiaohui
single-cell omics; spatial transcriptomics; TCM network biology