An adaptable Snakemake workflow which uses GATKs best practice recommendations to perform germline mutation calling starting with BAM files

Overview

Germline Mutation Calling

This Snakemake workflow follows the GATK best-practice recommandations to call small germline variants.

The pipeline requires as inputs aligned BAM files (e.g. with BWA) where the duplicates are already marked (e.g. with Picard or sambamba). It then performed Base Quality Score Recalibration and joint genotyping of multiple samples, which is automatically parallized over user defined intervals (for examples see intervals.txt) and chromosomes.

Filtering is performed using GATKs state-of-the-art Variant Quality Score Recalibration

At the end of the worklow, the Variant Effect Predictor is used to annotate the identified germline mutations.

A high level overview of the performed steps can be seen below:

DAG

As seen by the execution graph, an arbitrary number of samples/BAM files can be processed in parallel up to the joint variant calling.

Installation

Required tools:

The majority of the listed tools can be quite easily installed with conda which is recommanded.

Usage

First, modify the config_wgs.yaml and resources.yaml files. Both files contain detailed description what is expected. The config_wgs.yaml also contains links to some reference resources. Be careful, they are all specific for the GRCh37/hg19/b37 genome assembly.

After setting up all the config files and installing all tools, you can simply run:

snakemake --latency-wait 300 -j 5 --cluster "sbatch --mem={resources.mem_mb} --time {resources.runtime_min} --cpus-per-task {threads} --job-name={rule}.%j --output snakemake_cluster_submit_log/{rule}.%j.out --mail-type=FAIL"

This assumes that the cluster you are using is running SLURM. If this is not the case, you have to adjust the command after --cluster. The log information of each job will be safed in the snakemake_cluster_submit_log directory. This directory will not be created automatically.

-j specifies the number of jobs/rules should be submitted in parallel.

I recommand running this command in a detached session with tmux or screen.

Output

Below is the output of the tree command, after the workflow has finished for one patient H005-00ML. Usually you would include many patients simultaneously (>50). This is just to illustrate the created output files.

.
├── cohort
│ ├── benchmark
│ │ ├── ApplyVQSR_indel.txt
│ │ ├── ApplyVQSR_snp.txt
│ │ ├── CombineGVCFs.txt
│ │ ├── GenotypeGVCFs.txt
│ │ ├── MergeCohortVCFs.txt
│ │ ├── SelectVariants.txt
│ │ ├── VEP.txt
│ │ ├── VQSR_indel.txt
│ │ └── VQSR_snp.txt
│ ├── cohort.recalibrated.pass.vep.vcf.gz
│ ├── cohort.recalibrated.pass.vep.vcf.gz_summary.html
│ ├── cohort.recalibrated.vcf.gz
│ ├── cohort.recalibrated.vcf.gz.tbi
│ └── logs
│     ├── ApplyVQSR_indel.out
│     ├── ApplyVQSR_snp.out
│     ├── CombineGVCFs
│     ├── CombineGVCFs.1.out
│     ├── CombineGVCFs.2.out
│     ├── ...
│     ├── ...
│     ├── CombineGVCFs.Y.out
│     ├── GenotypeGVCFs.1.out
│     ├── GenotypeGVCFs.2.out
│     ├── ...
│     ├── ...
│     ├── GenotypeGVCFs.Y.out
│     ├── MakeSitesOnly.out
│     ├── MergeCohortVCFs.out
│     ├── SelectVariants.err
│     ├── VEP.out
│     ├── VQSR_indel.out
│     └── VQSR_snp.out
├── config
│ ├── config_wgs.yaml
│ └── resources.yaml
├── H005-00ML
│ ├── benchmark
│ │ ├── ApplyBQSR.txt
│ │ ├── BaseRecalibrator.txt
│ │ ├── GatherBQSRReports.txt
│ │ ├── GatherRecalBamFiles.txt
│ │ ├── HaplotypeCaller.txt
│ │ ├── IndexBam.txt
│ │ ├── MergeHaplotypeCaller.txt
│ │ └── SortBam.txt
│ ├── H005-00ML.germline.merged.g.vcf.gz
│ ├── H005-00ML.germline.merged.g.vcf.gz.tbi
│ └── logs
│     ├── ApplyBQSR
│     ├── ApplyBQSR.0000-scattered.interval_list.out
│     ├── ApplyBQSR.0001-scattered.interval_list.out
│     ├── ...
│     ├── ...
│     ├── ApplyBQSR.0049-scattered.interval_list.out
│     ├── BaseRecalibrator
│     ├── BaseRecalibrator.0000-scattered.interval_list.out
│     ├── BaseRecalibrator.0001-scattered.interval_list.out
│     ├── ...
│     ├── ...
│     ├── BaseRecalibrator.0049-scattered.interval_list.out
│     ├── GatherBQSRReports.out
│     ├── GatherRecalBamFiles.out
│     ├── HaplotypeCaller
│     ├── HaplotypeCaller.0000-scattered.interval_list.out
│     ├── HaplotypeCaller.0001-scattered.interval_list.out
│     ├── ...
│     ├── ...
│     ├── HaplotypeCaller.0049-scattered.interval_list.out
│     ├── IndexBam.out
│     ├── MergeHaplotypeCaller.out
│     └── SortBam.out
├── rules
│ ├── BaseQualityScoreRecalibration.smk
│ ├── JointGenotyping.smk
│ ├── VEP.smk
│ └── VQSR.smk
├── Snakefile
├── snakemake_cluster_submit_log
│ ├── ApplyBQSR.24720887.out
│ ├── ApplyVQSR_snp.24777265.out
│ ├── BaseRecalibrator.24710227.out
│ ├── CombineGVCFs.24772984.out
│ ├── GatherBQSRReports.24715726.out
│ ├── GatherRecalBamFiles.24722478.out
│ ├── GenotypeGVCFs.24773026.out
│ ├── HaplotypeCaller.24769848.out
│ ├── IndexBam.24768728.out
│ ├── MergeCohortVCFs.24776018.out
│ ├── MergeHaplotypeCaller.24772183.out
│ ├── SelectVariants.24777733.out
│ ├── SortBam.24768066.out
│ ├── VEP.24777739.out
│ ├── VQSR_indel.24776035.out
│ └── VQSR_snp.24776036.out

For each analyzed patient, a seperate directory gets created. Along with the patient specific gvcf file, this directory contains log files for all the processing steps that were performed for that patient (log directory) as well as benchmarks for each rule, e.g. how long the step took or how much CPU/RAM was used (benchmark directory).

The cohort directory contains the multi-sample VCF file, which gets created after performing the joint variant calling. The cohort.recalibrated.vcf.gz is the product of GATKs Variant Quality Score Recalibration. The cohort.recalibrated.pass.vep.vcf.gz is the filtered and VEP annotated version of cohort.recalibrated.vcf.gz (only variants with PASS are kept).

For most applications, the cohort.recalibrated.pass.vep.vcf.gz file, is the file you want to continue working with.

A minimal Python package that produces slice plots through h5m DAGMC geometry files

A minimal Python package that produces slice plots through h5m DAGMC geometry files Installation pip install dagmc_geometry_slice_plotter Python API U

Fusion Energy 4 Dec 02, 2022
A python script to visualise explain plans as a graph using graphviz

README Needs to be improved Prerequisites Need to have graphiz installed on the machine. Refer to https://graphviz.readthedocs.io/en/stable/manual.htm

Edward Mallia 1 Sep 28, 2021
Browse Dash docsets inside emacs

Helm Dash What's it This package uses Dash docsets inside emacs to browse documentation. Here's an article explaining the basic usage of it. It doesn'

504 Dec 15, 2022
A collection of 100 Deep Learning images and visualizations

A collection of Deep Learning images and visualizations. The project has been developed by the AI Summer team and currently contains almost 100 images.

AI Summer 65 Sep 12, 2022
Movie recommendation using RASA, TigerGraph

Demo run: The below video will highlight the runtime of this setup and some sample real-time conversations using the power of RASA + TigerGraph, Steps

Sudha Vijayakumar 3 Sep 10, 2022
Render Jupyter notebook in the terminal

jut - JUpyter notebook Terminal viewer. The command line tool view the IPython/Jupyter notebook in the terminal. Install pip install jut Usage $jut --

Kracekumar 169 Dec 27, 2022
Eulera Dashboard is an easy and intuitive way to get a quick feel of what’s happening on the world’s market.

an easy and intuitive way to get a quick feel of what’s happening on the world’s market ! Eulera dashboard is a tool allows you to monitor historical

Salah Eddine LABIAD 4 Nov 25, 2022
Here I plotted data for the average test scores across schools and class sizes across school districts.

HW_02 Here I plotted data for the average test scores across schools and class sizes across school districts. Average Test Score by Race This graph re

7 Oct 27, 2021
A Python-based non-fungible token (NFT) generator built using Samilla and Matplotlib

PyNFT A Pythonic NF (non-fungible token) generator built using Samilla and Matplotlib Use python pynft.py [amount] The intention behind this generato

Ayush Gundawar 6 Feb 07, 2022
Designed a greedy algorithm based on Markov sequential decision-making process in MATLAB/Python to optimize using Gurobi solver

Designed a greedy algorithm based on Markov sequential decision-making process in MATLAB/Python to optimize using Gurobi solver, the wheel size, gear shifting sequence by modeling drivetrain constrai

Sabbella Prasanna 1 Jan 11, 2022
View part of your screen in grayscale or simulated color vision deficiency.

monolens View part of your screen in grayscale or filtered to simulate color vision deficiency. Watch the demo on YouTube. Install with pip install mo

Hans Dembinski 31 Oct 11, 2022
Set of matplotlib operations that are not trivial

Matplotlib Snippets This repository contains a set of matplotlib operations that are not trivial. Histograms Histogram with bins adapted to log scale

Raphael Meudec 1 Nov 15, 2021
A Bokeh project developed for learning and teaching Bokeh interactive plotting!

Bokeh-Python-Visualization A Bokeh project developed for learning and teaching Bokeh interactive plotting! See my medium blog posts about making bokeh

Will Koehrsen 350 Dec 05, 2022
Sentiment Analysis application created with Python and Dash, hosted at socialsentiment.net

Social Sentiment Dash Application Live-streaming sentiment analysis application created with Python and Dash, hosted at SocialSentiment.net. Dash Tuto

Harrison 456 Dec 25, 2022
Simple Inkscape Scripting

Simple Inkscape Scripting Description In the Inkscape vector-drawing program, how would you go about drawing 100 diamonds, each with a random color an

Scott Pakin 140 Dec 27, 2022
EPViz is a tool to aid researchers in developing, validating, and reporting their predictive modeling outputs.

EPViz (EEG Prediction Visualizer) EPViz is a tool to aid researchers in developing, validating, and reporting their predictive modeling outputs. A lig

Jeff 2 Oct 19, 2022
Generate "Jupiter" plots for circular genomes

jupiter Generate "Jupiter" plots for circular genomes Description Python scripts to generate plots from ViennaRNA output. Written in "pidgin" python w

Robert Edgar 2 Nov 29, 2021
A Python package for caclulations and visualizations in geological sciences.

geo_calcs A Python package for caclulations and visualizations in geological sciences. Free software: MIT license Documentation: https://geo-calcs.rea

Drew Heasman 1 Jul 12, 2022
Generating interfaces(CLI, Qt GUI, Dash web app) from a Python function.

oneFace is a Python library for automatically generating multiple interfaces(CLI, GUI, WebGUI) from a callable Python object. oneFace is an easy way t

NaNg 31 Oct 21, 2022
A guide for using Bootstrap 5 classes in Dash Bootstrap Components V1

dash-bootstrap-cheatsheet This handy interactive cheatsheet makes it easy to use the Bootstrap 5 classes with your Dash app made with the latest versi

10 Dec 22, 2022